What an LLM Actually Is (and What It Isn't)
What a large language model really is: next-token prediction, tokens, training vs inference, and what LLMs are good and bad at — with a live demo.

Strip away the chat bubbles and the hype and a large language model does exactly one thing: it reads some text and guesses the next chunk of text. Then it does that again. And again. That's it. Everything you've seen an LLM do — write code, summarize a contract, answer a question, sound spookily human — falls out of that one move repeated thousands of times.
If you can hold that picture in your head, the rest of this series will make sense instead of feeling like magic. This first lesson builds the mental model. No math, no neural-network internals — just enough of how the thing works that you'll know when to trust it and when not to. By lesson 2 you'll be making real API calls in Python.
It's autocomplete, scaled up absurdly
You've used autocomplete. You type "see you to" and your phone suggests "tomorrow," because it learned that pattern from millions of messages. An LLM is the same idea, trained on a huge slice of the internet, books, and code, with a model big enough to track patterns your phone's keyboard could never dream of.
Give it The capital of Japan is and it predicts Tokyo — not because it looked anything up, but because in everything it read, that sentence almost always ended that way. Give it a half-written Python function and it predicts the rest, because it has seen a staggering amount of Python. The model doesn't "know" Tokyo is a city or that your function sorts a list. It knows what text tends to come next.
Here's the part that surprises people. This one trick — predict the next bit of text really well — is enough to produce things that look like reasoning, translation, and writing. You don't program in rules for each. They emerge from getting next-token prediction good enough on a big enough pile of text.
The mental model to keep
An LLM is a function: text in, a guess at the next token out. A chat reply is that function called over and over, each new token fed back in, until it predicts a "stop" signal. No database lookup, no live web search, no little person inside. Just prediction, on repeat.
That loop — predict a token, append it, feed the whole thing back in — is the entire engine:
Tokens: the unit it actually thinks in
I keep saying "next chunk of text," not "next word." That's deliberate. An LLM doesn't see letters or words — it sees tokens. A token is a piece of text, usually a common word or a fragment of one. cat might be one token. unbelievable might split into un, believ, able. Spaces and punctuation count too.
Before any text reaches the model, a tokenizer chops it into these pieces and turns each into a number, because the model only does math on numbers. The output comes back as token numbers, which get turned back into text for you. Play with it — type anything and watch it split:
Tokenizer
A teaching approximation — real models use a learned merge table, so exact counts differ. The pattern is real: short common words are one token, long words split into pieces, and English averages roughly four characters per token.
A loose rule of thumb for English: one token is about 4 characters, or roughly ¾ of a word. So 1,000 tokens is around 750 words. That number matters more than it looks, because tokens are the unit you get billed in and the unit the model's memory is measured in. We'll come back to both in the cost lesson.
Why 'count the r's in strawberry' trips them up
Tokens explain a famous LLM failure. Ask a model how many times the letter "r" appears in "strawberry" and it often gets it wrong. It never saw the individual letters — it saw a couple of tokens like straw and berry. Counting characters means reasoning about something it can't directly see. Same reason it's shaky at exact math: digits get chopped into tokens too.
Training vs inference: learned once, used forever
Two phases, and people mix them up constantly.
Training is the expensive, one-time part. The model reads its enormous pile of text and, billions of times over, adjusts its internal numbers (its weights) to predict the next token better. Weeks, racks of expensive hardware, a lot of electricity. When it's done you've got a finished model — a giant fixed set of numbers. That's the "GPT-4o" or "Llama" you hear about: a frozen snapshot of what it learned.
Inference is what happens every time you use it. You send text, the model runs it through those frozen weights, out comes a prediction. Fast and comparatively cheap. Crucially, inference doesn't change the model. Your chat teaches it nothing. Close the tab and it remembers nothing — the next conversation starts blank.
This one fact clears up a lot of confusion:
- A model has a knowledge cutoff — it only knows what was in its training data, frozen at some date. Ask about something that happened last week and it's guessing, or making things up, unless you hand it that info yourself.
- It can't "look something up" on its own. No live internet, no private files. (We give it those powers later with tool calling and RAG — that's most of this series.)
- "Fine-tuning" is more training, done separately and deliberately. Normal chatting is pure inference.
Quick check
During a normal chat, does the LLM learn from what you tell it?
The context window, in one line
If the model doesn't remember between sessions, how does it follow a conversation? Every turn, the app re-sends the whole chat so far as input. The context window is the cap on how much text (in tokens) it can take in at once — its working memory for this one call. Run past it and the oldest parts get dropped. That's the whole story for now; lesson 3 builds the message format that fills that window.
What LLMs are genuinely good at
Once you see it as a pattern-completing text engine, its strengths stop being surprising. It's great at jobs that are fundamentally text in, text out, with no single objectively-correct answer to nail:
- Drafting. Emails, outlines, first passes at docs, boilerplate code. A rough draft beats a blank page.
- Transforming. Rewrite this formally. Translate to Spanish. Convert this JSON to a CSV. It's excellent at reshaping text you already have.
- Summarizing. Give it a long article, ask for the gist. Good at this as long as you give it the source — don't ask it to summarize something it can't see.
- Classifying and extracting. "Is this review positive or negative?" "Pull the dates and names out of this email." Reliable, fast, cheap at scale.
- Code. It has read an enormous amount of code. Explaining a snippet, writing a function from a description, spotting a bug, writing tests — genuinely useful, with a review.
The thread tying these together: the model brings language fluency, you bring the facts and the judgment. That division of labor is how you build things that actually work.
What they're bad at (and where they lie)
The failures are just as predictable, and ignoring them is how people ship embarrassing bugs:
- Exact facts. It'll state a wrong date, a fake citation, or a made-up API method with total confidence. It's optimizing for plausible-sounding, not true. Plausible and true usually overlap — until they don't.
- Fresh information. Anything after its training cutoff, or anything private to you, it simply doesn't have. Unless you feed it in.
- Math and counting. It predicts what an answer looks like, which is not the same as computing it. For real arithmetic, give it a calculator (a tool — lesson 7) instead of trusting its head.
- Knowing what it doesn't know. This is the big one. A model has no built-in sense of its own uncertainty. Asked something it has no clue about, it produces a confident, fluent, completely fabricated answer. That's a hallucination, and it's not a bug you can patch away — it's a direct consequence of "always predict the most plausible next token."
Hallucination is the default failure mode
An LLM will never say "I don't have that one" unless something nudges it to. It would rather invent a plausible answer than admit a gap. So: never trust a factual claim, a quote, a statistic, or a code API from an LLM without checking it against a real source. Treat its output like a confident, fast, occasionally-wrong intern's first draft — useful, but you sign off on it. Half of this series is about engineering around exactly this: grounding answers in real data, validating output, and giving the model tools so it stops guessing.
You don't need the ML math for this
Here's the good news if "neural network" makes you nervous: you don't need any of it to build with LLMs. You won't compute a gradient or touch linear algebra. From here on, the model is a service you send text to and get text back from, over a normal HTTP API. The skill is using it well — clear prompts, structured output, grounding it in real data, wiring it into a program.
What you do need is Python. If you're comfortable calling functions and handling errors, you're ready. Not there yet? Start with Python for Beginners and come back, or skim where Python goes next to see how this fits in. Two optional external reads if you want more depth than this series goes into: the Wikipedia overview of large language models for the broad map, and Jay Alammar's Illustrated Transformer for a famously clear, visual look at the architecture underneath.
The takeaway
An LLM is a very good next-token predictor — autocomplete trained on an enormous amount of text. It thinks in tokens, not letters. It learned everything during training and changes nothing while you use it. It shines at drafting, transforming, summarizing, classifying, and code, and it falls down on exact facts, fresh info, math, and admitting uncertainty. Keep that honest picture and you'll build things that hold up; forget it and you'll ship a confident lie.
Enough theory. Next we get our hands on a real model: make your first LLM API call in Python — keys, the client setup we'll reuse all series, and your first response in about ten lines of code.

Written by
Rhythm Bhiwani
Engineer and relentless builder, happiest reverse-engineering hard problems until they click.
Enjoyed this?
Tap the heart to leave some love.
Be the first to react
Comments
Join the conversation.
Loading comments…

