The One-Sentence Explanation
A Large Language Model is a computer program trained on billions of examples of text that has learned, extremely well, to predict: "Given this sequence of words, what word is most likely to come next?"
That's it. Everything ChatGPT, Claude, and Gemini do (answering questions, writing code, explaining concepts, translating languages) emerges from doing this one thing very, very well, at massive scale.
What's a Token?
LLMs don't read words; they read tokens. A token is roughly a word-chunk. Common words ("the", "is", "cat") are usually one token. Longer or rare words get split: "networking" might be two tokens ("network" + "ing"). Punctuation is often its own token.
Why does this matter? Because LLMs have a context window: a maximum number of tokens they can "see" at once. An older model like GPT-3.5 had ~4,000 tokens (~3,000 words). Modern models handle 100,000+ tokens, essentially entire books.
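The two ideas above can be sketched in a few lines. The vocabulary and the greedy longest-match rule below are invented for this toy; real LLMs learn subword vocabularies (such as BPE) from data, but the effect is the same: common words stay whole, rarer words split into chunks, and the model can only see a fixed number of tokens at once.

```python
# Toy tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# Real tokenizers (e.g. BPE) learn their vocabularies from data.
VOCAB = {"the", "cat", "is", "net", "work", "network", "ing", " ", "."}

def tokenize(text, vocab=VOCAB):
    """Split `text` into the longest known chunks, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

def fits_in_context(tokens, context_window=8):
    """A model can only attend to `context_window` tokens at once."""
    return len(tokens) <= context_window

tokens = tokenize("the cat is networking.")
print(tokens)  # common words stay whole; "networking" splits into chunks
print(fits_in_context(tokens))  # False: 9 tokens exceed the 8-token window
```

Note that "networking" splits into "network" + "ing" because the greedy rule prefers the longest chunk, which mirrors how real subword tokenizers favor frequent long merges.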
How Does Training Work?
Imagine giving someone every book, article, website, and forum post ever written and asking them to memorize all the patterns. That's roughly what LLM pre-training does:
- Gather data: Trillions of tokens from the internet, books, code repositories
- Feed it in batches: Show the model a sentence, hide the last word
- Make a prediction: The model guesses what word comes next
- Measure the error: How wrong was the prediction?
- Adjust the weights: Nudge billions of internal numbers to make a slightly better prediction next time
- Repeat: Billions of times, on thousands of GPUs, for weeks or months
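The loop above can be shown in miniature. This sketch trains a bigram model (one weight per previous-token/next-token pair) by gradient descent on next-token prediction; the corpus, learning rate, and lookup-table "architecture" are invented for the example. Real pre-training runs the same predict/measure/adjust cycle with a transformer and billions of weights.

```python
import math

# Miniature pre-training loop: predict the next token, measure the
# error, nudge the weights, repeat. Here the "model" is just a table
# of logits indexed by (previous token, next token).
corpus = "the cat sat on the mat the cat sat".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

W = [[0.0] * V for _ in range(V)]  # the model's weights

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

lr = 0.5
for step in range(200):
    for prev, nxt in zip(corpus, corpus[1:]):
        p = softmax(W[idx[prev]])        # 1. make a prediction
        loss = -math.log(p[idx[nxt]])    # 2. measure the error (cross-entropy)
        for j in range(V):               # 3. adjust the weights
            grad = p[j] - (1.0 if j == idx[nxt] else 0.0)
            W[idx[prev]][j] -= lr * grad

def predict_next(word):
    p = softmax(W[idx[word]])
    return vocab[max(range(V), key=p.__getitem__)]

print(predict_next("the"))  # "cat": the most frequent word after "the"
```

After training, the weight table has absorbed the corpus's statistics: "the" is followed by "cat" twice and "mat" once, so the model predicts "cat". An LLM does the same thing, just with patterns far subtler than bigram counts.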
After training, the model has compressed patterns from all that text into billions of parameters (numerical weights). GPT-4 reportedly has around 1.8 trillion parameters.
What Makes It "Large"?
The "large" in LLM refers to the number of parameters. Research showed that beyond a certain scale, models don't just get better at predicting text; they develop emergent capabilities that nobody explicitly programmed:
- Solving multi-step math problems
- Writing working code from a description
- Reasoning about hypothetical scenarios
- Translating between languages it wasn't explicitly trained on
This emergence was surprising even to researchers. It suggests that "language prediction at scale" captures something deeper about knowledge and reasoning.
The Transformer: The Architecture Behind Everything
The breakthrough that made LLMs possible was the Transformer architecture, introduced in a 2017 Google paper titled "Attention Is All You Need." Before transformers, sequential models (RNNs, LSTMs) struggled with long contexts. Transformers solve this with self-attention.
Self-attention lets every word in a sentence look at every other word simultaneously and decide: "How relevant is each other word to understanding me?" When the model reads "The network crashed because it was overloaded", attention helps "it" correctly identify "network" as its referent, even if the sentence is 100 words long.
This parallel processing also makes transformers highly efficient on modern GPUs, enabling the scale that makes LLMs work.
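A minimal sketch of one self-attention head, in plain Python: each position's query is scored against every position's key, the softmaxed scores say "how relevant is each other token to me", and the output is the score-weighted average of the value vectors. The toy vectors below are invented; real transformers learn the query/key/value projections that produce them.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(Q, K, V):
    """Scaled dot-product attention for one head (no learned projections)."""
    d = len(K[0])  # key dimension, used to scale the dot products
    out = []
    for q in Q:  # every token...
        scores = [dot(q, k) / math.sqrt(d) for k in K]  # ...looks at every token
        w = softmax(scores)
        # Output for this token = attention-weighted average of all values.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# Three toy 2-dimensional token vectors. Tokens 0 and 2 point in similar
# directions, so token 0's output is pulled mostly toward tokens 0 and 2.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
out = self_attention(Q, K, V)
print(out)
```

Because every token attends to every other token in one pass, the whole computation is a batch of matrix multiplications, which is exactly the workload GPUs are built for.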
Fine-Tuning: From Text Predictor to Helpful Assistant
A pre-trained model is like a brilliant person who has read everything but doesn't know how to have a conversation. Fine-tuning transforms it into a useful assistant. The key technique is RLHF (Reinforcement Learning from Human Feedback):
- Human trainers rate different model responses for quality, safety, helpfulness
- A "reward model" learns what humans prefer
- The LLM is fine-tuned to maximize reward model scores
This is why Claude, ChatGPT, and Gemini feel like assistants rather than raw text predictors: their underlying capabilities come from pre-training, but their behavior comes from fine-tuning.
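The reward-model step can be sketched with the Bradley-Terry preference model commonly used in RLHF: the reward model assigns each response a scalar score, and the probability that a human prefers response A over B is the sigmoid of the score difference. The scores below are invented for the example; a real reward model is itself a trained neural network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(score_a, score_b):
    """P(human prefers A over B) under the Bradley-Terry model,
    given reward-model scores for the two responses."""
    return sigmoid(score_a - score_b)

# Hypothetical reward-model scores for two candidate answers.
helpful_answer = 2.0
evasive_answer = -1.0

p = preference_probability(helpful_answer, evasive_answer)
print(round(p, 3))  # 0.953: the helpful answer is strongly preferred
```

Training the reward model means adjusting the scores so these predicted preferences match the human trainers' actual ratings; fine-tuning the LLM then means nudging it toward responses that score highly.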
Common Misconceptions
"LLMs search the internet." No: they generate responses from patterns stored in their weights. Their knowledge has a training cutoff date. Some models have web search tools added on top, but that's separate from the LLM itself.
"They actually understand what they're saying." This is philosophically contested. LLMs are extremely good at pattern matching and generating contextually appropriate text; whether that constitutes "understanding" is a deep question without a settled answer.
"Bigger is always better." Not anymore. Smaller, well-trained models (like Llama 3.2) often outperform much larger older models. Architecture, data quality, and fine-tuning matter as much as raw parameter count.
Key Takeaways
- LLMs predict the next token in a sequence; all other capabilities emerge from doing this at scale
- The Transformer architecture (self-attention) is what makes modern LLMs possible
- Pre-training teaches patterns; fine-tuning (RLHF) teaches behavior
- Context window = how much the model can "see" at once
- Emergent capabilities appear at scale and weren't explicitly programmed