Pretraining is the first major training stage where a model learns general patterns from a massive dataset before being adapted to a specific task.
For LLMs, pretraining usually means training a Transformer on large amounts of text using next-token prediction. The result is a base model that is good at modeling language, but not necessarily good at following instructions.
Core Idea
Pretraining teaches broad capability before specialization.
Instead of training a model from scratch for every task, we train one large model on a general objective, then adapt it later with Supervised Fine-Tuning, preference tuning, Reinforcement Learning, or On-Policy Distillation.
The pretrained model learns:
- Grammar and syntax
- Facts and world knowledge
- Coding patterns
- Reasoning patterns
- Style and genre
- Relationships between concepts
LLM Pretraining Objective
Most modern LLMs are trained as Autoregressive language models.
Given a sequence of tokens:
The cat sat on the

The model learns to predict the next token:

mat

The training objective is next-token prediction. The model minimizes the cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Where:
- $x_t$ = current token
- $x_{<t}$ = previous tokens
- $\theta$ = model parameters
- $T$ = sequence length
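The objective can be sketched in a few lines of Python. This is a minimal illustration of one position's loss term: the logits are made-up numbers standing in for a model's output scores, where a real Transformer would compute them from the preceding context.

```python
import math

# Minimal sketch of next-token cross-entropy at one position.
# The logits are invented for illustration; a real Transformer
# computes them from the preceding context.
vocab = ["mat", "floor", "roof"]
logits = [2.0, 1.3, 0.2]  # hypothetical scores for the next token

# Softmax turns logits into the distribution p_theta(x_t | x_<t).
z = sum(math.exp(l) for l in logits)
probs = {tok: math.exp(l) / z for tok, l in zip(vocab, logits)}

target = "mat"                   # the actual next token in the training text
loss = -math.log(probs[target])  # cross-entropy contribution at this position
print(f"p(mat) = {probs[target]:.3f}, loss = {loss:.3f}")
```

Summing this term over every position in every training sequence gives the full loss that gradient descent minimizes.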
Why It Works
Next-token prediction seems simple, but doing it well requires learning a lot about the structure of the world.
To predict text accurately, the model has to infer:
- What topic is being discussed
- What facts are relevant
- What style or format should come next
- What reasoning step follows from previous steps
- What code or math expression is syntactically valid
This is why a model trained only to predict text can still develop useful general abilities.
Data
Pretraining data is usually large-scale and diverse.
Common sources include:
- Web pages
- Books
- Code
- Academic papers
- Q&A forums
- Documentation
- Multilingual text
The data is filtered, deduplicated, tokenized, and mixed into a training corpus.
Data quality matters because the model learns the patterns in the corpus. If the corpus contains spam, low-quality text, duplicated content, or biased examples, the model can absorb those patterns.
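The filtering and deduplication steps above can be sketched as follows. This is a toy pipeline stage under simplified assumptions: real pipelines also use fuzzy deduplication (e.g. MinHash), language identification, and learned quality classifiers, and the `min_words` threshold here is arbitrary.

```python
import hashlib

# Toy sketch of one pretraining-data cleaning stage:
# a crude length filter plus exact deduplication by content hash.
def clean_corpus(documents, min_words=5):
    seen = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text.split()) < min_words:  # crude quality filter
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                 # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "The cat sat on the mat and purred quietly.",
    "The cat sat on the mat and purred quietly.",  # duplicate
    "buy now!!!",                                  # too short / spammy
]
print(clean_corpus(docs))  # keeps only the first document
```

Deduplication matters directly for the failure modes below: repeated documents are the ones a model is most likely to memorize verbatim.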
Base Model vs Chat Model
A pretrained base model predicts likely text continuations.
Example prompt:
Explain photosynthesis:

A base model might continue with an encyclopedia paragraph, a quiz answer, a dialogue, or random web-style text, depending on its learned distribution.
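The variability comes from sampling: generation draws from the model's learned distribution over continuations. The sketch below makes this concrete with invented labels and probabilities; a real model samples token by token from a large vocabulary, not whole styles at once.

```python
import random

# Sketch of why a base model's continuation varies from run to run:
# generation samples from a learned distribution. The continuation
# labels and probabilities are invented for illustration only.
random.seed(0)
continuations = ["encyclopedia paragraph", "quiz answer", "dialogue turn"]
probs = [0.5, 0.3, 0.2]  # hypothetical learned probabilities

pick = random.choices(continuations, weights=probs, k=1)[0]
print(pick)  # which style appears depends on the sampled draw
```

With a different seed (or no seed), the same prompt can yield a different style of continuation, which is exactly the unpredictability that alignment training reduces.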
A chat model is usually a pretrained base model that has gone through Supervised Fine-Tuning and alignment training so it reliably responds as an assistant.
Pretraining vs SFT
Pretraining:
- Uses huge mostly unlabeled datasets
- Learns general language modeling
- Requires the most compute
- Produces a base model
SFT:
- Uses smaller curated instruction-response datasets
- Teaches a specific behavior
- Requires much less compute
- Produces an instruction-following model
Scaling
Pretraining performance generally improves with:
- More parameters
- More high-quality tokens
- More compute
- Better data mixtures
- Better architecture and optimization
This is the idea behind scaling laws: model performance often follows predictable trends as compute, data, and parameter count increase.
However, scaling is not magic. Bad data, unstable optimization, poor evaluation, or weak post-training can still produce a model that performs poorly in practice.
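One common way to express these predictable trends is a power law relating loss to parameter count, roughly $L(N) \approx (N_c / N)^{\alpha}$. The sketch below uses placeholder constants chosen for demonstration, not fitted values for any real model family.

```python
# Illustrative power-law scaling of loss with parameter count N:
#   L(N) = (N_c / N) ** alpha
# N_c and alpha are placeholder constants, not real fitted values.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The useful property is the trend, not the exact numbers: each 10x increase in parameters shaves a predictable fraction off the loss, which is what lets labs forecast model quality before committing compute.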
Failure Modes
- Memorizing duplicated or sensitive training data
- Learning biases from the dataset
- Producing plausible but false continuations
- Weak instruction-following before SFT
- High compute and infrastructure cost
- Data contamination in benchmarks
- Poor performance on domains missing from the corpus
Practical Notes
- Pretraining is usually the most expensive part of building an LLM
- The base model’s capabilities strongly limit what post-training can recover
- Better data can be more valuable than simply adding more data
- Evaluation should include downstream tasks, not just validation loss
- Post-training changes behavior, but pretraining creates most of the raw capability