Pretraining is the first major training stage where a model learns general patterns from a massive dataset before being adapted to a specific task.
For LLMs, pretraining usually means training a Transformer on large amounts of text using next-token prediction. The result is a base model that is good at modeling language, but not necessarily good at following instructions.
Core Idea
Pretraining teaches broad capability before specialization.
Instead of training a model from scratch for every task, we train one large model on a general objective, then adapt it later with Supervised Fine-Tuning, preference tuning, Reinforcement Learning, or On-Policy Distillation.
The pretrained model learns:
- Grammar and syntax
- Facts and world knowledge
- Coding patterns
- Reasoning patterns
- Style and genre
- Relationships between concepts
LLM Pretraining Objective
Most modern LLMs are trained as Autoregressive language models.
Given a sequence of tokens:
The cat sat on the

The model learns to predict the next token:

mat

The training objective is next-token prediction. The model minimizes the cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Where:
- $x_t$ = current token
- $x_{<t}$ = previous tokens
- $\theta$ = model parameters
- $T$ = sequence length
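The objective can be sketched in a few lines of Python. This is a minimal illustration of one position's loss term: the logits are made-up numbers standing in for a model's output scores, where a real Transformer would compute them from the preceding context.

```python
import math

# Minimal sketch of next-token cross-entropy at one position.
# The logits are invented for illustration; a real Transformer
# computes them from the preceding context.
vocab = ["mat", "floor", "roof"]
logits = [2.0, 1.3, 0.2]  # hypothetical scores for the next token

# Softmax turns logits into the distribution p_theta(x_t | x_<t).
z = sum(math.exp(l) for l in logits)
probs = {tok: math.exp(l) / z for tok, l in zip(vocab, logits)}

target = "mat"                   # the actual next token in the training text
loss = -math.log(probs[target])  # cross-entropy contribution at this position
print(f"p(mat) = {probs[target]:.3f}, loss = {loss:.3f}")
```

Summing this term over every position in every training sequence gives the full loss that gradient descent minimizes.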
Why It Works
Next-token prediction seems simple, but doing it well requires learning a lot about the structure of the world.
To predict text accurately, the model has to infer:
- What topic is being discussed
- What facts are relevant
- What style or format should come next
- What reasoning step follows from previous steps
- What code or math expression is syntactically valid
This is why a model trained only to predict text can still develop useful general abilities.
Data
Pretraining data is usually large-scale and diverse.
Common sources include:
- Web pages
- Books
- Code
- Academic papers
- Q&A forums
- Documentation
- Multilingual text
The data is filtered, deduplicated, tokenized, and mixed into a training corpus.
Data quality matters because the model learns the patterns in the corpus. If the corpus contains spam, low-quality text, duplicated content, or biased examples, the model can absorb those patterns.
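The filtering and deduplication steps above can be sketched as follows. This is a toy pipeline stage under simplified assumptions: real pipelines also use fuzzy deduplication (e.g. MinHash), language identification, and learned quality classifiers, and the `min_words` threshold here is arbitrary.

```python
import hashlib

# Toy sketch of one pretraining-data cleaning stage:
# a crude length filter plus exact deduplication by content hash.
def clean_corpus(documents, min_words=5):
    seen = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text.split()) < min_words:  # crude quality filter
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                 # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "The cat sat on the mat and purred quietly.",
    "The cat sat on the mat and purred quietly.",  # duplicate
    "buy now!!!",                                  # too short / spammy
]
print(clean_corpus(docs))  # keeps only the first document
```

Deduplication matters directly for the failure modes below: repeated documents are the ones a model is most likely to memorize verbatim.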
Base Model vs Chat Model
A pretrained base model predicts likely text continuations.
Example prompt:
Explain photosynthesis:

A base model might continue with an encyclopedia paragraph, a quiz answer, a dialogue, or random web-style text, depending on its learned distribution.
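The variability comes from sampling: generation draws from the model's learned distribution over continuations. The sketch below makes this concrete with invented labels and probabilities; a real model samples token by token from a large vocabulary, not whole styles at once.

```python
import random

# Sketch of why a base model's continuation varies from run to run:
# generation samples from a learned distribution. The continuation
# labels and probabilities are invented for illustration only.
random.seed(0)
continuations = ["encyclopedia paragraph", "quiz answer", "dialogue turn"]
probs = [0.5, 0.3, 0.2]  # hypothetical learned probabilities

pick = random.choices(continuations, weights=probs, k=1)[0]
print(pick)  # which style appears depends on the sampled draw
```

With a different seed (or no seed), the same prompt can yield a different style of continuation, which is exactly the unpredictability that alignment training reduces.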
A chat model is usually a pretrained base model that has gone through Supervised Fine-Tuning and alignment training so it reliably responds as an assistant.
Pretraining vs SFT
Pretraining:
- Uses huge mostly unlabeled datasets
- Learns general language modeling
- Requires the most compute
- Produces a base model
SFT:
- Uses smaller curated instruction-response datasets
- Teaches a specific behavior
- Requires much less compute
- Produces an instruction-following model
Scaling
Pretraining performance generally improves with:
- More parameters
- More high-quality tokens
- More compute
- Better data mixtures
- Better architecture and optimization
This is the idea behind scaling laws: model performance often follows predictable trends as compute, data, and parameter count increase.
However, scaling is not magic. Bad data, unstable optimization, poor evaluation, or weak post-training can still produce a model that performs poorly in practice.
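One common way to express these predictable trends is a power law relating loss to parameter count, roughly $L(N) \approx (N_c / N)^{\alpha}$. The sketch below uses placeholder constants chosen for demonstration, not fitted values for any real model family.

```python
# Illustrative power-law scaling of loss with parameter count N:
#   L(N) = (N_c / N) ** alpha
# N_c and alpha are placeholder constants, not real fitted values.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The useful property is the trend, not the exact numbers: each 10x increase in parameters shaves a predictable fraction off the loss, which is what lets labs forecast model quality before committing compute.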
Failure Modes
- Memorizing duplicated or sensitive training data
- Learning biases from the dataset
- Producing plausible but false continuations
- Weak instruction-following before SFT
- High compute and infrastructure cost
- Data contamination in benchmarks
- Poor performance on domains missing from the corpus
Practical Notes
- Pretraining is usually the most expensive part of building an LLM
- The base model’s capabilities strongly limit what post-training can recover
- Better data can be more valuable than simply adding more data
- Evaluation should include downstream tasks, not just validation loss
- Post-training changes behavior, but pretraining creates most of the raw capability