Pretraining is the first major training stage where a model learns general patterns from a massive dataset before being adapted to a specific task.

For LLMs, pretraining usually means training a Transformer on large amounts of text using next-token prediction. The result is a base model that is good at modeling language, but not necessarily good at following instructions.

Core Idea

Pretraining teaches broad capability before specialization.

Instead of training a model from scratch for every task, we train one large model on a general objective, then adapt it later with Supervised Fine-Tuning, preference tuning, Reinforcement Learning, or On-Policy Distillation.

The pretrained model learns:

  • Grammar and syntax
  • Facts and world knowledge
  • Coding patterns
  • Reasoning patterns
  • Style and genre
  • Relationships between concepts

LLM Pretraining Objective

Most modern LLMs are trained as autoregressive language models.

Given a sequence of tokens:

The cat sat on the

The model learns to predict the next token:

mat
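The input/target relationship above can be sketched by shifting a token sequence by one position (a minimal illustration; the token IDs below are made up for the example):

```python
# Hypothetical token IDs for "The cat sat on the mat".
tokens = [464, 3797, 3332, 319, 262, 2603]

# For next-token prediction, the inputs are the sequence and the targets
# are the same sequence shifted left by one position.
inputs = tokens[:-1]   # all tokens except the last
targets = tokens[1:]   # all tokens except the first

# At each position t, the model conditions on inputs[:t+1] and is trained
# to predict targets[t].
for t in range(len(inputs)):
    print(f"context={inputs[:t + 1]} -> target={targets[t]}")
```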

The training objective is next-token prediction. The model minimizes the cross-entropy loss:

L(θ) = −(1/T) Σ_{t=1}^{T} log p_θ(x_t | x_{<t})

Where:

  • x_t = current token
  • x_{<t} = previous tokens
  • θ = model parameters
  • T = sequence length
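The loss can be illustrated numerically with a toy next-token distribution in place of a real model (all probabilities below are invented for the example):

```python
import math

# Toy next-token probabilities over a 4-token vocabulary at each of
# three positions (each row sums to 1). Invented for illustration.
probs = [
    [0.10, 0.70, 0.10, 0.10],  # position 1
    [0.20, 0.20, 0.50, 0.10],  # position 2
    [0.05, 0.05, 0.10, 0.80],  # position 3
]
targets = [1, 2, 3]  # index of the correct next token at each position

# Cross-entropy: average negative log-probability of the correct token.
loss = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
print(round(loss, 4))
```

Assigning higher probability to the correct tokens drives the loss toward zero; confident wrong predictions make it large.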

Why It Works

Next-token prediction seems simple, but doing it well requires learning a lot about the structure of the world.

To predict text accurately, the model has to infer:

  • What topic is being discussed
  • What facts are relevant
  • What style or format should come next
  • What reasoning step follows from previous steps
  • What code or math expression is syntactically valid

This is why a model trained only to predict text can still develop useful general abilities.

Data

Pretraining data is usually large-scale and diverse.

Common sources include:

  • Web pages
  • Books
  • Code
  • Academic papers
  • Q&A forums
  • Documentation
  • Multilingual text

The data is filtered, deduplicated, tokenized, and mixed into a training corpus.
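One of these preprocessing steps, exact deduplication, can be sketched with simple content hashing (a minimal illustration; production pipelines typically also apply near-duplicate detection such as MinHash):

```python
import hashlib

documents = [
    "The cat sat on the mat.",
    "Photosynthesis converts light into chemical energy.",
    "The cat sat on the mat.",  # exact duplicate
]

seen = set()
deduped = []
for doc in documents:
    # Hash lightly normalized text so identical documents collapse
    # to a single entry.
    digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
    if digest not in seen:
        seen.add(digest)
        deduped.append(doc)

print(len(deduped))  # 2
```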

Data quality matters because the model learns the patterns in the corpus. If the corpus contains spam, low-quality text, duplicated content, or biased examples, the model can absorb those patterns.

Base Model vs Chat Model

A pretrained base model predicts likely text continuations.

Example prompt:

Explain photosynthesis:

A base model might continue with an encyclopedia paragraph, a quiz answer, a dialogue, or random web-style text depending on its learned distribution.

A chat model is usually a pretrained base model that has gone through Supervised Fine-Tuning and alignment training so it reliably responds as an assistant.

Pretraining vs SFT

Pretraining:

  • Uses huge mostly unlabeled datasets
  • Learns general language modeling
  • Requires the most compute
  • Produces a base model

Supervised Fine-Tuning:

  • Uses smaller curated instruction-response datasets
  • Teaches a specific behavior
  • Requires much less compute
  • Produces an instruction-following model

Scaling

Pretraining performance generally improves with:

  • More parameters
  • More high-quality tokens
  • More compute
  • Better data mixtures
  • Better architecture and optimization

This is the idea behind scaling laws: model performance often follows predictable trends as compute, data, and parameter count increase.
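The trend can be illustrated with a toy power-law fit of loss against parameter count (the constants below are invented for the sketch; real coefficients come from empirical scaling studies):

```python
# Toy scaling law: loss(N) = a * N**(-alpha) + irreducible_loss.
# a, alpha, and irreducible_loss are made-up illustrative constants.
a, alpha, irreducible_loss = 20.0, 0.08, 1.7

def predicted_loss(n_params: float) -> float:
    # Loss falls as a power law in parameter count, approaching an
    # irreducible floor as N grows.
    return a * n_params ** (-alpha) + irreducible_loss

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

Each tenfold increase in parameters yields a smaller absolute improvement, which is why the next point on a scaling curve is predictable but increasingly expensive.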

However, scaling is not magic. Bad data, unstable optimization, poor evaluation, or weak post-training can still produce a model that performs poorly in practice.

Failure Modes

  • Memorizing duplicated or sensitive training data
  • Learning biases from the dataset
  • Producing plausible but false continuations
  • Weak instruction-following before SFT
  • High compute and infrastructure cost
  • Data contamination in benchmarks
  • Poor performance on domains missing from the corpus

Practical Notes

  • Pretraining is usually the most expensive part of building an LLM
  • The base model’s capabilities set a ceiling on what post-training can achieve
  • Better data can be more valuable than simply adding more data
  • Evaluation should include downstream tasks, not just validation loss
  • Post-training changes behavior, but pretraining creates most of the raw capability