Supervised Fine-Tuning (SFT) is the process of taking a pretrained model and continuing to train it on labeled examples of the behavior you want.
For LLMs, SFT usually means training on instruction-response pairs so the model learns to follow instructions, answer in a useful format, and behave more like an assistant than a raw text-completion model.
Core Idea
Pretraining teaches a model broad language patterns from massive text corpora.
SFT teaches the model a specific input-output behavior.
Example:
Instruction: Explain backpropagation in simple terms.
Response: Backpropagation is how a neural network learns from its mistakes...
The model is trained to predict the response tokens given the instruction and the previous response tokens.
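As a minimal sketch, the pair above can be concatenated into a single training sequence with a prompt template. The template here is just an illustration; real chat models use their own tokenizer-specific templates.

```python
def build_training_text(instruction: str, response: str) -> str:
    """Concatenate an instruction and its target response into one training
    sequence using a simple plain-text template (illustrative only)."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

example = build_training_text(
    "Explain backpropagation in simple terms.",
    "Backpropagation is how a neural network learns from its mistakes...",
)
print(example)
```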
Where It Fits
Typical LLM training pipeline:
- Pretraining
- Supervised Fine-Tuning
- Preference tuning or Reinforcement Learning
- On-Policy Distillation or model consolidation
SFT is often the first post-training stage because it gives the model a basic instruction-following policy before more advanced optimization methods are used.
Objective
SFT is usually trained with next-token prediction, the same basic objective as pretraining.
The difference is the data distribution: instead of random internet text, the data is curated instruction-response examples.
Given an instruction $x$ and a target response $y = (y_1, \dots, y_T)$, the model is trained to maximize the likelihood of the response tokens given the instruction.
Equivalently, it minimizes the cross-entropy loss over the target response tokens:

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
$$

Where:
- $x$ = instruction or prompt
- $y$ = target response
- $y_t$ = current target token
- $y_{<t}$ = previous response tokens
- $\theta$ = model parameters
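A minimal PyTorch sketch of this loss, assuming a Hugging Face-style causal LM that returns `.logits` and prompt/response token IDs that are already tokenized. Prompt positions are set to -100 so only response tokens contribute to the cross-entropy.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Cross-entropy over response tokens only (a sketch; `model` is any
    causal LM whose forward pass returns logits of shape
    [batch, seq_len, vocab_size])."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100  # no loss on the prompt tokens
    logits = model(input_ids).logits
    # Shift so each position predicts the next token.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```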
Data Format
SFT datasets usually contain examples like:
- Instruction only
- Instruction + input context
- Desired response
- Multi-turn conversation
- Tool-use trace
- Chain-of-thought or reasoning trace, when appropriate
For chat models, data is often formatted into roles:
system: You are a helpful assistant.
user: Summarize this article.
assistant: ...
The model is usually trained to predict only the assistant tokens, not the user or system tokens.
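A sketch of how that masking is often implemented with a Hugging Face tokenizer's chat template. The model name is illustrative, and the prefix-masking trick below is a simplification; exact turn boundaries depend on the template.

```python
from transformers import AutoTokenizer

# Illustrative model name; any chat model with a chat template works similarly.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this article."},
    {"role": "assistant", "content": "..."},
]

# Token IDs for the full conversation and for everything before the
# assistant reply (the part we do not want to train on).
full_ids = tokenizer.apply_chat_template(messages, tokenize=True)
prompt_ids = tokenizer.apply_chat_template(
    messages[:-1], tokenize=True, add_generation_prompt=True
)

# Labels: -100 masks the system/user tokens so loss is computed only on
# the assistant tokens. Assumes prompt_ids is a prefix of full_ids, which
# holds for most templates but is worth verifying for yours.
labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
```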
Why It Works
Pretrained models already know a huge amount about language, facts, coding, and reasoning patterns.
SFT does not teach everything from scratch. It nudges the model toward a desired behavior distribution:
- Follow instructions
- Answer in the expected style
- Refuse unsafe requests
- Use domain-specific terminology
- Produce structured outputs
- Match a product’s tone or workflow
This makes SFT much cheaper than pretraining because it updates an already capable model with a smaller, higher-quality dataset.
SFT vs Pretraining
Pretraining:
- Uses massive unlabeled text
- Learns general language modeling
- Optimizes broad next-token prediction
- Produces a base model
SFT:
- Uses curated labeled examples
- Learns target behavior
- Still uses next-token prediction
- Produces an instruction-following or domain-adapted model
SFT vs Preference Tuning
SFT teaches the model what a good answer looks like.
Preference tuning teaches the model which answer is better when multiple answers are possible.
SFT data:
Prompt -> Ideal response
Preference data:
Prompt -> Response A vs Response B -> Preferred response
SFT is usually simpler and more stable. Preference tuning can further improve helpfulness, reasoning, style, and alignment after the model already knows how to respond.
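Illustrative record formats for the two kinds of data (the field names are examples, not a standard schema):

```python
# SFT record: one prompt, one target response.
sft_example = {
    "prompt": "Explain backpropagation in simple terms.",
    "response": "Backpropagation is how a neural network learns from its mistakes...",
}

# Preference record: one prompt, a chosen and a rejected response.
preference_example = {
    "prompt": "Explain backpropagation in simple terms.",
    "chosen": "Backpropagation works backward through the network, assigning blame for the error...",
    "rejected": "Backpropagation is a kind of database index...",
}
```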
Benefits
- Simple and stable training objective
- Converts base models into instruction-following assistants
- Can specialize a model for a domain or workflow
- Requires much less compute than pretraining
- Works well with synthetic data from stronger models
Failure Modes
- Low-quality examples teach low-quality behavior
- Too much narrow data can cause overfitting
- The model may imitate formatting without gaining real capability
- Conflicting examples can make behavior inconsistent
- SFT alone does not optimize long-term rewards or user preferences
- Fine-tuning on small datasets can make the model forget some general abilities
Practical Notes
- Dataset quality usually matters more than dataset size
- Diverse prompts reduce overfitting to one style
- Strong SFT data should include edge cases, refusals, and hard examples
- Evaluation should test behavior, not just training loss
- SFT is often combined with LoRA or other parameter-efficient fine-tuning methods, as sketched below
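A minimal LoRA setup sketch using the peft library; the base model name and hyperparameters are illustrative, not a recommendation.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model and hyperparameters; adjust for your setup.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

The SFT loss itself is unchanged; LoRA only restricts which parameters receive gradient updates, which keeps memory and compute requirements low.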