On-policy distillation is a way to train a student model from a stronger teacher using training examples drawn from the student’s own current behavior distribution.

This is different from ordinary Knowledge Distillation, where the student is usually trained on a fixed dataset or on prompts sampled independently of the student. In on-policy distillation, the student generates actions/responses, the teacher labels or improves those same trajectories, and the student updates toward the teacher’s behavior on the states it actually visits.

Core Idea

Train the student on its own mistakes.

If a student model is only trained on teacher-generated data, it may perform well near the teacher’s distribution but fail when its own outputs drift into unfamiliar states. On-policy distillation reduces this mismatch by repeatedly:

  1. Sampling tasks/prompts from a dataset
  2. Letting the student act or generate responses
  3. Asking the teacher to provide better actions, logits, feedback, or corrected completions for those same states
  4. Updating the student to imitate the teacher on the student’s visited distribution

This is similar in spirit to DAgger in imitation learning: the learner’s own behavior generates the state distribution, while the expert supplies the supervision.
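
A minimal sketch of one iteration of this loop is below. The `student`, `teacher`, `sample_prompts`, and `update_student` helpers are hypothetical and only illustrate the structure; they are not from any particular library.

```python
# Minimal sketch of one on-policy distillation iteration.
# `student`, `teacher`, `sample_prompts`, and `update_student` are
# hypothetical helpers used only for illustration.

def on_policy_distillation_step(student, teacher, prompt_dataset, batch_size=32):
    # 1. Sample tasks/prompts from a dataset
    prompts = sample_prompts(prompt_dataset, batch_size)

    # 2. Let the *student* act: generate completions with its current parameters
    student_completions = [student.generate(p) for p in prompts]

    # 3. Ask the teacher for supervision on those same states
    #    (here: teacher feedback over the student's own completions)
    teacher_targets = [
        teacher.score(prompt=p, completion=c)
        for p, c in zip(prompts, student_completions)
    ]

    # 4. Update the student to imitate the teacher on its own visited states
    update_student(student, prompts, student_completions, teacher_targets)
```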

Why “On-Policy”?

In Reinforcement Learning, a policy is the agent’s rule for choosing actions.

  • On-policy means the training data comes from the current policy being optimized
  • Off-policy means the training data comes from another policy, replay buffer, fixed dataset, or old model

For LLMs, the “policy” is the model’s next-token distribution. On-policy distillation means the student samples completions using its current parameters, then receives teacher supervision on those completions.
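
The distinction can be seen in code; the sketch below uses hypothetical `student.generate` and dataset fields, and the only difference between the two cases is where the completions come from.

```python
# Illustrative contrast between data sources; names are hypothetical.

# Off-policy: completions come from a fixed dataset, an older checkpoint,
# or some other model, not from the policy currently being trained.
def off_policy_batch(dataset):
    return [(ex["prompt"], ex["completion"]) for ex in dataset]

# On-policy: completions are sampled from the student's current parameters,
# so training sees exactly the states the current student produces.
def on_policy_batch(student, prompts):
    return [(p, student.generate(p)) for p in prompts]
```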

Objective

A common objective is to minimize the divergence between the teacher’s distribution and the student’s distribution on states sampled from the student:

    L(θ) = E_{x ∼ D} E_{y_{<t} ∼ π_θ(· | x)} [ D_KL( π_T(· | x, y_{<t}) ‖ π_θ(· | x, y_{<t}) ) ]

Where:

  • π_θ = student policy
  • π_T = teacher policy
  • x = prompt or task
  • y_{<t} = partial completion generated by the student
  • D_KL = KL divergence between teacher and student token distributions
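
One possible implementation of this loss, sketched in PyTorch; the tensor shapes and the teacher-to-student KL direction are assumptions for illustration, matching the definitions above.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(teacher_logits, student_logits, mask):
    """Per-token KL(teacher || student) on student-sampled completions.

    teacher_logits, student_logits: [batch, seq_len, vocab]
    mask: [batch, seq_len], 1 for completion tokens generated by the student.
    """
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)

    # KL(p_teacher || p_student) summed over the vocabulary at each position
    per_token_kl = torch.sum(
        teacher_log_probs.exp() * (teacher_log_probs - student_log_probs),
        dim=-1,
    )

    # Average only over the tokens the student actually generated
    return (per_token_kl * mask).sum() / mask.sum().clamp(min=1)
```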

In practice, the teacher signal may be:

  • Full teacher logits
  • Sampled teacher completions
  • Corrected student responses
  • Preference labels
  • Reward-model scores
  • Verifier feedback

LLM Post-Training Use

For language models, on-policy distillation is useful when consolidating several stronger or specialized models into one deployable model.

Example flow:

  1. Train or fine-tune specialist teachers for different domains
  2. Sample prompts from each target domain
  3. Let the student model generate answers
  4. Route each generated answer to the relevant teacher, verifier, or reward model
  5. Train the student to match the improved behavior

This can preserve expert capabilities while reducing inference cost, latency, and serving complexity.
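
A rough sketch of steps 3 to 5 is shown below, with a hypothetical `teachers_by_domain` mapping and illustrative `generate`/`score` methods standing in for whatever routing and scoring machinery is actually used.

```python
# Hypothetical sketch of steps 3-5: route each student answer to the
# relevant specialist teacher and collect supervision for training.

def build_distillation_batch(student, teachers_by_domain, prompts_with_domain):
    batch = []
    for prompt, domain in prompts_with_domain:
        # 3. Student generates an answer with its current parameters
        answer = student.generate(prompt)

        # 4. Route to the specialist teacher (or verifier / reward model)
        #    responsible for this domain
        teacher = teachers_by_domain[domain]
        target = teacher.score(prompt=prompt, completion=answer)

        # 5. The (prompt, answer, target) triple becomes a training example
        #    for matching the teacher's behavior on the student's own output
        batch.append({"prompt": prompt, "answer": answer, "target": target})
    return batch
```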

Benefits

  • Reduces distribution shift between training and inference
  • Teaches the student how to recover from its own partial outputs
  • Makes distillation more adaptive than a fixed supervised dataset
  • Can merge multiple specialist teachers into a single general model
  • Often cheaper at inference time because only the student is deployed

Failure Modes

  • Expensive data generation because the teacher must be queried repeatedly
  • Student can reinforce bad trajectories if teacher feedback is weak or sparse
  • Requires careful sampling so the student does not overfit to easy prompts
  • Teacher labels may be inconsistent across domains or specialists
  • If the student is too weak, its on-policy samples may be too poor for efficient learning

Relation to RL

On-policy distillation sits between supervised learning and Reinforcement Learning.

It looks like supervised learning because the student is trained with direct teacher targets. But it is on-policy because the training states are generated by the student’s current behavior, not by a static dataset.

Compared with RL, it usually has lower variance because the model gets dense teacher supervision instead of only scalar rewards. Compared with standard distillation, it is more robust because the student learns on the distribution it will actually encounter at inference.