On-policy distillation is a way to train a student model from a stronger teacher using training examples drawn from the student’s own current behavior distribution.

This is different from ordinary Knowledge Distillation, where the student is usually trained on a fixed dataset or on prompts sampled independently of the student. In on-policy distillation, the student generates actions/responses, the teacher labels or improves those same trajectories, and the student updates toward the teacher’s behavior on the states it actually visits.

Core Idea

Train the student on its own mistakes.

If a student model is only trained on teacher-generated data, it may perform well near the teacher’s distribution but fail when its own outputs drift into unfamiliar states. On-policy distillation reduces this mismatch by repeatedly:

  1. Sampling tasks/prompts from a dataset
  2. Letting the student act or generate responses
  3. Asking the teacher to provide better actions, logits, feedback, or corrected completions for those same states
  4. Updating the student to imitate the teacher on the student’s visited distribution

This is similar in spirit to DAgger in imitation learning: the learner’s own behavior generates the state distribution, while the expert supplies the supervision.
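
A minimal sketch of one iteration of this loop is below. The `student`, `teacher`, `sample_prompts`, and `update_student` helpers are hypothetical and only illustrate the structure; they are not from any particular library.

```python
# Minimal sketch of one on-policy distillation iteration.
# `student`, `teacher`, `sample_prompts`, and `update_student` are
# hypothetical helpers used only for illustration.

def on_policy_distillation_step(student, teacher, prompt_dataset, batch_size=32):
    # 1. Sample tasks/prompts from a dataset
    prompts = sample_prompts(prompt_dataset, batch_size)

    # 2. Let the *student* act: generate completions with its current parameters
    student_completions = [student.generate(p) for p in prompts]

    # 3. Ask the teacher for supervision on those same states
    #    (here: teacher feedback over the student's own completions)
    teacher_targets = [
        teacher.score(prompt=p, completion=c)
        for p, c in zip(prompts, student_completions)
    ]

    # 4. Update the student to imitate the teacher on its own visited states
    update_student(student, prompts, student_completions, teacher_targets)
```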

Why “On-Policy”?

In Reinforcement Learning, a policy is the agent’s rule for choosing actions.

  • On-policy means the training data comes from the current policy being optimized
  • Off-policy means the training data comes from another policy, replay buffer, fixed dataset, or old model

For LLMs, the “policy” is the model’s next-token distribution. On-policy distillation means the student samples completions using its current parameters, then receives teacher supervision on those completions.
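
The distinction can be seen in code; the sketch below uses hypothetical `student.generate` and dataset fields, and the only difference between the two cases is where the completions come from.

```python
# Illustrative contrast between data sources; names are hypothetical.

# Off-policy: completions come from a fixed dataset, an older checkpoint,
# or some other model, not from the policy currently being trained.
def off_policy_batch(dataset):
    return [(ex["prompt"], ex["completion"]) for ex in dataset]

# On-policy: completions are sampled from the student's current parameters,
# so training sees exactly the states the current student produces.
def on_policy_batch(student, prompts):
    return [(p, student.generate(p)) for p in prompts]
```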

Objective

A common objective is to minimize the divergence between the teacher’s distribution and the student’s distribution on states sampled from the student:

    L(θ) = E_{x ∼ D} E_{y_{<t} ∼ π_θ(· | x)} [ D_KL( π_T(· | x, y_{<t}) ‖ π_θ(· | x, y_{<t}) ) ]

Where:

  • π_θ = student policy
  • π_T = teacher policy
  • x = prompt or task
  • y_{<t} = partial completion generated by the student
  • D_KL = KL divergence between teacher and student token distributions
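
One possible implementation of this loss, sketched in PyTorch; the tensor shapes and the teacher-to-student KL direction are assumptions for illustration, matching the definitions above.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(teacher_logits, student_logits, mask):
    """Per-token KL(teacher || student) on student-sampled completions.

    teacher_logits, student_logits: [batch, seq_len, vocab]
    mask: [batch, seq_len], 1 for completion tokens generated by the student.
    """
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)

    # KL(p_teacher || p_student) summed over the vocabulary at each position
    per_token_kl = torch.sum(
        teacher_log_probs.exp() * (teacher_log_probs - student_log_probs),
        dim=-1,
    )

    # Average only over the tokens the student actually generated
    return (per_token_kl * mask).sum() / mask.sum().clamp(min=1)
```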

In practice, the teacher signal may be:

  • Full teacher logits
  • Sampled teacher completions
  • Corrected student responses
  • Preference labels
  • Reward-model scores
  • Verifier feedback

LLM Post-Training Use

For language models, on-policy distillation is useful when consolidating several stronger or specialized models into one deployable model.

Example flow:

  1. Train or fine-tune specialist teachers for different domains
  2. Sample prompts from each target domain
  3. Let the student model generate answers
  4. Route each generated answer to the relevant teacher, verifier, or reward model
  5. Train the student to match the improved behavior

This can preserve expert capabilities while reducing inference cost, latency, and serving complexity.
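
A rough sketch of steps 3 to 5 is shown below, with a hypothetical `teachers_by_domain` mapping and illustrative `generate`/`score` methods standing in for whatever routing and scoring machinery is actually used.

```python
# Hypothetical sketch of steps 3-5: route each student answer to the
# relevant specialist teacher and collect supervision for training.

def build_distillation_batch(student, teachers_by_domain, prompts_with_domain):
    batch = []
    for prompt, domain in prompts_with_domain:
        # 3. Student generates an answer with its current parameters
        answer = student.generate(prompt)

        # 4. Route to the specialist teacher (or verifier / reward model)
        #    responsible for this domain
        teacher = teachers_by_domain[domain]
        target = teacher.score(prompt=prompt, completion=answer)

        # 5. The (prompt, answer, target) triple becomes a training example
        #    for matching the teacher's behavior on the student's own output
        batch.append({"prompt": prompt, "answer": answer, "target": target})
    return batch
```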

Benefits

  • Reduces distribution shift between training and inference
  • Teaches the student how to recover from its own partial outputs
  • Makes distillation more adaptive than a fixed supervised dataset
  • Can merge multiple specialist teachers into a single general model
  • Often cheaper at inference time because only the student is deployed

Failure Modes

  • Expensive data generation because the teacher must be queried repeatedly
  • Student can reinforce bad trajectories if teacher feedback is weak or sparse
  • Requires careful sampling so the student does not overfit to easy prompts
  • Teacher labels may be inconsistent across domains or specialists
  • If the student is too weak, its on-policy samples may be too poor for efficient learning

Relation to RL

On-policy distillation sits between supervised learning and Reinforcement Learning.

It looks like supervised learning because the student is trained with direct teacher targets. But it is on-policy because the training states are generated by the student’s current behavior, not by a static dataset.

Compared with RL, it usually has lower variance because the model gets dense teacher supervision instead of only scalar rewards. Compared with standard distillation, it is more robust because the student learns on the distribution it will actually encounter at inference.