Group Relative Policy Optimization

Blog Link: https://cameronrwolfe.substack.com/p/grpo

Group Relative Policy Optimization (GRPO) is heavily based on Proximal Policy Optimization (PPO).

Policy update:

Advantage:

PPO Loss:

GRPO Loss:
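For reference, the four quantities above are usually written as follows in the PPO and GRPO literature. This is a sketch in standard notation, not necessarily the notation the blog uses: the symbols $\alpha$ (learning rate), $\rho$ (probability ratio), $R_i$ (reward of the $i$-th completion), $G$ (group size), $\epsilon$ (clip range) and $\beta$ (KL weight) are assumptions made here. Both objectives are maximized, so the corresponding loss is the negative of each expression.

```latex
% Policy update: gradient ascent on the expected return J(\theta)
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)

% Advantage: how much better action a_t is than the policy's average behavior in state s_t
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)

% PPO: clipped surrogate objective built on the ratio between new and old policy
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

J_{\text{PPO}}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta)\, A_t,\;
    \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t \right) \right]

% GRPO: same clipped ratio per token, a group-normalized advantage, and a KL term
% toward the reference policy. q is the prompt, o_i the i-th sampled completion.
\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}

\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}

J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
    \left( \min\left( \rho_{i,t}\, \hat{A}_{i,t},\;
    \operatorname{clip}\left(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t} \right)
    - \beta \, D_{\text{KL}}\left[ \pi_\theta \,\|\, \pi_{\text{ref}} \right] \right) \right]
```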

- π_ref (the reference policy) refers to the model before we started the fine-tuning stage altogether
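The "group relative" part of GRPO comes from how the advantage is estimated: several completions are sampled for the same prompt, and each completion's reward is normalized against the others in its group instead of against a learned value function. Below is a minimal sketch of that normalization, assuming one scalar reward per completion; the function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the blog.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    rewards: one scalar reward per sampled completion (hypothetical input).
    eps: small constant (an assumption) to avoid division by zero when all
         rewards in the group are identical.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt: two judged correct, two judged incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate critic model is needed.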

KL Divergence

When training models with PPO, we incorporate the KL divergence between the current and reference policies. The divergence serves as a penalty that encourages the current policy to stay similar to the reference policy.
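One common way to apply such a penalty in PPO-style fine-tuning is to subtract a scaled per-token KL estimate from the reward. The sketch below assumes that setup: the weight `beta`, the choice to add the sequence-level reward at the final token, and the function name `kl_penalized_rewards` are all illustrative assumptions, and the per-token estimate used is the simple log-probability difference.

```python
import numpy as np

def kl_penalized_rewards(sequence_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Build per-token rewards with a KL penalty toward the reference policy.

    policy_logprobs / ref_logprobs: log-probability each model assigns to the
    tokens that were actually sampled (one value per token).
    beta: penalty weight (assumed hyperparameter).
    """
    policy_logprobs = np.asarray(policy_logprobs, dtype=np.float64)
    ref_logprobs = np.asarray(ref_logprobs, dtype=np.float64)
    per_token_kl = policy_logprobs - ref_logprobs   # simple per-token estimate
    rewards = -beta * per_token_kl                  # penalize drifting from the reference
    rewards[-1] += sequence_reward                  # assumed: task reward lands on the last token
    return rewards

shaped = kl_penalized_rewards(
    sequence_reward=1.0,
    policy_logprobs=[-1.0, -2.0, -0.5],
    ref_logprobs=[-1.5, -2.0, -1.0],
    beta=0.1,
)
print(shaped)
```

Tokens where the current policy is more confident than the reference are penalized, while tokens where both models agree are left untouched.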

KL divergence is computed by comparing the token distributions produced by the two LLMs at each token position in a sequence.
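As a rough illustration of this per-token comparison, the sketch below computes the exact KL divergence at every position from two sets of logits over a toy vocabulary. It assumes the direction KL(policy || reference); the helper names `log_softmax` and `per_token_kl` and the example logits are made up for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_kl(policy_logits, ref_logits):
    """KL(policy || reference) at each position; inputs have shape (seq_len, vocab)."""
    logp = log_softmax(policy_logits)
    logq = log_softmax(ref_logits)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)

# Two positions, vocabulary of three tokens (toy values).
policy_logits = np.array([[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]])
ref_logits = np.array([[1.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
kl = per_token_kl(policy_logits, ref_logits)
print(kl)
```

The second position has identical distributions, so its divergence is zero; the first position is positive because the two models disagree there.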

The easiest way to approximate the divergence is to take the difference in log probabilities between the policy and the reference for each sampled token.
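The reason this works is that KL(policy || reference) is the expected value of `log p(x) - log q(x)` when `x` is sampled from the policy, so averaging the log-probability difference over sampled tokens gives a Monte Carlo estimate. The sketch below checks this on a toy three-token distribution; the distributions `p` and `q`, the sample size, and the seed are arbitrary choices for illustration.

```python
import numpy as np

# Toy next-token distributions over a three-token vocabulary.
p = np.array([0.5, 0.3, 0.2])   # current policy
q = np.array([0.4, 0.4, 0.2])   # reference policy

# Exact KL(p || q) from the full distributions.
exact_kl = float(np.sum(p * (np.log(p) - np.log(q))))

# Approximation: sample tokens from the policy and average the log-prob difference.
rng = np.random.default_rng(0)
tokens = rng.choice(len(p), size=200_000, p=p)
log_ratio = np.log(p[tokens]) - np.log(q[tokens])
approx_kl = float(log_ratio.mean())

print(exact_kl, approx_kl)
```

Individual log-probability differences can be negative even though the true KL divergence never is, so the estimate is only meaningful as an average over many tokens.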

Ayush Garg
