Givens: an instruction, response 1, response 2, and a human preference label (1 or 2)

Direct Preference Optimization (DPO): given an instruction and the two responses, the trained policy (and a frozen reference model) scores the log-probability of the preferred and dispreferred response; a loss is computed from the difference of these log-ratios, and the policy's weights are updated via backpropagation.

DPO loss:
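The standard form from the DPO paper (Rafailov et al., 2023), where $y_w$ is the preferred response, $y_l$ the dispreferred one, $\sigma$ the logistic function, $\pi_\theta$ the trained policy, $\pi_{\text{ref}}$ the frozen reference model, and $\beta$ a temperature controlling how far the policy may drift from the reference:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$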
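A minimal PyTorch sketch of this loss, assuming the summed per-token log-probabilities of each response have already been computed under both models (the function name, argument names, and the $\beta = 0.1$ default are illustrative assumptions, not from the notes above):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed log-probs of the chosen (preferred) and
    rejected (dispreferred) responses under the trained policy and
    the frozen reference model. All inputs have shape (batch,)."""
    # Log-ratio of policy vs. reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen_logratio - rejected_logratio)),
    # averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Calling `.backward()` on this scalar and stepping an optimizer over the policy's parameters is the weight update step the notes describe; the reference model's log-probs are treated as constants (no gradient flows through them).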