Givens: an instruction, response 1, response 2, and a human preference label (which of the two responses is preferred)
Direct Preference Optimization (DPO) does not generate the responses itself; given an instruction and the two candidate responses, it scores both under the current policy model and a frozen reference model, computes a loss from the preferred vs. rejected log-probabilities, and updates the policy weights via backpropagation. No separate reward model or RL loop is needed.
DPO loss:
L_DPO(θ) = −E_{(x, y_w, y_l)} [ log σ( β ( log π_θ(y_w|x)/π_ref(y_w|x) − log π_θ(y_l|x)/π_ref(y_l|x) ) ) ]
where x is the instruction, y_w the preferred response, y_l the rejected one, π_θ the policy being trained, π_ref the frozen reference model, β a temperature hyperparameter, and σ the sigmoid function.
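The loss above can be sketched for a single preference pair. This is a minimal illustration, not the reference implementation; the function and argument names are my own, and the inputs are assumed to be summed token log-probabilities of each full response under the policy and the frozen reference model.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    Inputs are log-probabilities of the whole response sequence:
    log pi_theta(y_w|x), log pi_theta(y_l|x), and the same under pi_ref.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # -log sigmoid(reward margin): minimized by widening the margin
    # between the preferred and rejected response.
    z = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# Sanity check: when the policy still equals the reference model, both
# implicit rewards are zero, so the loss is -log(0.5) = log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # → 0.6931471805599453
```

In practice the log-probabilities come from two forward passes (policy and reference) over both responses, and the loss is averaged over a batch before backpropagation updates only the policy's weights.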