Givens: an instruction, response 1, response 2, and a human preference label (1 or 2)

Direct Preference Optimization (DPO): given an instruction and the two responses, the trained policy (and a frozen reference model) scores the log-probability of the preferred and dispreferred response; a loss is computed from the difference of these log-ratios, and the policy's weights are updated via backpropagation.

DPO loss:
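The standard form from the DPO paper (Rafailov et al., 2023), where $y_w$ is the preferred response, $y_l$ the dispreferred one, $\sigma$ the logistic function, $\pi_\theta$ the trained policy, $\pi_{\text{ref}}$ the frozen reference model, and $\beta$ a temperature controlling how far the policy may drift from the reference:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$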
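A minimal PyTorch sketch of this loss, assuming the summed per-token log-probabilities of each response have already been computed under both models (the function name, argument names, and the $\beta = 0.1$ default are illustrative assumptions, not from the notes above):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed log-probs of the chosen (preferred) and
    rejected (dispreferred) responses under the trained policy and
    the frozen reference model. All inputs have shape (batch,)."""
    # Log-ratio of policy vs. reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen_logratio - rejected_logratio)),
    # averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Calling `.backward()` on this scalar and stepping an optimizer over the policy's parameters is the weight update step the notes describe; the reference model's log-probs are treated as constants (no gradient flows through them).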