Watching Julia Turc’s videos
This video is about Tulu 3, made by allenai
Most models are only open weights, but Tulu 3 is the first truly open-source model: they open-source everything (weights, training data, code, and recipes)
Instruction fine-tuning (SFT)
Instruction fine-tuning is the step that comes after pre-training. It tunes the model so it changes from a text-completion model into one that responds in chat format (in the context of LLMs)
Training is done on instruction-response pairs
Trained using Public Data + Synthetic Data
Training method: next-token prediction, similar to pre-training (see the sketch below)
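A minimal sketch of what that looks like, assuming a Hugging Face-style causal LM (the "gpt2" checkpoint and the toy example are placeholders, not the actual Tulu 3 setup): the instruction and response are concatenated, the instruction tokens are masked out of the loss, and the model is trained with ordinary next-token prediction.

```python
# Minimal SFT sketch: next-token prediction with the loss masked on the instruction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

instruction = "Translate to French: Hello, how are you?"
response = " Bonjour, comment allez-vous ?"

prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids

# Copy the inputs as labels and mask the instruction tokens with -100 so the
# cross-entropy loss is only computed on the response (approximate boundary).
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss  # next-token prediction loss
loss.backward()                                       # one SFT gradient step (optimizer omitted)
```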
Preference fine-tuning
Each training example: instruction, response 1, response 2, human preference (between 1 and 2)
Teach the model pairwise ranking via Direct Preference Optimization (DPO)
Off-Policy DPO - an external LLM generates the responses, and DPO is used to update the model weights
On-Policy DPO - the model being trained generates its own responses, and an external LLM acts as a judge to approximate what the human preference would be (see the DPO sketch below)
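A minimal sketch of the DPO loss on one (instruction, chosen, rejected) triple. The log-probability values are made up; in practice each one is the sum of per-token log-probs of a whole response under the policy being trained or under a frozen reference model.

```python
# Minimal DPO loss sketch for a single preference pair.
import torch
import torch.nn.functional as F

beta = 0.1  # strength of the pull back toward the reference model

# Log-probabilities of the preferred (chosen) and dispreferred (rejected) responses
logprob_chosen_policy = torch.tensor(-12.3, requires_grad=True)
logprob_rejected_policy = torch.tensor(-10.9, requires_grad=True)
logprob_chosen_ref = torch.tensor(-13.0)      # frozen reference model, no gradients
logprob_rejected_ref = torch.tensor(-11.0)

# DPO pushes the policy to rank the chosen response above the rejected one,
# relative to how the reference model ranks them.
chosen_margin = logprob_chosen_policy - logprob_chosen_ref
rejected_margin = logprob_rejected_policy - logprob_rejected_ref
loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin))

loss.backward()  # gradients flow only into the policy's log-probs
```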
Reasoning Fine-Tuning
Reinforcement Learning with Verifiable Rewards (RLVR)
Instruction -> Model -> Response
The response is passed through a deterministic reward function (it outputs 0 or 1), and PPO is used to update the model weights
PPO is computationally expensive: it needs a trained reward model (which estimates the reward at the token level) and a value function that estimates how hard the problem is, i.e. how much we should “weigh” that reward (see the sketch below)
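A minimal sketch (not the Tulu 3 code) of a verifiable 0/1 reward plus a heavily simplified PPO-style update for one response. The value estimate, log-probabilities, and the answer-extraction regex are stand-ins for what a real trainer computes.

```python
# Verifiable reward + simplified PPO clipped objective for a single response.
import re
import torch

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Deterministic 0/1 reward: did the model produce the correct final answer?"""
    match = re.search(r"answer\s*[:=]\s*(-?\d+)", response.lower())
    return 1.0 if match and match.group(1) == gold_answer else 0.0

reward = verifiable_reward("Reasoning... Answer: 42", gold_answer="42")  # -> 1.0

# The advantage weighs the reward against a learned value baseline
# ("how much reward did we expect on this prompt?").
value_estimate = torch.tensor(0.6)                    # from the value function (critic)
advantage = torch.tensor(reward) - value_estimate

logprob_new = torch.tensor(-4.2, requires_grad=True)  # response log-prob under the updated policy
logprob_old = torch.tensor(-4.5)                      # log-prob under the policy that sampled it
ratio = torch.exp(logprob_new - logprob_old)

eps = 0.2  # PPO clipping range
ppo_loss = -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
ppo_loss.backward()
```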
Training Data
Models require a HUGE amount of training data
In all of the datasets covered in the video, the common denominator was instructions (see the record shapes below)
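As an illustration of that common denominator, here is roughly what one record of each dataset type might look like; the field names are hypothetical, not the actual Tulu 3 schemas.

```python
# Illustrative record shapes: every stage starts from an instruction.
sft_record = {
    "instruction": "Summarize the following paragraph...",
    "response": "The paragraph argues that...",
}

preference_record = {
    "instruction": "Write a haiku about autumn.",
    "response_1": "Leaves drift on cold wind...",
    "response_2": "Autumn is a season when...",
    "preferred": 1,  # human (or LLM-judge) preference between 1 and 2
}

rlvr_record = {
    "instruction": "What is 17 * 24? Give the final answer as 'Answer: <number>'.",
    "gold_answer": "408",  # lets a deterministic function score the response 0 or 1
}
```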