Watching Julia Turc’s videos

This video centers on Tulu 3, made by AllenAI

Most models are open-weights only, but Tulu 3 is the first truly open-source model: they open-source everything, not just the weights

https://allenai.org/tulu

Instruction fine-tuning (SFT)

Instruction fine-tuning is the step that comes after pre-training. In the context of LLMs, this stage tunes the model from a text-completion model into one that responds in a chat format

Training is done on instruction-response pairs

Trained using Public Data + Synthetic Data

Training method: Next token prediction (similar to pre-training)
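
A minimal sketch of what this looks like in code (assuming Hugging Face transformers and PyTorch; the model name and the single example pair are placeholders, not what Tulu 3 actually uses):

```python
# Minimal SFT sketch: next-token prediction on instruction-response pairs.
# Assumes Hugging Face transformers + PyTorch; names below are placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

pairs = [
    {"instruction": "Translate 'bonjour' to English.", "response": "Hello."},
]

for ex in pairs:
    prompt = f"### Instruction:\n{ex['instruction']}\n### Response:\n"
    full = prompt + ex["response"] + tok.eos_token
    ids = tok(full, return_tensors="pt").input_ids
    labels = ids.clone()
    # Mask the instruction tokens so the loss is computed only on the response.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels[:, :prompt_len] = -100

    loss = model(input_ids=ids, labels=labels).loss  # standard next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```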

Preference fine-tuning

Data: instruction, response 1, response 2, human preference (which of the two is better)

Teach the model pairwise ranking via Direct Preference Optimization (DPO)

Off-Policy DPO - responses in the preference pairs are generated by external LLMs, and DPO updates the model's weights on them

On-Policy DPO - the model being trained generates its own responses, and an external LLM acts as a judge to approximate what the human preference would be
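
The DPO objective itself is compact; a sketch of the pairwise loss, assuming sequence log-probabilities have already been computed under the policy being trained and a frozen reference model (e.g. the SFT checkpoint):

```python
# Sketch of the DPO pairwise-ranking loss.
# Inputs are summed log-probabilities of the chosen/rejected responses under
# the policy being trained and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers "chosen" over "rejected"...
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # ...relative to how much the reference model already preferred it.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Push the policy's margin above the reference margin, scaled by beta.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up log-probs for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-13.0, -9.8]), torch.tensor([-13.5, -9.2]))
```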

Reasoning Fine-Tuning

Reinforcement Learning with Verifiable Rewards (RLVR)

Instruction -> Model -> Response

The response is passed through a deterministic reward function (outputs 0 or 1), and PPO is used to update the model weights
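
A verifiable reward function is just a deterministic check against ground truth; a toy example for math-style prompts (the answer-extraction regex is a simplification for illustration):

```python
# Toy verifiable reward for RLVR: deterministic, returns 0 or 1.
# Assumes prompts where the ground-truth answer is known;
# the extraction logic is a simplification.
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    # Take the last number in the response as the model's final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(verifiable_reward("2 + 2 = 4, so the answer is 4", "4"))  # 1.0
print(verifiable_reward("The answer is 5", "4"))                # 0.0
```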

PPO is computationally expensive: it needs a trained reward model (which estimates reward at the token level) and a value function, which estimates how hard the problem is and how much to weight that reward
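
A sketch of the PPO clipped objective showing where those extra pieces come in; here the reward would come from the verifiable reward function and the value estimate from the separate value (critic) model (all names are illustrative):

```python
# Sketch of the PPO clipped policy objective.
# The reward supplies the return, and a separate value function provides the
# baseline used to compute the advantage.
import torch

def ppo_policy_loss(new_logp, old_logp, reward, value_estimate, clip_eps=0.2):
    # Advantage: how much better the sampled response did than the value
    # function expected for this prompt.
    advantage = reward - value_estimate
    # Probability ratio between the updated policy and the one that sampled the data.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # PPO takes the pessimistic (minimum) of the two to keep updates small.
    return -torch.min(unclipped, clipped).mean()

# Toy usage: one response with reward 1.0 where the critic predicted 0.4.
loss = ppo_policy_loss(torch.tensor([-9.0]), torch.tensor([-9.3]),
                       torch.tensor([1.0]), torch.tensor([0.4]))
```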

Training Data

Models require a HUGE amount of training data

Across all of the datasets covered in the video, the common denominator was instructions