Paper Link: https://arxiv.org/abs/2001.08361
Performance depends strongly on scale:
- Number of non-embedding params in model (N)
- Size of the dataset, measured in tokens (D)
- Amount of compute used for training, estimated in FLOPs (C)
Within reasonable limits, performance depends only weakly on other architectural hyperparameters such as depth vs. width
Smooth Power Laws: When performance is bottlenecked by only one of the number of model params (N), size of the dataset (D), or amount of compute (C), loss improves according to a power law in that factor
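As a sketch, the three single-factor power laws can be written out with the paper's approximate fitted constants (alpha_N ≈ 0.076, alpha_D ≈ 0.095, alpha_C ≈ 0.050, and the corresponding scale constants N_c, D_c, C_c; values are my reading of the paper, so treat them as approximate):

```python
# Sketch of the paper's three single-factor power laws for loss (in nats).
# Constants are the approximate fitted values reported in the paper.
def loss_from_params(n, n_c=8.8e13, alpha_n=0.076):
    """Loss vs. non-embedding parameter count N (data and compute unlimited)."""
    return (n_c / n) ** alpha_n

def loss_from_data(d, d_c=5.4e13, alpha_d=0.095):
    """Loss vs. dataset size D in tokens (large model, early stopping)."""
    return (d_c / d) ** alpha_d

def loss_from_compute(c_min, c_c=3.1e8, alpha_c=0.050):
    """Loss vs. optimally allocated compute C_min in PF-days."""
    return (c_c / c_min) ** alpha_c

# e.g. the trend's loss estimate for a 1B-non-embedding-parameter model:
print(round(loss_from_params(1e9), 3))
```

Each law only applies when the other two factors are not the bottleneck, which is what the "limiting factor" caveat above means.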
Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. The penalty is predictable: every time we increase the model size 8x, we only need to increase the data ~5x to avoid a penalty
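The 8x/5x rule falls out of the fitted exponents: to keep the overfitting penalty fixed, D should grow like N^(alpha_N/alpha_D). A quick check with the paper's approximate exponent values:

```python
# To avoid an overfitting penalty, data should scale as D ~ N^(alpha_N/alpha_D).
alpha_n, alpha_d = 0.076, 0.095  # approximate fitted values from the paper
data_factor = 8 ** (alpha_n / alpha_d)  # data growth needed for an 8x model
print(round(data_factor, 2))  # ≈ 5.28, i.e. the "~5x data" rule
```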
Universality of training: Training curves follow predictable power laws whose parameters are roughly independent of model size. By extrapolating the early part of a training curve, we can predict the loss that would be achieved by training for much longer
Sample Efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and fewer data points
Convergence is inefficient: When working within a fixed compute budget (C) but without restrictions on model size (N) or available data (D), we attain optimal performance by training very large models and stopping significantly short of convergence. Maximally compute-efficient training is therefore far more sample-efficient than training small models to convergence, with data requirements growing only slowly with compute (roughly D ∝ C^0.27)
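A minimal sketch of what compute-optimal allocation looks like, using the paper's approximate scaling exponents (model size N ∝ C^0.73, tokens D ∝ C^0.27); the anchor point `base_*` values are made-up illustrative numbers, not from the paper:

```python
# Hedged sketch: split a growing compute budget per the paper's approximate
# compute-optimal exponents. Anchor values (base_*) are illustrative only.
def compute_optimal_allocation(c, base_c=1.0, base_n=1.0e9, base_d=2.0e10):
    """Scale model size as N ~ C^0.73 and data as D ~ C^0.27 from an anchor."""
    ratio = c / base_c
    n = base_n * ratio ** 0.73  # parameters grow quickly with compute
    d = base_d * ratio ** 0.27  # tokens grow slowly with compute
    return n, d

# With 100x the anchor compute, the model grows ~29x but data only ~3.5x:
n, d = compute_optimal_allocation(100.0)
```

The point of the sketch is the asymmetry: most of extra compute should go into model size, not data, which is why stopping large models short of convergence beats training small models to convergence.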
Optimal Batch Size: The ideal batch size for training is roughly a power of the loss only, and can be determined by measuring the gradient noise scale; it is roughly 1-2 million tokens at convergence for the largest models we can train
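The "power of the loss" relation can be sketched as B_crit(L) = B* / L^(1/alpha_B); the constants below (B* ≈ 2e8 tokens, alpha_B ≈ 0.21) are my reading of the paper's approximate fits, so treat them as assumptions:

```python
# Sketch of the critical batch size relation B_crit(L) = B* / L^(1/alpha_B).
# B* and alpha_B are approximate fitted values as I read them from the paper.
def critical_batch_tokens(loss, b_star=2e8, alpha_b=0.21):
    """Critical batch size in tokens as a function of the current loss (nats)."""
    return b_star / loss ** (1.0 / alpha_b)

# Lower loss implies a larger critical batch; near loss ~3 nats this lands
# around the ~1M-token scale mentioned above.
print(int(critical_batch_tokens(3.0)))
```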
This paper is very interesting, I didn’t read much past the summaries (not exactly optimizing for what they’re doing at the moment); Worth reading if/when at that stage