Paper Link: https://arxiv.org/abs/2001.08361
Performance depends strongly on scale:
- Number of non-embedding params in model (N)
- Size of the dataset, measured in tokens (D)
- Amount of compute used for training, estimated in FLOPs (C)
Within reasonable limits, performance depends only weakly on other architectural hyperparameters such as depth vs. width
Smooth Power Laws: When performance is bottlenecked by only one of the number of model params (N), size of the dataset (D), or amount of compute (C), loss improves according to a power law in that factor
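As a sketch, the three single-factor power laws can be written out with the paper's approximate fitted constants (alpha_N ≈ 0.076, alpha_D ≈ 0.095, alpha_C ≈ 0.050, and the corresponding scale constants N_c, D_c, C_c; values are my reading of the paper, so treat them as approximate):

```python
# Sketch of the paper's three single-factor power laws for loss (in nats).
# Constants are the approximate fitted values reported in the paper.
def loss_from_params(n, n_c=8.8e13, alpha_n=0.076):
    """Loss vs. non-embedding parameter count N (data and compute unlimited)."""
    return (n_c / n) ** alpha_n

def loss_from_data(d, d_c=5.4e13, alpha_d=0.095):
    """Loss vs. dataset size D in tokens (large model, early stopping)."""
    return (d_c / d) ** alpha_d

def loss_from_compute(c_min, c_c=3.1e8, alpha_c=0.050):
    """Loss vs. optimally allocated compute C_min in PF-days."""
    return (c_c / c_min) ** alpha_c

# e.g. the trend's loss estimate for a 1B-non-embedding-parameter model:
print(round(loss_from_params(1e9), 3))
```

Each law only applies when the other two factors are not the bottleneck, which is what the "limiting factor" caveat above means.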
Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. The penalty is predictable: every time we increase the model size 8x, we only need to increase the data ~5x to avoid a penalty
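The 8x/5x rule falls out of the fitted exponents: to keep the overfitting penalty fixed, D should grow like N^(alpha_N/alpha_D). A quick check with the paper's approximate exponent values:

```python
# To avoid an overfitting penalty, data should scale as D ~ N^(alpha_N/alpha_D).
alpha_n, alpha_d = 0.076, 0.095  # approximate fitted values from the paper
data_factor = 8 ** (alpha_n / alpha_d)  # data growth needed for an 8x model
print(round(data_factor, 2))  # ≈ 5.28, i.e. the "~5x data" rule
```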
Universality of training: Training curves follow predictable power laws whose parameters are roughly independent of model size. By extrapolating the early part of a training curve, we can predict the loss that would be achieved by training for much longer
Sample Efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and fewer data points
Convergence is inefficient: When working within a fixed compute budget (C) but without restrictions on model size (N) or available data (D), we attain optimal performance by training very large models and stopping significantly short of convergence. Maximally compute-efficient training is therefore far more sample-efficient than training small models to convergence, with data requirements growing only slowly with compute (roughly D ∝ C^0.27)
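A minimal sketch of what compute-optimal allocation looks like, using the paper's approximate scaling exponents (model size N ∝ C^0.73, tokens D ∝ C^0.27); the anchor point `base_*` values are made-up illustrative numbers, not from the paper:

```python
# Hedged sketch: split a growing compute budget per the paper's approximate
# compute-optimal exponents. Anchor values (base_*) are illustrative only.
def compute_optimal_allocation(c, base_c=1.0, base_n=1.0e9, base_d=2.0e10):
    """Scale model size as N ~ C^0.73 and data as D ~ C^0.27 from an anchor."""
    ratio = c / base_c
    n = base_n * ratio ** 0.73  # parameters grow quickly with compute
    d = base_d * ratio ** 0.27  # tokens grow slowly with compute
    return n, d

# With 100x the anchor compute, the model grows ~29x but data only ~3.5x:
n, d = compute_optimal_allocation(100.0)
```

The point of the sketch is the asymmetry: most of extra compute should go into model size, not data, which is why stopping large models short of convergence beats training small models to convergence.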
Optimal Batch Size: The ideal batch size for training is roughly a power of the loss only, and can be determined by measuring the gradient noise scale; it is roughly 1-2 million tokens at convergence for the largest models we can train
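The "power of the loss" relation can be sketched as B_crit(L) = B* / L^(1/alpha_B); the constants below (B* ≈ 2e8 tokens, alpha_B ≈ 0.21) are my reading of the paper's approximate fits, so treat them as assumptions:

```python
# Sketch of the critical batch size relation B_crit(L) = B* / L^(1/alpha_B).
# B* and alpha_B are approximate fitted values as I read them from the paper.
def critical_batch_tokens(loss, b_star=2e8, alpha_b=0.21):
    """Critical batch size in tokens as a function of the current loss (nats)."""
    return b_star / loss ** (1.0 / alpha_b)

# Lower loss implies a larger critical batch; near loss ~3 nats this lands
# around the ~1M-token scale mentioned above.
print(int(critical_batch_tokens(3.0)))
```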
This paper is very interesting, I didn’t read much past the summaries (not exactly optimizing for what they’re doing at the moment); Worth reading if/when at that stage