Llama 4’s release blog: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

Mixture of Experts

The new Llama 4 models are the first Llama models to use a Mixture of Experts (MoE) architecture

To learn more about this architecture (not Llama 4 specific), check out Mixture of Experts

Llama 4 Specifications:

  • Each MoE layer uses 128 routed experts and a shared expert

  • Each token is sent to the shared expert and to one of the 128 routed experts (see the sketch after this list)

  • While all of the parameters are stored in memory, only a subset of them is activated per token, which improves inference efficiency by lowering model serving costs and latency
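The sketch below shows the routing mechanics described above: every token goes through a shared expert, a router picks one of the routed experts for it, and only that expert runs. It is a minimal illustration, not Meta's implementation; the layer sizes, gating choice (softmax top-1), and module names are assumptions.

```python
# Minimal sketch of Llama 4-style MoE routing: shared expert + top-1 of 128
# routed experts. Sizes, gating, and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_routed: int = 128):
        super().__init__()
        self.router = nn.Linear(d_model, n_routed, bias=False)
        # Shared expert: every token passes through it.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        # Routed experts: each token is sent to exactly one of them.
        self.routed_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                    # (num_tokens, n_routed)
        probs = F.softmax(logits, dim=-1)
        top_prob, top_idx = probs.max(dim=-1)      # top-1 routing per token
        out = self.shared_expert(x)                # shared path runs for all tokens
        routed_out = torch.zeros_like(x)
        for expert_id in top_idx.unique():         # only the selected experts run
            mask = top_idx == expert_id
            routed_out[mask] = self.routed_experts[int(expert_id)](x[mask])
        # Weight the routed path by its gate probability and add the shared path.
        return out + top_prob.unsqueeze(-1) * routed_out

tokens = torch.randn(16, 512)      # 16 tokens
y = MoELayer()(tokens)             # (16, 512)
```

Even though all 128 routed experts sit in memory, each token only pays the compute cost of the shared expert plus one routed expert, which is where the serving-cost and latency savings come from.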

Early Fusion

Llama 4 is natively multimodal: it uses early fusion to integrate text and vision tokens into a unified model backbone, so the model can be pre-trained jointly on text, image, and video data
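A rough sketch of the idea, assuming a simple linear patch projection and made-up dimensions (this is not Llama 4's actual vision encoder): image patch embeddings and text token embeddings are concatenated into one sequence before the first transformer layer, so both modalities flow through the same backbone from the start.

```python
# Early fusion sketch: merge image patches and text tokens into one sequence.
# Vocabulary size, patch size, and d_model are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 512
text_embed = nn.Embedding(32000, d_model)        # token id -> embedding
patch_proj = nn.Linear(3 * 16 * 16, d_model)     # flattened 16x16 RGB patch -> embedding

text_ids = torch.randint(0, 32000, (1, 20))      # 20 text tokens
patches = torch.randn(1, 64, 3 * 16 * 16)        # 64 flattened image patches

# One mixed sequence: (1, 64 + 20, d_model), fed to the shared transformer stack.
tokens = torch.cat([patch_proj(patches), text_embed(text_ids)], dim=1)
```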

MetaP

MetaP is a training technique for reliably setting critical per-layer hyper-parameters, e.g. per-layer learning rates and initialization scales.
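MetaP's actual procedure isn't spelled out in the blog, so the sketch below only shows the mechanics it operates on: giving each layer its own initialization scale and learning rate via optimizer parameter groups. The depth-based schedules here are invented placeholders, not MetaP's rules.

```python
# Per-layer hyper-parameters (init scale + learning rate) via parameter groups.
# The specific formulas below are hypothetical, not MetaP's actual schedule.
import torch
import torch.nn as nn

layers = nn.ModuleList(nn.Linear(512, 512) for _ in range(4))

param_groups = []
for depth, layer in enumerate(layers):
    init_scale = 0.02 / (depth + 1) ** 0.5       # hypothetical per-layer init scale
    nn.init.normal_(layer.weight, std=init_scale)
    param_groups.append({
        "params": layer.parameters(),
        "lr": 3e-4 / (depth + 1),                # hypothetical per-layer learning rate
    })

optimizer = torch.optim.AdamW(param_groups)      # each layer trains with its own lr
```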