Technical Report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
Introduction
The V4 series retains Mixture of Experts (MoE) and Multi-Token Prediction (MTP)
They introduce:
- Hybrid Attention Mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA)
  - CSA compresses the KV cache and performs DeepSeek Sparse Attention (DSA) over it
  - HCA applies more aggressive compression to the KV cache but keeps dense attention
- Manifold-Constrained Hyper-Connections to strengthen modeling capabilities
- Muon optimizer for training
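The split between CSA and HCA can be pictured with a toy example. This is only a sketch under loud assumptions: these notes do not say how the report compresses the KV cache or how DSA selects entries, so `compress_kv` uses mean pooling as a stand-in compressor, `attend` reuses the attention scores as the top-k selection signal (the notes mention a separate indexer), and the ratios 2 and 4 are arbitrary. All function names are hypothetical.

```python
import math

def compress_kv(entries, ratio):
    """Mean-pool every `ratio` consecutive vectors into one (stand-in compressor)."""
    pooled = []
    for start in range(0, len(entries), ratio):
        block = entries[start:start + ratio]
        dim = len(block[0])
        pooled.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return pooled

def attend(query, keys, values, top_k=None):
    """Scaled softmax attention; with top_k set, only the best-scoring keys take part."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    chosen = list(range(len(keys)))
    if top_k is not None:
        # Sparse path: keep the top_k entries by score (assumed selection rule).
        chosen = sorted(chosen, key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, chosen)) / total
            for d in range(dim)]

keys = [[float(i), 1.0] for i in range(8)]
values = [[float(i)] for i in range(8)]
query = [1.0, 0.0]

# CSA-like: light compression (8 -> 4 entries), then sparse attention.
csa_k, csa_v = compress_kv(keys, 2), compress_kv(values, 2)
csa_out = attend(query, csa_k, csa_v, top_k=2)

# HCA-like: heavy compression (8 -> 2 entries), then dense attention.
hca_k, hca_v = compress_kv(keys, 4), compress_kv(values, 4)
hca_out = attend(query, hca_k, hca_v)
```

The point of the sketch is the trade-off: the CSA-like path keeps more cache entries but attends to only a few of them, while the HCA-like path keeps very few entries and can afford to attend to all of them.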
Infrastructure optimizations:
- Design and implement a single fused kernel for MoE modules (overlaps computation, communication, and memory access)
- Employ TileLang to balance development productivity and runtime efficiency
- Efficient batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference
- Incorporate FP4 quantization-aware training for MoE expert weights and the indexer QK path to reduce memory and computation
- Extend the autograd framework with tensor-level checkpointing for fine-grained recomputation control
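To make the FP4 quantization-aware training item concrete, here is a minimal fake-quantization sketch. It rests on assumptions not stated in these notes: the FP4 format is taken to be E2M1 (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), a single scale is shared by the whole weight group, and the name `fake_quant_fp4` is hypothetical. In a typical QAT setup the forward pass sees the rounded weights while gradients flow to the full-precision copy (a straight-through estimator); the report's actual scheme may differ.

```python
import math

# Magnitudes representable in FP4 E2M1 (assumed format).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quant_fp4(weights):
    """Round each weight to the nearest scaled FP4 value and return floats again."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    # One shared scale maps the largest magnitude onto the largest grid value.
    scale = max_abs / FP4_GRID[-1]
    quantized = []
    for w in weights:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        quantized.append(math.copysign(nearest * scale, w))
    return quantized

example = fake_quant_fp4([6.0, -3.0, 0.4, 2.4])
```

With these inputs the scale is 1.0, so 0.4 snaps to 0.5 and 2.4 snaps to 2.0, which shows how coarse the grid is and why training with the rounding in the loop helps the model adapt to it.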
The DeepSeek-V4 series achieves significantly lower inference FLOPs and a reduced KV cache size
DeepSeek-V4-Flash is trained on 32T tokens and DeepSeek-V4-Pro on 33T tokens
Post-training features a two-stage paradigm:
- Cultivation of domain-specific experts
- Unified model consolidation (via on-policy distillation)
For each target domain, an expert model is trained independently: the base model undergoes Supervised Fine-Tuning (SFT) on high-quality, domain-specific data, and Reinforcement Learning (RL) is then applied using GRPO (Group Relative Policy Optimization)
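The core idea behind GRPO is that each sampled response is scored relative to the other responses for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative advantage step, following the commonly published form (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, and the choice of population standard deviation are assumptions; the notes do not give the exact variant used for V4.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with advantages near +1 and incorrect ones near -1; if every answer in a group gets the same reward, all advantages are zero and that group contributes no policy gradient.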