Technical Report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

Introduction

The V4 series retains the Mixture-of-Experts (MoE) architecture and Multi-Token Prediction (MTP)

They introduce:

  • Hybrid Attention Mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA)
    • CSA compresses the KV caches and performs DeepSeek Sparse Attention (DSA)
    • HCA applies more aggressive compression to the KV caches but keeps dense attention
  • Manifold-Constrained Hyper-Connections to strengthen modeling capabilities
  • Muon optimizer for training
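
To make the Muon bullet concrete: Muon-style optimizers replace the raw momentum matrix with an approximately orthogonalized version before applying it to a 2-D weight. The sketch below is only an illustration under assumptions. It uses the classical cubic Newton-Schulz iteration, plain (non-Nesterov) momentum, and hypothetical values for `lr`, `beta`, and `steps`; the exact variant and hyperparameters used for DeepSeek-V4 are not given in these notes.

```python
# Sketch of the orthogonalization idea behind Muon-style optimizers.
# Assumptions (not from the report): classical cubic Newton-Schulz iteration,
# plain momentum, and the hypothetical lr / beta / steps defaults below.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=20):
    """Approximate the semi-orthogonal polar factor of a nonzero matrix m."""
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        # X <- 1.5 * X - 0.5 * (X X^T) X pushes each singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon-style update for a single 2-D weight matrix."""
    # Accumulate momentum, then step along its orthogonalized direction.
    momentum = [[beta * mo + g for mo, g in zip(rm, rg)]
                for rm, rg in zip(momentum, grad)]
    update = newton_schulz_orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(rw, ru)]
              for rw, ru in zip(weight, update)]
    return weight, momentum


# Usage: a 2x3 weight, starting from zero weight and zero momentum.
grad = [[2.0, 0.0, 1.0], [0.0, 1.0, 3.0]]
zeros = [[0.0] * 3 for _ in range(2)]
new_weight, new_momentum = muon_step(zeros, grad, zeros)
```

Because the update is (approximately) semi-orthogonal, its singular values are all close to 1, so the step size does not depend on how ill-conditioned the gradient is. Practical implementations typically use fewer iterations with tuned polynomial coefficients.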

Infrastructure optimizations:

  • Design and implement a single fused kernel for MoE modules (overlaps computation, communication, and memory access)
  • Employ TileLang to balance development productivity and runtime efficiency
  • Efficient batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference
  • Incorporate FP4 quantization-aware training for the MoE expert weights and the indexer QK path to reduce memory and computation
  • Extend the autograd framework with tensor-level checkpointing for fine-grained recomputation control
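
As an illustration of the FP4 quantization-aware training bullet, the following sketch shows "fake quantization": weights are rounded to the nearest FP4 value and immediately dequantized, so the forward pass sees the quantization error during training. It assumes the E2M1 FP4 format and one scale per block derived from the block's largest magnitude; the block size, scale format, and rounding mode used in DeepSeek-V4 are not stated in these notes, and `fake_quantize_fp4` is a hypothetical helper name.

```python
# Sketch of FP4 fake quantization as used in quantization-aware training.
# Assumptions (not from the report): E2M1 format, one absmax scale per block.

# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(block):
    """Round a block of weights to FP4 and dequantize back to float."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for w in block:
        # Pick the nearest representable magnitude, then restore sign and scale.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append((mag if w >= 0 else -mag) * scale)
    return out


print(fake_quantize_fp4([6.0, 3.0, -1.4, 0.2]))  # [6.0, 3.0, -1.5, 0.0]
```

In an autograd framework the rounding step is usually paired with a straight-through estimator, so gradients flow to the full-precision weights as if the rounding were the identity.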

The DeepSeek-V4 series achieves significantly lower inference FLOPs and a smaller KV cache

DeepSeek-V4-Flash is trained on 32T tokens and DeepSeek-V4-Pro on 33T tokens

Post-training follows a two-stage paradigm:

  • Cultivation of domain-specific experts
  • Unified model consolidation (via on-policy distillation)

For each target domain, an expert model is trained independently: the base model undergoes Supervised Fine-Tuning (SFT) on high-quality, domain-specific data, and Reinforcement Learning (RL) is then applied using Group Relative Policy Optimization (GRPO)
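
The core of GRPO is that it needs no learned value function: several responses are sampled for the same prompt, and each response's advantage is its reward normalized against the group. A minimal sketch of that normalization follows. The use of the population standard deviation and the `eps` stabilizer are assumptions, and the clipped policy-gradient objective and KL term that consume these advantages are omitted.

```python
# Sketch of the group-relative advantage used in GRPO.
# Assumptions (not from the report): population std and the eps stabilizer.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of responses sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled responses to one prompt, scored 1 (correct) or 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, so that prompt contributes no gradient signal.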