DeepSeek V4

Technical Report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

Introduction

The V4 series retains Mixture of Experts (MoE) and Multi-Token Prediction (MTP)

They introduce:

Hybrid Attention Mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA)
- CSA compresses the KV caches and performs DeepSeek Sparse Attention (DSA)
- HCA applies more aggressive compression to the KV caches but keeps dense attention
Manifold-Constrained Hyper-Connections to strengthen modeling capabilities
Muon optimizer for training
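
The difference between the two attention variants can be sketched in a few lines. This is only a toy illustration, not the method from the report: mean-pooling stands in for the actual (learned) KV compression, the top-k selection reuses the attention scores instead of a separate indexer, and the names `compress`, `attend`, `csa_like`, `hca_like`, `ratio`, and `top_k` are all hypothetical.

```python
import math

def compress(vectors, ratio):
    """Mean-pool each run of `ratio` vectors into one (assumed stand-in for learned compression)."""
    out = []
    for start in range(0, len(vectors), ratio):
        group = vectors[start:start + ratio]
        dim = len(group[0])
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out

def attend(query, keys, values, top_k=None):
    """Scaled dot-product attention for a single query.

    With top_k set, only the top_k highest-scoring entries take part (sparse);
    with top_k=None, every entry takes part (dense).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    indices = list(range(len(keys)))
    if top_k is not None:
        indices = sorted(indices, key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in indices)  # subtract the max for a stable softmax
    weights = {i: math.exp(scores[i] - peak) for i in indices}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] / total * values[i][d] for i in indices) for d in range(dim)]

def csa_like(query, keys, values, ratio=4, top_k=2):
    """Mild compression, then sparse attention over the selected compressed entries."""
    return attend(query, compress(keys, ratio), compress(values, ratio), top_k=top_k)

def hca_like(query, keys, values, ratio=16):
    """Aggressive compression, then dense attention over every compressed entry."""
    return attend(query, compress(keys, ratio), compress(values, ratio))
```

With 32 cached positions, `csa_like` keeps 8 compressed entries and attends to 2 of them, while `hca_like` keeps only 2 compressed entries and attends to both.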

Infrastructure optimizations:

Design and implement a single fused kernel for MoE modules (overlaps computation, communication, and memory access)
Employ TileLang to balance development productivity and runtime efficiency
Efficient batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference
Incorporate FP4 quantization-aware training for MoE expert weights and indexer QK path to reduce memory and computation
Extend the autograd framework with tensor-level checkpointing for fine-grained recomputation control
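
To make the FP4 quantization-aware training item more concrete, here is a minimal sketch of the "fake quantization" step such training typically runs in the forward pass: weights are rounded to the nearest FP4 value and immediately dequantized, so the model learns under the rounding error. The notes above do not state the FP4 format or scaling scheme, so the E2M1 value grid, the per-block absmax scaling, and the names `fake_quantize_fp4` and `block_size` are assumptions made for illustration.

```python
import math

# Magnitudes representable by an E2M1 FP4 element (assumed format; sign is a separate bit).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize_fp4(weights, block_size=4):
    """Round each weight to the nearest scaled FP4 value, then return it as a float.

    Each block of `block_size` weights shares one scale, chosen so that the
    block's largest magnitude maps to the largest FP4 magnitude (6.0).
    """
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / FP4_E2M1_MAGNITUDES[-1] if absmax > 0 else 1.0
        for w in block:
            # Nearest grid magnitude to |w| / scale.
            magnitude = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(abs(w) / scale - g))
            out.append(math.copysign(magnitude * scale, w))
    return out
```

In a training framework the backward pass would usually treat this rounding as the identity (a straight-through estimator), which is not shown here.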

The DeepSeek-V4 series achieves significantly lower inference FLOPs and a smaller KV cache

DeepSeek-V4-Flash is trained on 32T tokens and DeepSeek-V4-Pro on 33T tokens

Post-training follows a two-stage paradigm:

Cultivation of domain-specific experts
Unified model consolidation (via on-policy distillation)

For each target domain, an expert model is trained independently: the base model undergoes Supervised Fine-Tuning (SFT) on high-quality, domain-specific data, and Reinforcement Learning is then applied using GRPO
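
The core idea of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, instead of against a learned value function. A minimal sketch of that group-relative advantage follows; the function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration, not details taken from the report.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's group of sampled responses.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    `eps` (assumed) avoids division by zero when every reward is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct responses get an advantage close to +1 and the two incorrect ones close to -1; if every response gets the same reward, all advantages are zero and the group contributes no gradient signal.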

Ayush Garg
