one year on
DeepSeek releases DeepSeek-V2, a 236B MoE model with novel attention architecture
The open-weight model, which activates only 21B parameters per token and claims to cut KV cache by 93.3%, also comes with an OpenAI-compatible API and commercial-use terms.
DeepSeek today introduced DeepSeek-V2, a 236-billion-parameter Mixture-of-Experts language model that activates only 21B parameters per token, positioning it as a direct competitor to models like LLaMA 3 70B and Mixtral 8x22B. The code repository is MIT-licensed, and the model weights are available on Hugging Face under DeepSeek’s model license. The accompanying paper — submitted to arXiv on May 7 — details two architectural innovations: Multi-head Latent Attention, which compresses the key-value cache into a latent vector to reduce KV cache by 93.3%, and the DeepSeekMoE architecture for economical sparse computation.
Pre-trained on 8.1 trillion tokens with a 128K context window, DeepSeek-V2 posts competitive scores on English (MMLU 78.5), Chinese (C-Eval 81.7), and math (GSM8K 79.2) benchmarks. The chat version, fine-tuned with SFT and RL, scores 7.91 on Alignbench, trailing only GPT-4-1106-preview among evaluated models. DeepSeek also offers an OpenAI-compatible API through DeepSeek Platform, and says pay-as-you-go access is available at “an unbeatable price.”
The model series also supports commercial use, according to the model card.
The record
The team claims the model achieves 'top-tier performance among open-source models' while saving 42.5% of training costs and boosting throughput 5.76× over its predecessor.
One year later — open only if you can handle spoilers
DeepSeek-V2 indeed kicked off a dramatic price war among Chinese AI providers, with ByteDance, Alibaba, Baidu and Tencent all cutting API prices within weeks. The model’s latent attention mechanism later influenced several open-source projects, though DeepSeek remained a niche player in the West until the release of DeepSeek-V3 in late 2025.