one year on
Anthropic unveils Claude 3 family, claims Opus beats GPT-4 on key benchmarks
The three-tier model line — Haiku, Sonnet, and Opus — arrives with multimodal vision, a 200K context window, and Anthropic's claim that Opus beats OpenAI's GPT-4 on several benchmarks.
Anthropic today released Claude 3, a family of three models (Haiku, Sonnet, and Opus) that the company says sets new industry benchmarks across cognitive tasks. According to Anthropic, Opus, the most capable of the three, outperforms GPT-4 on undergraduate-level knowledge (MMLU), graduate-level reasoning (GPQA), and basic mathematics (GSM8K), among other benchmarks. All three models accept text and image inputs, making Claude 3 Anthropic’s first multimodal system, capable of processing photos, charts, graphs, and technical diagrams.
Opus and Sonnet are available immediately on claude.ai and via the Anthropic API, which is now generally available in 159 countries. Haiku, the fastest and cheapest model, will be available soon. Pricing ranges from $0.25 per million input tokens for Haiku to $15 per million for Opus; output tokens cost $1.25 per million for Haiku and $75 per million for Opus. At launch, the models offer a 200,000-token context window. Anthropic says all three can accept inputs exceeding 1 million tokens, and it may make that capability available to select customers.
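To put those prices in concrete terms, here is a minimal sketch of the arithmetic, assuming the per-million-token rates quoted above and a hypothetical request that fills the 200,000-token window and returns 1,000 tokens. The function name `estimate_cost` and the request size are illustrative assumptions, not part of Anthropic's API, and Sonnet is left out because its price is not quoted here.

```python
# Per-million-token prices in USD, as quoted in the article.
# Sonnet is omitted because the article does not state its price.
PRICES = {
    "haiku": {"input": 0.25, "output": 1.25},
    "opus": {"input": 15.00, "output": 75.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one request (hypothetical helper)."""
    rates = PRICES[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000  # prices are quoted per million tokens


if __name__ == "__main__":
    # Assumed workload: a full 200,000-token prompt and a 1,000-token reply.
    for name in PRICES:
        cost = estimate_cost(name, input_tokens=200_000, output_tokens=1_000)
        print(f"{name}: ${cost:.3f} per request")
```

Under those assumptions, a single full-context request works out to about five cents on Haiku and just over three dollars on Opus, a 60-fold gap that matches the ratio of the quoted rates.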
Anthropic also highlighted improved accuracy and fewer refusals compared to previous versions. On the 'needle in a haystack' test, which checks whether a model can retrieve a single sentence planted in a long document, Opus achieved near-perfect recall, surpassing 99% accuracy, and in some cases noted that the planted sentence appeared to have been artificially inserted. The company said it will soon add citations, tool use, and more advanced agentic capabilities.
TechCrunch notes that while Claude 3 represents a clear leap, it cannot search the web, may hallucinate, and is not immune to bias. Anthropic has disabled facial recognition in images, and the models struggle with low-quality images and spatial reasoning.
The record
One year later — open only if you can handle spoilers
Claude 3 Opus quickly became a go-to model for coding and complex reasoning tasks, though later models from multiple labs went on to surpass it. The 'needle in a haystack' anecdote became a frequently cited example of emergent capabilities in large models.