one year on
Anthropic ships upgraded Claude 3.5 Sonnet, 3.5 Haiku, and 'computer use' public beta
The upgraded Sonnet tops publicly available models on a key coding benchmark, while a new API capability lets Claude operate a screen and cursor as a human would.
Anthropic today released an upgraded Claude 3.5 Sonnet and a new Claude 3.5 Haiku, alongside a public beta of ‘computer use’ — an API capability that lets the model perceive screens, move cursors, click buttons, and type text as a human would.
The updated Sonnet scores 49% on SWE-bench Verified, surpassing all publicly available models, including OpenAI’s o1-preview and systems built specifically for agentic coding. It also improves on the retail and airline domains of TAU-bench, an agentic tool-use benchmark. The new Haiku matches Claude 3 Opus, Anthropic’s previous largest model, on many benchmarks while retaining low latency, and scores 40.6% on SWE-bench Verified.
Computer use, which Anthropic calls ‘a groundbreaking new capability,’ is in public beta and allows developers to direct Claude to interact with any standard software interface. On the OSWorld benchmark, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category, nearly double the next-best AI system’s 7.8%. Anthropic acknowledges the feature is still ‘experimental’ and at times ‘cumbersome and error-prone,’ but expects rapid improvement. Early adopters include Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company.
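For readers who want a feel for how this works in practice, here is a minimal sketch of the application side of a computer-use loop. It assumes the shape documented for the October 2024 beta: the model requests actions such as `screenshot`, `mouse_move`, `left_click`, and `type`, and the developer's code carries them out and reports back. The `FakeDesktop` class and `run_actions` helper are hypothetical names invented for illustration, the tool definition values should be checked against the current API reference, and no network call is made.

```python
from dataclasses import dataclass, field

# Tool definition in the shape documented for the October 2024 beta.
# Treat the exact values as assumptions; verify against the API reference.
COMPUTER_TOOL = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real screen, for illustration only."""

    cursor: tuple = (0, 0)
    typed: list = field(default_factory=list)
    clicks: list = field(default_factory=list)

    def execute(self, action: dict) -> str:
        """Carry out one model-requested action and describe the result."""
        kind = action["action"]
        if kind == "screenshot":
            # A real implementation would return an encoded image here.
            return f"screenshot with cursor at {self.cursor}"
        if kind == "mouse_move":
            self.cursor = tuple(action["coordinate"])
            return "moved"
        if kind == "left_click":
            self.clicks.append(self.cursor)
            return "clicked"
        if kind == "type":
            self.typed.append(action["text"])
            return "typed"
        raise ValueError(f"unsupported action: {kind}")


def run_actions(desktop: FakeDesktop, actions: list) -> list:
    """Apply each requested action in order and collect the results."""
    return [desktop.execute(action) for action in actions]


if __name__ == "__main__":
    desktop = FakeDesktop()
    requested = [
        {"action": "mouse_move", "coordinate": [100, 200]},
        {"action": "left_click"},
        {"action": "type", "text": "hello"},
        {"action": "screenshot"},
    ]
    for result in run_actions(desktop, requested):
        print(result)
```

In a real integration, each result would be sent back to the model as a tool result, and the model would decide on the next action from the new screenshot. That round trip per step is one reason the feature can feel slow in its early form.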
Anthropic says it has developed safety classifiers to detect misuse of computer use. The upgraded Claude 3.5 Sonnet is available now on Anthropic’s API, Amazon Bedrock, and Google Cloud’s Vertex AI; Claude 3.5 Haiku will follow later this month.
Reported reasoning improvements of up to 10% across DevSecOps tasks, with no added latency.
Reported substantial improvements in coding, planning, and problem-solving compared to the previous version.
Noted Claude 3.5 Sonnet outperformed every model they had tested before for automating web-based workflows.
One year later — open only if you can handle spoilers
Computer use quickly became a standard API feature across major labs, though early reliability issues led to strict safety guardrails. Anthropic's lead in agentic benchmarks proved short-lived as competitors matched or exceeded SWE-bench scores within months.