The news, 365 days behind, on purpose
Delayed live · replaying 2025

One Year Ago.AI

Remember how fast this is.

22 OCT 2024 · replayed one year on
Model launch · Anthropic

Anthropic ships upgraded Claude 3.5 Sonnet, 3.5 Haiku, and 'computer use' public beta

The new models post top scores on coding benchmarks, while a new API capability lets Claude operate screens and cursors like a human operator.

Anthropic today released an upgraded Claude 3.5 Sonnet and a new Claude 3.5 Haiku, alongside a public beta of ‘computer use’ — an API capability that lets the model perceive screens, move cursors, click buttons, and type text as a human would.

The updated Sonnet scores 49% on SWE-bench Verified, surpassing all publicly available models, including OpenAI’s o1-preview and specialized coding agents. It also improves on the retail and airline tool-use tasks in TAU-bench. The new Haiku matches Anthropic’s previous largest model, Claude 3 Opus, on many benchmarks while retaining low latency, and scores 40.6% on SWE-bench Verified.

Computer use, which Anthropic describes as ‘a groundbreaking new capability,’ is in public beta and lets developers direct Claude to interact with any standard software interface. On the OSWorld benchmark, Claude scored 14.9% in the screenshot-only category, nearly double the next-best AI system’s 7.8%. Anthropic acknowledges the feature is still ‘experimental’ and at times ‘cumbersome and error-prone,’ but expects rapid improvement. Early adopters include Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company.
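To make the perceive-then-act cycle concrete, here is a minimal Python sketch of the kind of loop a developer might wrap around such a capability. It is an illustration under stated assumptions, not Anthropic's API: `FakeScreen`, `apply_action`, `run_agent`, and `scripted_policy` are hypothetical names, the action vocabulary (`mouse_move`, `left_click`, `type`, `screenshot`) is illustrative, and a scripted function stands in for the model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeScreen:
    """Hypothetical in-memory stand-in for a real desktop."""
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: str = ""

    def screenshot(self) -> str:
        # A real system would return pixels; a text summary is enough here.
        return f"cursor={self.cursor} typed={self.typed!r}"


def apply_action(screen: FakeScreen, action: dict) -> str:
    """Execute one action and return the new observation."""
    kind = action["action"]
    if kind == "mouse_move":
        screen.cursor = tuple(action["coordinate"])
    elif kind == "left_click":
        screen.clicks.append(screen.cursor)
    elif kind == "type":
        screen.typed += action["text"]
    elif kind != "screenshot":
        raise ValueError(f"unsupported action: {kind}")
    return screen.screenshot()


def run_agent(policy: Callable[[str], Optional[dict]],
              screen: FakeScreen, max_steps: int = 10) -> int:
    """Observe, ask the policy for an action, act; stop when it returns None."""
    observation = screen.screenshot()
    for step in range(max_steps):
        action = policy(observation)
        if action is None:
            return step  # number of actions executed
        observation = apply_action(screen, action)
    return max_steps


def scripted_policy() -> Callable[[str], Optional[dict]]:
    """Stand-in for the model: replays a fixed list of actions."""
    script = iter([
        {"action": "mouse_move", "coordinate": [120, 40]},
        {"action": "left_click"},
        {"action": "type", "text": "hello"},
    ])
    return lambda _observation: next(script, None)


if __name__ == "__main__":
    demo = FakeScreen()
    print(run_agent(scripted_policy(), demo), demo.screenshot())
```

The `max_steps` cap is a design choice that suits a capability described as error-prone: the loop stops even if the policy never signals completion.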

Anthropic says it has developed classifiers to detect misuse of computer use. The upgraded Sonnet is available now on Anthropic’s API, Amazon Bedrock, and Google Cloud’s Vertex AI, with Haiku to follow later this month.

GitLab

Reported reasoning improvements of up to 10% across DevSecOps tasks, with no added latency.

Cognition

Reported substantial improvements in coding, planning, and problem-solving compared to the previous version.

The Browser Company

Noted that Claude 3.5 Sonnet outperformed every model they had previously tested at automating web-based workflows.

One year later — open only if you can handle spoilers

Computer use quickly became a standard API feature across major labs, though early reliability issues led to strict safety guardrails. Anthropic's lead in agentic benchmarks proved short-lived as competitors matched or exceeded SWE-bench scores within months.
