one year on
OpenAI unveils o3 reasoning models, claiming step change on path to AGI
On the final day of its 12-day Shipmas event, OpenAI announced o3 and o3-mini — skipping o2 over a potential trademark conflict — with scores on key benchmarks that François Chollet calls a genuine breakthrough in novel-task adaptation.
OpenAI capped its 12-day Shipmas event with the unveiling of o3, a new family of reasoning models that the company says approaches artificial general intelligence in certain controlled conditions. The models — o3 and a smaller variant, o3-mini — are not yet publicly available; OpenAI is beginning safety testing and red-teaming today, with plans to launch o3-mini by late January and o3 sometime after.
On the ARC-AGI benchmark, designed to measure an AI’s ability to adapt to novel tasks, o3 scored 75.7% in a low-compute configuration and 87.5% in a high-compute mode that costs roughly $4,560 per task. On FrontierMath, a notoriously difficult math benchmark, o3 solved 25.2% of problems, dwarfing the previous state-of-the-art of 2%. OpenAI also reported that o3 scored 96.7% on the 2024 American Invitational Mathematics Examination (AIME), missing only one question, and outperformed o1 by 22.8 percentage points on SWE-Bench Verified.
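For readers who want to sanity-check the figures, here is a minimal sketch of the arithmetic behind two of them. It assumes the AIME score covers both 2024 papers (AIME I and II, 15 questions each); the function and variable names are illustrative and do not come from OpenAI.

```python
def percent_correct(correct: int, total: int) -> float:
    """Return the share of questions answered correctly, as a percentage."""
    return 100.0 * correct / total


# Assumption: the reported score spans both 2024 papers, 15 questions each.
AIME_TOTAL = 30
aime_score = percent_correct(AIME_TOTAL - 1, AIME_TOTAL)  # one question missed

# Relative jump on FrontierMath: 25.2% solved versus the previous best of 2%.
frontiermath_ratio = 25.2 / 2.0

print(f"AIME: {aime_score:.1f}%")
print(f"FrontierMath: {frontiermath_ratio:.1f}x the previous best")
```

Under that assumption, 29 correct answers out of 30 rounds to the reported 96.7%, and the FrontierMath result is a little over twelve times the previous best.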
ARC-AGI co-creator François Chollet called the results “a genuine breakthrough” that demands serious scientific attention. But he stressed that o3 still fails on simple tasks and is not AGI. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” he wrote. The ARC Prize Foundation plans to release a harder benchmark, ARC-AGI-2, in 2025 that early testing suggests will drop o3’s score below 30%.
According to The Information, OpenAI skipped the name o2 to avoid a potential conflict with British telecom provider O2, and CEO Sam Altman appeared to confirm this during a livestream this morning. The company is also introducing “deliberative alignment,” a technique to align o3 models with safety principles. Researchers can sign up for a preview of o3-mini starting today. The announcement comes amid a wave of reasoning models from rivals like DeepSeek and Alibaba. It also coincides with the departure of Alec Radford, the lead author of the academic paper that kicked off OpenAI’s GPT series, who announced this week that he is leaving to pursue independent research.
The record
Said o3 represents a significant breakthrough and a qualitative shift in AI capabilities, but cautioned that it is not AGI and still fails on very easy tasks.
Called o3 a breakthrough with a step function improvement on hardest benchmarks, and announced safety testing and red teaming are beginning now.
Highlighted the rapid progress from o1 to o3, saying there is every reason to believe the trajectory will continue.
One year later — open only if you can handle spoilers
By mid-2026, o3 had not been widely released in the form tested on ARC-AGI; the o3 that OpenAI shipped publicly in April 2025 was a different version. Chollet’s ARC-AGI-2 benchmark did launch in 2025 and, as predicted, proved extremely challenging for all models.