On December 20th, OpenAI revealed that its o3 model scored 88% on the demanding ARC-AGI benchmark, far surpassing the previous record of 55% and roughly matching average human performance. The result raises the possibility that artificial general intelligence (AGI) is closer than many imagined. However, the scientific community remains skeptical about the true extent of this progress.
The ARC-AGI benchmark evaluates the generalization capability of AI systems by measuring how many examples they need to adapt to new situations. Unlike models such as GPT-4, which rely on millions of data points for common tasks, o3 demonstrates remarkable “sample efficiency”: it can learn from very few examples, a crucial skill for solving new and uncommon problems and a capacity widely considered a fundamental element of intelligence.
The ARC-AGI tests present visual problems in the form of grids, where the AI must deduce patterns that transform an initial grid into a final one. The model has to generalize rules from only three examples to apply them correctly in a fourth case, in a manner similar to the IQ tests used in schools, but at a much more complex and abstract level.
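To make the task format concrete, here is a minimal Python sketch of an ARC-style puzzle. It is a toy illustration only and says nothing about how o3 works: the grids, the three candidate transformations, and the function names (infer_rule, solve) are all assumptions invented for this example, and real ARC-AGI tasks involve far richer rules than simple flips. The idea is simply to search for a rule that explains all three example pairs and then apply it to a fourth grid.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # each cell is a small integer standing for a color


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror every row left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical, deliberately tiny hypothesis space.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}


def infer_rule(examples: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule consistent with every example pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None


def solve(examples: List[Tuple[Grid, Grid]], test_input: Grid) -> Optional[Grid]:
    """Infer a rule from the examples and apply it to the unseen grid."""
    name = infer_rule(examples)
    return CANDIDATES[name](test_input) if name is not None else None


# Three demonstration pairs (made up for this sketch), then a fourth case.
TRAIN: List[Tuple[Grid, Grid]] = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
    ([[0, 5], [0, 0]], [[5, 0], [0, 0]]),
]
TEST_INPUT: Grid = [[7, 0, 0], [0, 8, 0]]

if __name__ == "__main__":
    print(infer_rule(TRAIN))
    print(solve(TRAIN, TEST_INPUT))
```

The difficulty of the real benchmark lies in the fact that the space of possible rules is not given in advance as it is here; the solver has to come up with the rule itself from the handful of examples.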

Although few technical details about how o3 works have been released, it is speculated that its success lies in finding “weak rules”, that is, simple and general rules that maximize its adaptability to new situations. Some researchers compare this strategy to the method used by AlphaGo, the Google DeepMind system that defeated the world champion of Go, an ancient board game that demands subtle, intuitive skill.
For now, o3 remains a mystery. OpenAI has shared few details beyond initial tests and private presentations. Only once the model is available to the public will it be possible to assess its economic impact and its real potential to transform entire sectors. If it proves able to adapt as well as an average human, o3 could mark the beginning of a new era in artificial intelligence.