AI Advances: Less Significant Than They Appear?


A recent study by researchers at the Universidad Nacional de Educación a Distancia (UNED) in Spain reaches an unsettling conclusion about the most advanced artificial intelligence (AI) models, among them OpenAI o3-mini and DeepSeek-R1. According to the report, the capabilities of these systems depend far more on memorization than on genuine reasoning, which casts serious doubt on their real-world effectiveness in situations that demand complex judgment.

The technology community has trained its attention on developing AI systems that demonstrate true reasoning capabilities. Models such as those mentioned have been trained to respond to queries using “private chains of thought” — a procedure that allows them to deliberate before generating a response. Yet researchers at UNED warn that this method may be considerably less sophisticated than the industry claims.

The industry evaluates and quantifies the performance of AI models through standardized tests known as benchmarks. The validity of these metrics, however, is now being called into question. Julio Gonzalo, one of the study’s authors, observes that “when competitive pressure is intense, far too much attention is paid to benchmarks,” suggesting that companies may be manipulating results in their favor and raising legitimate concerns about how reliable those results are.

An Innovative Analysis Puts Benchmark Effectiveness to the Test

To probe the true effectiveness of these benchmarks, the UNED research team designed a methodologically innovative experiment. They modified traditional multiple-choice tests by introducing a generic response option, “None of the above,” so that models have to reason about each option rather than simply identify answers based on memorized patterns. The outcome bears directly on how much genuine capability these AI systems have.
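The modification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UNED team's actual procedure: it assumes that the correct option is removed and the generic option is appended in its place, so that "None of the above" becomes the right answer. The names `Item`, `NONE_OPTION`, and `with_none_option` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: the generic option uses this exact wording.
NONE_OPTION = "None of the above"


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical structure)."""
    question: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def with_none_option(item: Item) -> Item:
    """Return a copy of `item` whose correct answer is the generic option.

    Assumption: the original correct option is dropped, the distractors
    are kept in order, and NONE_OPTION is appended as the last choice.
    """
    distractors = tuple(
        text for i, text in enumerate(item.options) if i != item.answer_index
    )
    options = distractors + (NONE_OPTION,)
    return Item(item.question, options, len(options) - 1)


if __name__ == "__main__":
    original = Item("What is 2 + 2?", ("3", "4", "5", "22"), 1)
    modified = with_none_option(original)
    print(modified.options)
    print(modified.answer_index)
```

A model that has memorized the pairing of this question with "4" no longer finds that string among the options. To answer correctly it has to rule out each remaining choice, which is the behavior the modified test is meant to probe.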

The results were striking. The majority of large language models (LLMs) evaluated — including GPT-4 and Claude-3.5 — showed a marked decline in accuracy, with an average drop of 57 percent. These figures suggest that, despite being marketed as highly advanced, the actual performance of these models is materially overstated and fails to align with their claimed reasoning capabilities.
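The article does not say whether the 57 percent figure is a relative decline or a difference in percentage points. The sketch below shows how the two readings differ. The function names are hypothetical and the scores in the usage example are made-up toy values, not results from the study.

```python
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted option matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(original: float, modified: float) -> Tuple[float, float]:
    """Return (absolute drop, relative drop) between two accuracy scores.

    The absolute drop is the plain difference between the two scores.
    The relative drop expresses that difference as a share of the
    original score.
    """
    absolute = original - modified
    relative = absolute / original if original > 0 else 0.0
    return absolute, relative


if __name__ == "__main__":
    # Toy values for illustration only.
    absolute, relative = accuracy_drop(original=0.8, modified=0.4)
    print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.2f}")
```

With these toy scores, the same decline reads as 40 percentage points or as a 50 percent relative drop, so a headline figure is only comparable across models when the definition is stated.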

The study further revealed that language itself influences model effectiveness. Tests conducted in English continue to yield the strongest results, while performance deteriorates in Spanish and degrades further still in less common languages. This limitation is most pronounced in smaller models, pointing to a fundamental lack of adaptability across linguistic contexts.

Despite these limitations, developers across the sector are actively pursuing new techniques to sharpen the reasoning capabilities of their AI models. Two models stood out in the study. OpenAI o3-mini was the only model to surpass one of the benchmarks, although it too lost accuracy under the modified test conditions. DeepSeek-R1-70b, meanwhile, recorded the smallest performance decline across the adapted evaluations.

In sum, the UNED study calls for reflection on the true capabilities of artificial intelligence systems and raises fundamental questions about the reliability of the benchmarks the industry relies upon. While developers continue their pursuit of more capable reasoning models, the findings suggest that memorization may lie at the core of these technologies. That possibility invites the broader technology community to reassess the direction in which AI is being developed, and to demand greater transparency and rigor in the metrics used to measure its progress.
