Why synthetic benchmarks are holding AI back

Ooak Data Research · March 18, 2026 · 10 min read

AI models have never looked better on paper. GPT-4, Claude, Gemini — they top leaderboard after leaderboard. MMLU scores climb past 90%. HumanEval gets saturated. New benchmarks appear, and within months, they too are nearly solved.

And yet, when these same models are deployed into real company workflows — reading messy documents, navigating CRM systems, coordinating across Slack and email — they break in ways that no benchmark predicted.

This is not a coincidence. It is a structural problem with how we evaluate AI.

The saturation problem

Benchmarks are being solved faster than we can create them. The 2025 Stanford HAI AI Index Report documents that traditional benchmarks like MMLU, GSM8K, and HumanEval have effectively reached saturation — AI appears "nearly perfect" on them, yet still produces incorrect answers and unwanted outputs in practice. New benchmarks introduced in 2023, including MMMU, GPQA, and SWE-bench, saw performance jump by 18.8, 48.9, and 67.3 percentage points within a single year, suggesting they too will soon become ceiling tests rather than discriminative measures.

A recent large-scale study by Akhtar et al. (2025), published at ICML, systematically analyzed 60 AI benchmarks and found that nearly half show high or very high saturation. Among older benchmarks (those over 60 months old), the saturation rate reaches 54.5%. Crucially, neither private test sets nor open-ended question formats were sufficient to prevent saturation — the mechanisms we assumed would keep benchmarks useful are failing.

When a benchmark is saturated, it stops telling us anything meaningful about the difference between models, or about whether a model is ready for production. It becomes a checkbox, not a test.

What gets measured gets gamed

There is a deeper problem than saturation: optimization pressure. Charles Goodhart's observation that "when a measure becomes a target, it ceases to be a good measure" applies with particular force to AI evaluation.

Rachel Thomas and David Uminsky (2022) argued this point in Patterns, writing that "current AI approaches have weaponized Goodhart's law by centering on optimizing a particular measure as a target." They document how over-optimizing metrics leads to manipulation, short-termism, and outcomes that look good on the metric while missing the underlying objective.

This is not theoretical. Gao, Schulman, and Hilton at OpenAI (2022) empirically measured Goodhart's Law in the context of RLHF, showing that optimizing against imperfect proxy reward models causes the true objective to degrade after a point — and that this relationship follows predictable scaling laws. The more aggressively you optimize for a proxy metric, the more reliably you degrade real performance.
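
The dynamic is easy to reproduce in miniature. The simulation below is not Gao, Schulman, and Hilton's setup; it is a toy sketch under assumed reward functions, in which a proxy score rewards a trait that the true objective eventually penalizes. Selecting the best of n candidates by the proxy keeps pushing the proxy score up while the true score peaks and then falls.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(helpful, verbose):
    # Assumed "true" objective: helpfulness, with a growing penalty for verbosity.
    return helpful - 0.3 * verbose**2

def proxy_reward(helpful, verbose):
    # Assumed proxy: a flawed reward model that rewards verbosity linearly.
    return helpful + verbose

def best_of_n(n, trials=2000):
    """Pick the best of n candidates by the proxy; return mean proxy and true scores."""
    helpful = rng.standard_normal((trials, n))
    verbose = rng.standard_normal((trials, n))
    pick = np.argmax(proxy_reward(helpful, verbose), axis=1)
    rows = np.arange(trials)
    h, v = helpful[rows, pick], verbose[rows, pick]
    return proxy_reward(h, v).mean(), true_reward(h, v).mean()

for n in [1, 4, 16, 64, 256, 1024, 4096]:
    proxy, true = best_of_n(n)
    print(f"n={n:>4}  proxy={proxy:5.2f}  true={true:5.2f}")
# The proxy score rises monotonically with n; the true score improves at first,
# then degrades as optimization pressure against the proxy increases.
```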

Hsia et al. (2023) demonstrated this in NLP explanation benchmarks specifically, showing that standard metrics like ERASER comprehensiveness and EVAL-X scores can be inflated dramatically without altering model predictions or explanations on real test inputs. The scores go up. The model's actual behavior does not change.

The contamination problem

Even setting aside optimization pressure, there is a more direct problem: the models have seen the test data.

Deng et al. (2023), in a paper presented at NAACL 2024, developed a protocol to detect benchmark contamination in proprietary LLMs by testing whether models could guess missing answer options. The results: ChatGPT and GPT-4 achieved exact match rates of 52% and 57%, respectively, in guessing missing options on MMLU items — strong evidence that these models have memorized substantial portions of one of the most widely used benchmarks in the field.
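
The probe itself is simple to sketch. The snippet below is a simplified illustration of the idea, not Deng et al.'s exact protocol; the prompt wording, the `ask_model` callable, and the string normalization are placeholder assumptions. One answer option is hidden from a multiple-choice item, and the model is asked to reproduce it verbatim, something it should only manage reliably if it has seen the item before.

```python
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation and whitespace before comparing strings."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def build_probe(question: str, options: list[str], hidden_index: int) -> str:
    """Hide one answer option and ask the model to fill it in."""
    shown = [f"{chr(65 + i)}. {opt if i != hidden_index else '[MASKED]'}"
             for i, opt in enumerate(options)]
    return (
        "The following multiple-choice question has one option replaced by [MASKED].\n"
        "Reply with the exact text of the masked option and nothing else.\n\n"
        f"{question}\n" + "\n".join(shown)
    )

def exact_match_rate(items, ask_model) -> float:
    """Fraction of items where the model reproduces the hidden option verbatim.

    `items` holds (question, options, hidden_index) tuples; `ask_model` is any
    callable mapping a prompt string to the model's reply. A high rate suggests
    the items were memorized during pre-training.
    """
    hits = 0
    for question, options, hidden_index in items:
        reply = ask_model(build_probe(question, options, hidden_index))
        hits += normalize(reply) == normalize(options[hidden_index])
    return hits / len(items)
```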

A comprehensive survey by Xu et al. (2024) found contamination rates ranging from 1% to 45% across tested LLMs, categorizing contamination into four severity levels. Their conclusion is blunt: it is "impracticable to fully remove the risks associated with contamination" due to the scale of pre-training data and the proliferation of AI-generated content that may itself contain benchmark questions.

When a model scores 90% on a benchmark it has partially memorized, that score tells us about memorization, not capability. And since we often cannot determine the extent of contamination in proprietary models, we cannot know how much of any given score is genuine.

The real-world gap

If benchmark scores were just noisy but directionally correct, the problem would be manageable. But growing evidence suggests the gap between benchmark performance and real-world deployment is not just noise — it is systematic and large.

A striking example comes from clinical AI. Gong et al. (2025), in a systematic review published in the Journal of Medical Internet Research, found that diagnostic accuracy drops from 82% on traditional case vignettes to 62.7% on multi-turn patient dialogues — a 19.3 percentage-point decrease. And only 5% of the 761 LLM evaluation studies they reviewed assessed performance on real patient care data.

As Kyle Wiggers noted in TechCrunch (2024), the most commonly used benchmarks "haven't been adapted to reflect how models are used today." Researchers at the Allen Institute for AI have described an "evaluation crisis" — benchmarks like GPQA test PhD-level science while most real users write emails and cover letters. The mismatch between what we test and what we deploy is not a gap; it is a chasm.

McIntosh et al. (2024), in a critical assessment of 23 state-of-the-art LLM benchmarks published in IEEE Transactions on Artificial Intelligence, found significant limitations across the board: biases in question construction, inability to measure genuine reasoning, implementation inconsistencies across environments, and systematic failure to account for cultural norms.

What would better evaluation look like?

If static benchmarks are failing us, what should replace them?

One promising direction is human evaluation at scale. The Chatbot Arena platform (Chiang et al., 2024), presented at ICML, demonstrates that crowdsourced pairwise comparison achieves 89.1% agreement with expert raters and provides far better separability between models than standard benchmarks. With over 240,000 votes from 90,000 users across 100+ languages, it suggests that human preference evaluation can scale and that it is more informative than static tests.
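
Pairwise votes only become a leaderboard once they are aggregated into ratings. The sketch below fits a plain Bradley-Terry model with the classic minorization-maximization update, the general family behind this kind of preference aggregation; it is an illustrative reconstruction, not the Arena's production pipeline, and it ignores ties, sampling strategy, and confidence intervals.

```python
import numpy as np

def bradley_terry(models: list[str], votes: list[tuple[str, str]], iters: int = 200):
    """Fit Bradley-Terry strengths from pairwise wins.

    `votes` is a list of (winner, loser) pairs. Returns a dict mapping each
    model to a strength s, where P(i beats j) = s_i / (s_i + s_j).
    """
    idx = {m: k for k, m in enumerate(models)}
    n = len(models)
    wins = np.zeros((n, n))               # wins[i, j] = times i beat j
    for winner, loser in votes:
        wins[idx[winner], idx[loser]] += 1

    games = wins + wins.T                 # total comparisons between each pair
    s = np.ones(n)
    for _ in range(iters):                # MM (Zermelo) update
        expected = games / (s[:, None] + s[None, :])
        np.fill_diagonal(expected, 0.0)
        s = wins.sum(axis=1) / expected.sum(axis=1)
        s /= s.sum()                      # fix the scale; only ratios matter
    return {m: s[idx[m]] for m in models}

# Toy usage with hypothetical model names and made-up vote counts.
votes = [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3 \
      + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 2 \
      + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 1
print(bradley_terry(["model_a", "model_b", "model_c"], votes))
```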

But human evaluation alone is not sufficient for agents — systems that take actions, use tools, and complete multi-step workflows. For these, we need evaluation environments that replicate the conditions of deployment: real tools, real data patterns, real organizational complexity.

This means moving from static question-answer pairs to dynamic environments where agents interact with realistic systems — reading documents with real formatting quirks, navigating CRM interfaces, coordinating across communication channels. It means building evaluation infrastructure from real company data, anonymized but structurally authentic, so that the test reflects the mess an agent will actually face.
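
In code, the important part is the shape of the interface rather than any particular framework: tasks are scored by verifying the end state of the environment, not by matching a reference string, and the agent interacts through the same tools it would use in production. The sketch below is a hypothetical minimal harness; the `Task`, `Environment`, and `evaluate` names and the `agent.act` interface are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Protocol

@dataclass
class Task:
    """One evaluation task: an instruction plus a check on the final environment state."""
    instruction: str
    # Scores the end state of the world, not the agent's wording: 1.0 means the
    # CRM record was updated, the email was sent, the document was filed.
    check: Callable[["Environment"], float]

class Environment(Protocol):
    """A stateful workspace (documents, CRM, email) the agent acts on via tools."""
    def reset(self, task: Task) -> dict[str, Any]: ...                       # initial observation
    def step(self, tool: str, args: dict[str, Any]) -> dict[str, Any]: ...   # tool call -> observation

def evaluate(agent, env: Environment, tasks: list[Task], max_steps: int = 30) -> float:
    """Run each task to completion (or a step budget) and average the outcome scores."""
    scores = []
    for task in tasks:
        observation = env.reset(task)
        for _ in range(max_steps):
            action = agent.act(task.instruction, observation)
            if action is None:            # agent declares it is done
                break
            observation = env.step(action["tool"], action["args"])
        scores.append(task.check(env))
    return sum(scores) / len(scores)
```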

The goal is not to make evaluation harder. It is to make it honest.

The cost of comfortable metrics

The current benchmark paradigm is not just inadequate — it is actively harmful. It creates false confidence. It directs research effort toward optimizing scores rather than solving real problems. It allows models to be marketed as "state-of-the-art" on metrics that bear little relation to the tasks organizations actually need performed.

The 2025 Stanford HAI report puts it plainly: AI appears "nearly perfect" on existing benchmarks yet "still gives incorrect answers, produces unwanted outputs, and is difficult to interact with." This gap between measured performance and experienced performance is not closing. If anything, as models get better at gaming benchmarks, it is widening.

The AI industry needs evaluation infrastructure built on real-world data, grounded in actual deployment conditions, and designed to expose failures rather than confirm successes.


References

  • Akhtar, M., Reuel, A., Soni, P., et al. (2025). When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation. ICML 2025.
  • Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML 2024.
  • Deng, C., Zhao, Y., Tang, X., Gerstein, M., & Cohan, A. (2023). Investigating Data Contamination in Modern Benchmarks for Large Language Models. NAACL 2024.
  • Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. ICML 2023.
  • Gong, E. J., Bang, C. S., Lee, J. J., & Baik, G. H. (2025). Knowledge-Practice Performance Gap in Clinical Large Language Models. JMIR.
  • Hsia, J., Pruthi, D., Singh, A., & Lipton, Z. C. (2023). Goodhart's Law Applies to NLP's Explanation Benchmarks. EACL 2024.
  • McIntosh, T. R., et al. (2024). Inadequacies of Large Language Model Benchmarks in the Era of Generative AI. IEEE Trans. on AI.
  • Stanford HAI. (2025). The 2025 AI Index Report.
  • Thomas, R. & Uminsky, D. (2022). Reliance on metrics is a fundamental challenge for AI. Patterns, 3(5).
  • Xu, C., Guan, S., Greene, D., & Kechadi, M-T. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv preprint.