<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Ooak Data Research</title>
    <link>https://ooakdata.com/research</link>
    <description>Research from Ooak Data on evaluation methodology, dataset design, and the gap between benchmark performance and real-world capability.</description>
    <language>en</language>
    <atom:link href="https://ooakdata.com/rss.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Why synthetic benchmarks are holding AI back</title>
      <link>https://ooakdata.com/research/why-synthetic-benchmarks-are-holding-ai-back</link>
      <guid isPermaLink="true">https://ooakdata.com/research/why-synthetic-benchmarks-are-holding-ai-back</guid>
      <description>The AI industry over-indexes on synthetic benchmarks that don&apos;t reflect real-world complexity. Models that ace leaderboards routinely fail in production.</description>
      <content:encoded><![CDATA[
<p>AI models have never looked better on paper. GPT-4, Claude, Gemini — they top leaderboard after leaderboard. MMLU scores climb past 90%. HumanEval gets saturated. New benchmarks appear, and within months, they too are nearly solved.</p>

<p>And yet, when these same models are deployed into real company workflows — reading messy documents, navigating CRM systems, coordinating across Slack and email — they break in ways that no benchmark predicted.</p>

<p>This is not a coincidence. It is a structural problem with how we evaluate AI.</p>

<h2>The saturation problem</h2>

<p>Benchmarks are being solved faster than we can create them. The <a href="https://hai.stanford.edu/ai-index/2025-ai-index-report" target="_blank" rel="noopener">2025 Stanford HAI AI Index Report</a> documents that traditional benchmarks like MMLU, GSM8K, and HumanEval have effectively reached saturation — AI appears "nearly perfect" on them, yet still produces incorrect answers and unwanted outputs in practice. New benchmarks introduced in 2023, including MMMU, GPQA, and SWE-bench, saw performance jump by 18.8, 48.9, and 67.3 percentage points within a single year, suggesting they too will soon become ceiling tests rather than discriminative measures.</p>

<p>A recent large-scale study by <a href="https://arxiv.org/abs/2602.16763" target="_blank" rel="noopener">Akhtar et al. (2025)</a>, published at ICML, systematically analyzed 60 AI benchmarks and found that nearly half show high or very high saturation. Among older benchmarks (those over 60 months old), the saturation rate reaches 54.5%. Crucially, neither private test sets nor open-ended question formats were sufficient to prevent saturation — the mechanisms we assumed would keep benchmarks useful are failing.</p>

<p>When a benchmark is saturated, it stops telling us anything meaningful about the difference between models, or about whether a model is ready for production. It becomes a checkbox, not a test.</p>

<h2>What gets measured gets gamed</h2>

<p>There is a deeper problem than saturation: optimization pressure. Charles Goodhart's observation that "when a measure becomes a target, it ceases to be a good measure" applies with particular force to AI evaluation.</p>

<p><a href="https://www.cell.com/patterns/fulltext/S2666-3899(22)00056-3" target="_blank" rel="noopener">Rachel Thomas and David Uminsky (2022)</a> argued this point in <em>Patterns</em>, writing that "current AI approaches have weaponized Goodhart's law by centering on optimizing a particular measure as a target." They document how over-optimizing metrics leads to manipulation, short-termism, and outcomes that look good on the metric while missing the underlying objective.</p>

<p>This is not theoretical. <a href="https://arxiv.org/abs/2210.10760" target="_blank" rel="noopener">Gao, Schulman, and Hilton at OpenAI (2022)</a> empirically measured Goodhart's Law in the context of RLHF, showing that optimizing against imperfect proxy reward models causes the true objective to degrade after a point — and that this relationship follows predictable scaling laws. The more aggressively you optimize for a proxy metric, the more reliably you degrade real performance.</p>

<p><a href="https://arxiv.org/abs/2308.14272" target="_blank" rel="noopener">Hsia et al. (2023)</a> demonstrated this in NLP explanation benchmarks specifically, showing that standard metrics like ERASER comprehensiveness and EVAL-X scores can be inflated dramatically without altering model predictions or explanations on real test inputs. The scores go up. The model's actual behavior does not change.</p>

<h2>The contamination problem</h2>

<p>Even setting aside optimization pressure, there is a more direct problem: the models have seen the test data.</p>

<p><a href="https://arxiv.org/abs/2311.09783" target="_blank" rel="noopener">Deng et al. (2023)</a>, in a paper presented at NAACL 2024, developed a protocol to detect benchmark contamination in proprietary LLMs by testing whether models could guess missing answer options. The results: ChatGPT and GPT-4 achieved exact match rates of 52% and 57%, respectively, in guessing missing options on MMLU items — strong evidence that these models have memorized substantial portions of one of the most widely used benchmarks in the field.</p>
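<p>The probing idea is simple enough to sketch. The prompt template and the stub model below are our simplification of the protocol, not the paper's exact setup:</p>

```python
def mask_option(question: str, options: dict, held_out: str):
    """Hide one answer option and ask the model to reconstruct it.

    If the model reproduces the hidden option verbatim, the item was
    likely memorized from pre-training data rather than reasoned about.
    """
    shown = {k: (v if k != held_out else "____") for k, v in options.items()}
    lines = [question] + [f"{k}. {v}" for k, v in sorted(shown.items())]
    prompt = "\n".join(lines) + f"\nFill in the blank for option {held_out}."
    return prompt, options[held_out]

def exact_match_rate(model_fn, items):
    """Fraction of items where the model reproduces the hidden option exactly."""
    hits = 0
    for question, options, held_out in items:
        prompt, answer = mask_option(question, options, held_out)
        if model_fn(prompt).strip().lower() == answer.strip().lower():
            hits += 1
    return hits / len(items)

# Toy demo: a stub "model" that has memorized one of the two items.
memorized = {"What is the capital of France?": "Paris"}
def stub_model(prompt: str) -> str:
    for q, a in memorized.items():
        if q in prompt:
            return a
    return "unknown"

items = [
    ("What is the capital of France?", {"A": "Paris", "B": "Lyon"}, "A"),
    ("What is 2 + 2?", {"A": "3", "B": "4"}, "B"),
]
print(exact_match_rate(stub_model, items))  # stub reproduces 1 of 2 -> 0.5
```

<p>Against a real API, a high exact-match rate on a benchmark's held-out options is hard to explain by anything other than memorization.</p>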

<p>A comprehensive survey by <a href="https://arxiv.org/abs/2406.04244" target="_blank" rel="noopener">Xu et al. (2024)</a> found contamination rates ranging from 1% to 45% across tested LLMs, categorizing contamination into four severity levels. Their conclusion is blunt: it is "impracticable to fully remove the risks associated with contamination" due to the scale of pre-training data and the proliferation of AI-generated content that may itself contain benchmark questions.</p>

<p>When a model scores 90% on a benchmark it has partially memorized, that score tells us about memorization, not capability. And since we often cannot determine the extent of contamination in proprietary models, we cannot know how much of any given score is genuine.</p>

<h2>The real-world gap</h2>

<p>If benchmark scores were just noisy but directionally correct, the problem would be manageable. But growing evidence suggests the gap between benchmark performance and real-world deployment is not just noise — it is systematic and large.</p>

<p>A striking example comes from clinical AI. <a href="https://www.jmir.org/2025/1/e84120" target="_blank" rel="noopener">Gong et al. (2025)</a>, in a systematic review published in the <em>Journal of Medical Internet Research</em>, found that <strong>diagnostic accuracy drops from 82% on traditional case vignettes to 62.7% on multi-turn patient dialogues</strong> — a 19.3 percentage-point decrease. And only 5% of the 761 LLM evaluation studies they reviewed assessed performance on real patient care data.</p>

<p>As <a href="https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/" target="_blank" rel="noopener">Kyle Wiggers noted in TechCrunch (2024)</a>, the most commonly used benchmarks "haven't been adapted to reflect how models are used today." Researchers at the Allen Institute for AI have described an "evaluation crisis" — benchmarks like GPQA test PhD-level science while most real users write emails and cover letters. The mismatch between what we test and what we deploy is not a gap; it is a chasm.</p>

<p><a href="https://arxiv.org/abs/2402.09880" target="_blank" rel="noopener">McIntosh et al. (2024)</a>, in a critical assessment of 23 state-of-the-art LLM benchmarks published in <em>IEEE Transactions on Artificial Intelligence</em>, found significant limitations across the board: biases in question construction, inability to measure genuine reasoning, implementation inconsistencies across environments, and systematic failure to account for cultural norms.</p>

<h2>What would better evaluation look like?</h2>

<p>If static benchmarks are failing us, what should replace them?</p>

<p>One promising direction is human evaluation at scale. The <a href="https://arxiv.org/abs/2403.04132" target="_blank" rel="noopener">Chatbot Arena platform (Chiang et al., 2024)</a>, presented at ICML, demonstrates that crowdsourced pairwise comparison achieves 89.1% agreement with expert raters and provides far better separability between models than standard benchmarks. With over 240,000 votes from 90,000 users across 100+ languages, it suggests that human preference is both scalable and more informative than static tests.</p>
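<p>Arena-style leaderboards are computed by fitting a Bradley-Terry model to the pairwise votes. A simplified sketch of that fit, using the classic Zermelo fixed-point iteration (the production estimator also handles ties and reports confidence intervals):</p>

```python
def bradley_terry(wins: dict, iters: int = 500) -> dict:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of comparisons model a won against model b.
    Strengths are normalized each round; only relative values matter.
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            won = sum(wins.get((i, j), 0) for j in models)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = won / denom if denom else p[i]
        norm = sum(new_p.values())
        p = {m: v / norm for m, v in new_p.items()}
    return p

# Toy vote counts: A usually beats B, B usually beats C.
votes = {("A", "B"): 80, ("B", "A"): 20,
         ("B", "C"): 80, ("C", "B"): 20,
         ("A", "C"): 90, ("C", "A"): 10}
strengths = bradley_terry(votes)
print(sorted(strengths, key=strengths.get, reverse=True))  # ['A', 'B', 'C']
```

<p>The appeal of this setup is that it needs no gold answers at all — only preferences — which is why it resists the contamination and saturation failure modes above.</p>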

<p>But human evaluation alone is not sufficient for agents — systems that take actions, use tools, and complete multi-step workflows. For these, we need evaluation environments that replicate the conditions of deployment: real tools, real data patterns, real organizational complexity.</p>

<p>This means moving from static question-answer pairs to dynamic environments where agents interact with realistic systems — reading documents with real formatting quirks, navigating CRM interfaces, coordinating across communication channels. It means building evaluation infrastructure from real company data, anonymized but structurally authentic, so that the test reflects the mess an agent will actually face.</p>

<p>The goal is not to make evaluation harder. It is to make it honest.</p>

<h2>The cost of comfortable metrics</h2>

<p>The current benchmark paradigm is not just inadequate — it is actively harmful. It creates false confidence. It directs research effort toward optimizing scores rather than solving real problems. It allows models to be marketed as "state-of-the-art" on metrics that bear little relation to the tasks organizations actually need performed.</p>

<p>The 2025 Stanford HAI report puts it plainly: AI appears "nearly perfect" on existing benchmarks yet "still gives incorrect answers, produces unwanted outputs, and is difficult to interact with." This gap between measured performance and experienced performance is not closing. If anything, as models get better at gaming benchmarks, it is widening.</p>

<p>The AI industry needs evaluation infrastructure built on real-world data, grounded in actual deployment conditions, and designed to expose failures rather than confirm successes.</p>

<hr />

<h2>References</h2>

<ul>
<li>Akhtar, M., Reuel, A., Soni, P., et al. (2025). When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation. <em>ICML 2025</em>.</li>
<li>Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. <em>ICML 2024</em>.</li>
<li>Deng, C., Zhao, Y., Tang, X., Gerstein, M., &amp; Cohan, A. (2023). Investigating Data Contamination in Modern Benchmarks for Large Language Models. <em>NAACL 2024</em>.</li>
<li>Gao, L., Schulman, J., &amp; Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. <em>ICML 2023</em>.</li>
<li>Gong, E. J., Bang, C. S., Lee, J. J., &amp; Baik, G. H. (2025). Knowledge-Practice Performance Gap in Clinical Large Language Models. <em>JMIR</em>.</li>
<li>Hsia, J., Pruthi, D., Singh, A., &amp; Lipton, Z. C. (2023). Goodhart's Law Applies to NLP's Explanation Benchmarks. <em>EACL 2024</em>.</li>
<li>McIntosh, T. R., et al. (2024). Inadequacies of Large Language Model Benchmarks in the Era of Generative AI. <em>IEEE Trans. on AI</em>.</li>
<li>Stanford HAI. (2025). The 2025 AI Index Report.</li>
<li>Thomas, R. &amp; Uminsky, D. (2022). Reliance on metrics is a fundamental challenge for AI. <em>Patterns</em>, 3(5).</li>
<li>Xu, C., Guan, S., Greene, D., &amp; Kechadi, M-T. (2024). Benchmark Data Contamination of Large Language Models: A Survey.</li>
</ul>
]]></content:encoded>
      <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The data problem in agentic AI</title>
      <link>https://ooakdata.com/research/the-data-problem-in-agentic-ai</link>
      <guid isPermaLink="true">https://ooakdata.com/research/the-data-problem-in-agentic-ai</guid>
      <description>AI agents need fundamentally different data than chatbots — multi-step interaction trajectories, tool-use demonstrations, and multi-modal workflow data. This data barely exists.</description>
      <content:encoded><![CDATA[
<p>The AI industry is in the middle of a paradigm shift. The focus is moving from models that answer questions to agents that complete tasks — systems that read documents, send messages, update databases, navigate tools, and execute multi-step workflows autonomously.</p>

<p>But there is a problem that the excitement around agents tends to obscure: we do not have the data to train them.</p>

<p>The data infrastructure that powered the LLM revolution — web scrapes, books, code repositories — is fundamentally inadequate for agents. Agents need something different: multi-step interaction trajectories, tool-use demonstrations, multi-modal workflow data, and environments where they can learn through trial and error. This data barely exists, and building it is harder than most people realize.</p>

<h2>Agents are not chatbots</h2>

<p>The distinction matters because it determines what training data looks like.</p>

<p>A chatbot takes a prompt and returns a response. Its training data is input-output pairs. The internet is full of this data. It is comparatively easy to collect, clean, and scale.</p>

<p>An agent takes a goal and executes a sequence of actions across multiple tools and systems to achieve it. Its training data needs to capture trajectories — chains of reasoning, tool invocations, observations, corrections, and completions that unfold over many steps. This kind of data is vanishingly rare in the wild.</p>

<p><a href="https://arxiv.org/abs/2210.03629" target="_blank" rel="noopener">Yao et al. (2023)</a> formalized this distinction with ReAct, an architecture that interleaves reasoning traces with action steps. The key insight: agents require fundamentally different data than standard LLMs. They need trajectories of interleaved thought and action, not just input-output pairs.</p>
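<p>The difference in data shape is easy to see side by side. The schema and tool names below are illustrative (ours, not a standard format): a chatbot example is one pair, while a ReAct-style trajectory interleaves thought, action, and observation over many steps:</p>

```python
from dataclasses import dataclass, field

@dataclass
class ChatExample:
    """Chatbot training data: a single input-output pair."""
    prompt: str
    response: str

@dataclass
class AgentStep:
    """One step of a ReAct-style trajectory: reason, act, observe."""
    thought: str
    tool: str
    tool_args: dict
    observation: str

@dataclass
class AgentTrajectory:
    """Agent training data: a goal plus an ordered chain of steps."""
    goal: str
    steps: list = field(default_factory=list)
    outcome: str = ""

# Hypothetical tool names, for illustration only.
traj = AgentTrajectory(goal="Refund order #1234 for a late delivery")
traj.steps.append(AgentStep(
    thought="I need the order record before I can issue a refund.",
    tool="crm.lookup_order", tool_args={"order_id": "1234"},
    observation="Order found: delivered 9 days late, eligible for refund.",
))
traj.steps.append(AgentStep(
    thought="Eligibility confirmed; issue the refund.",
    tool="payments.refund", tool_args={"order_id": "1234"},
    observation="Refund issued.",
))
traj.outcome = "success"
print(len(traj.steps))  # 2
```

<p>Nothing resembling these records exists in web scrapes; the thoughts, tool calls, and observations all have to be produced by a demonstrator or an environment.</p>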

<p><a href="https://arxiv.org/abs/2302.04761" target="_blank" rel="noopener">Schick et al. (2023)</a> demonstrated a related point with Toolformer: LLMs can learn to use tools, but the training data must be augmented with API call annotations at positions where tool use helps predict future tokens. Standard text corpora contain no tool-use demonstrations. The data must be synthetically generated or manually curated — and neither approach scales cleanly.</p>

<h2>The benchmark evidence</h2>

<p>Recent agent benchmarks have made the data gap quantitatively visible.</p>

<p><strong><a href="https://arxiv.org/abs/2307.13854" target="_blank" rel="noopener">WebArena</a></strong> (Zhou et al., 2023), published at ICLR 2024, evaluates agents on realistic web tasks using fully functional self-hosted websites. The best GPT-4-based agent achieved only <strong>14.41% end-to-end task success, compared to 78.24% for humans</strong>. This is a 5x performance differential on tasks that most knowledge workers perform routinely.</p>

<p><strong><a href="https://arxiv.org/abs/2404.07972" target="_blank" rel="noopener">OSWorld</a></strong> (Xie et al., 2024), presented at NeurIPS, extends this to real computer environments across Ubuntu, Windows, and macOS. The best model achieves <strong>12.24% success versus 72.36% for humans</strong>. The primary failure modes: GUI grounding and operational knowledge — capabilities that require visual, multi-modal training data that barely exists.</p>

<p><strong><a href="https://arxiv.org/abs/2401.13649" target="_blank" rel="noopener">VisualWebArena</a></strong> (Koh et al., 2024), published at ACL, finds that the best vision-language model agents achieve only <strong>16.4% success versus 88.7% for humans</strong> on visually grounded web tasks. Text-only agent training data is fundamentally insufficient.</p>

<p><strong><a href="https://arxiv.org/abs/2308.03688" target="_blank" rel="noopener">AgentBench</a></strong> (Liu et al., 2023), accepted at ICLR 2024, evaluated 29 LLMs across 8 diverse environments and found a significant performance gap between top commercial models and open-source alternatives. The authors identify poor long-term reasoning, decision-making, and instruction-following as the main obstacles — all capabilities that require multi-step interaction data to develop.</p>

<p>The pattern is consistent: models that perform well on static benchmarks struggle dramatically when placed in realistic, multi-step, multi-tool environments. The models are not incapable. They are undertrained — specifically, they lack the right kind of training data.</p>

<h2>The reliability problem</h2>

<p>Even when agents succeed on a task, they do not succeed reliably.</p>

<p><a href="https://arxiv.org/abs/2406.12045" target="_blank" rel="noopener">Yao et al. (2024)</a> introduced <strong>tau-bench</strong>, a benchmark for tool-agent-user interaction in real-world customer service domains. Their findings: even state-of-the-art agents succeed on fewer than 50% of customer service tasks, and <strong>consistency drops to approximately 25% when the same task is repeated eight times</strong>.</p>
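<p>tau-bench formalizes reliability as pass^k: the chance that an agent solves the same task on all k independent tries, estimated per task from c successes in n trials with a pass@k-style combinatorial estimate. A sketch of that estimator (our implementation):</p>

```python
from math import comb

def pass_hat_k(n: int, successes: list, k: int) -> float:
    """Estimate pass^k: P(all k i.i.d. trials succeed), averaged over tasks.

    For a task solved in c of n trials, the per-task estimate is
    C(c, k) / C(n, k): the chance that k trials drawn without replacement
    are all successes. (math.comb(c, k) is 0 when c < k.)
    """
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)

# Two regimes with identical 50% single-trial success (pass^1 = 0.5):
bimodal = [8, 8, 0, 0]   # half the tasks always solved, half never
uniform = [4, 4, 4, 4]   # every task solved half the time

print(pass_hat_k(8, bimodal, 1), pass_hat_k(8, bimodal, 8))  # 0.5 0.5
print(pass_hat_k(8, uniform, 1), pass_hat_k(8, uniform, 8))  # 0.5 0.0
```

<p>The two regimes above share a single-trial success rate but have wildly different reliability. A pass^8 near 25% alongside a sub-50% pass^1 tells us agent success is concentrated on a subset of tasks it solves dependably; if trials were independent coin flips at 50%, pass^8 would be roughly 0.4%.</p>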

<p>For enterprise deployment, this is the critical metric. A system that works 50% of the time is not half as useful as one that works 100% of the time — it is essentially unusable for any workflow where errors have consequences.</p>

<h2>Why synthetic data is not enough</h2>

<p>The obvious response to the data scarcity problem is to generate synthetic training data — use a large model to produce agent trajectories, then fine-tune smaller models on those trajectories.</p>

<p>This approach works, up to a point. <a href="https://arxiv.org/abs/2310.05915" target="_blank" rel="noopener">Chen et al. (2023)</a> showed with FireAct that fine-tuning Llama2-7B with just 500 agent trajectories generated by GPT-4 leads to a <strong>77% performance increase</strong> on HotpotQA. The data is clearly valuable. But it creates two problems.</p>

<p>First, there is a <strong>quality ceiling</strong>. Synthetic trajectories inherit the limitations of the generating model: it cannot produce patterns it has never encountered, and the messy, ambiguous, context-dependent workflows of real companies are precisely what no model has been trained on.</p>

<p>Second, there is an <strong>entanglement problem</strong>. <a href="https://arxiv.org/abs/2403.12881" target="_blank" rel="noopener">Chen et al. (2024)</a> found with Agent-FLAN that current agent training corpora entangle format-following with agent reasoning, causing distribution shift. Worse, improving agent abilities through naive fine-tuning introduces hallucinations as a side effect.</p>

<p><a href="https://arxiv.org/abs/2310.12823" target="_blank" rel="noopener">Zeng et al. (2023)</a>, with AgentTuning, demonstrated that the quality and diversity of agent trajectory data is the bottleneck, not model architecture. Their AgentLM-70B became comparable to GPT-3.5-turbo on unseen agent tasks, confirming that the right training data can close capability gaps that no amount of architectural innovation addresses.</p>

<h2>What agent data actually needs to look like</h2>

<p><a href="https://www.anthropic.com/research/building-effective-agents" target="_blank" rel="noopener">Anthropic's research team (2024)</a>, in their guide to building effective agents, makes the point that the practical bottleneck is not model architecture but "the quality of tool definitions, context engineering, and evaluation data." The data and evaluation layer — not the model layer — is the binding constraint.</p>

<p>A <a href="https://www.wing.vc/content/rl-environments-for-agentic-ai-who-will-win-the-training-verification-layer-by-2030" target="_blank" rel="noopener">2025 analysis by Wing Venture Capital</a> frames this in market terms: RL environments play the same role for AI agents that EDA tools played for silicon design. Anthropic alone is estimated to be spending tens of millions annually on RL environments, with 3-5x growth expected into 2026.</p>

<p>This points to what agent training data actually needs:</p>

<ul>
<li><strong>Real-world grounding.</strong> Training data sourced from actual company workflows — documents with real formatting inconsistencies, conversations with organizational context, project management tools with actual task dependencies.</li>
<li><strong>Multi-modal coherence.</strong> Documents, conversations, and tool data that are interconnected — the Slack message references the Jira ticket which references the Google Doc.</li>
<li><strong>Expert-level task calibration.</strong> Tasks designed to challenge the current frontier of model capability, continuously updated as models improve.</li>
<li><strong>RL environments, not static datasets.</strong> Agents learn through interaction — taking actions, observing outcomes, adjusting strategies. The shift from datasets to environments is as fundamental as the shift from chatbots to agents.</li>
</ul>

<h2>The infrastructure gap</h2>

<p>The data problem in agentic AI is not a temporary inconvenience that will be solved by scaling existing approaches. It is a structural gap in the AI infrastructure stack. We have world-class models, increasingly capable architectures, and billions of dollars in deployment demand. What we lack is the data infrastructure to connect model capability to real-world performance.</p>

<p>Closing this gap requires building something new: pipelines that source real-world data through established partnerships, anonymize it while preserving structural fidelity, and transform it into RL environments where agents can train against the complexity they will actually face.</p>

<p>The model layer has had its revolution. The data layer is next.</p>

<hr />

<h2>References</h2>

<ul>
<li>Anthropic. (2024). Building Effective Agents.</li>
<li>Chen, B., et al. (2023). FireAct: Toward Language Agent Fine-tuning.</li>
<li>Chen, Z., et al. (2024). Agent-FLAN: Designing Data and Methods of Effective Agent Tuning. <em>ACL 2024</em>.</li>
<li>Koh, J. Y., et al. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. <em>ACL 2024</em>.</li>
<li>Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. <em>ICLR 2024</em>.</li>
<li>Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.</li>
<li>Wing Venture Capital. (2025). RL Environments for Agentic AI.</li>
<li>Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents. <em>NeurIPS 2024</em>.</li>
<li>Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting. <em>ICLR 2023</em>.</li>
<li>Yao, S., et al. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction.</li>
<li>Zeng, A., et al. (2023). AgentTuning: Enabling Generalized Agent Abilities for LLMs.</li>
<li>Zhou, S., et al. (2023). WebArena: A Realistic Web Environment. <em>ICLR 2024</em>.</li>
</ul>
]]></content:encoded>
      <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>From sandbox to production: the gap nobody talks about</title>
      <link>https://ooakdata.com/research/from-sandbox-to-production</link>
      <guid isPermaLink="true">https://ooakdata.com/research/from-sandbox-to-production</guid>
      <description>80% of AI projects fail in production. The problem is not the models — it&apos;s the gap between evaluation conditions and deployment conditions.</description>
      <content:encoded><![CDATA[
<p>Here is a pattern that plays out across the AI industry, over and over:</p>

<p>A team builds an AI system. It works impressively in the development environment. They demo it to stakeholders. Everyone is excited. They push it toward production. It fails — not catastrophically, not in a way that makes headlines, but persistently, in small ways that erode trust and eventually lead the project to be quietly shelved.</p>

<p>The industry has a name for this now: the sandbox-to-production gap. But we have been remarkably slow to treat it as the structural problem it is.</p>

<h2>The numbers are worse than you think</h2>

<p>The failure rate of AI projects in production is not a secret, but the scale is still surprising.</p>

<p>A <a href="https://www.rand.org/pubs/research_reports/RRA2680-1.html" target="_blank" rel="noopener">2024 RAND Corporation study</a>, based on interviews with 65 data scientists and engineers, estimates that <strong>80% of AI projects fail</strong> — double the failure rate of non-AI IT projects.</p>

<p>In 2024, <a href="https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025" target="_blank" rel="noopener">Gartner predicted</a> that at least <strong>30% of generative AI projects would be abandoned after proof of concept</strong> by end of 2025, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value.</p>

<p>By 2025, <a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning" target="_blank" rel="noopener">S&amp;P Global's Voice of the Enterprise survey</a> made that prediction look optimistic: <strong>42% of companies had abandoned the majority of their AI initiatives before reaching production</strong> — up from 17% the year before.</p>

<p><a href="https://www.nttdata.com/global/en/insights/focus/2024/between-70-85p-of-genai-deployment-efforts-are-failing" target="_blank" rel="noopener">NTT DATA's 2024 global survey</a> of 2,300+ leaders across 34 countries found that <strong>70-85% of GenAI deployments fail to meet their desired ROI</strong>, despite 83% of organizations claiming a well-defined GenAI strategy.</p>

<p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai" target="_blank" rel="noopener">McKinsey's 2025 State of AI survey</a> confirms the pattern from the other direction: while 78% of organizations now use AI in at least one business function, <strong>only about 5.6% report that more than 5% of their organization's EBIT is attributable to AI</strong>. Adoption is nearly universal. Value creation is not.</p>

<h2>Why demos work and production doesn't</h2>

<p>The sandbox-to-production gap is not random. It has specific, identifiable causes — and most of them are about data, not models.</p>

<h3>Underspecification</h3>

<p><a href="https://arxiv.org/abs/2011.03395" target="_blank" rel="noopener">D'Amour et al. (2022)</a>, in a landmark paper from Google published in the <em>Journal of Machine Learning Research</em>, introduced the concept of "underspecification": ML pipelines return many predictors with equivalently strong held-out performance in the training domain, but these predictors can behave very differently in deployment.</p>

<p>This means that good test-set performance does not guarantee good deployment performance — not because the model is bad, but because the evaluation does not constrain the model's behavior enough to ensure it generalizes correctly. The authors demonstrated this across computer vision, NLP, medical imaging, clinical risk prediction, and genomics.</p>
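<p>A toy version of the phenomenon, with fully synthetic data: two predictors are indistinguishable on held-out data because a spurious feature happens to track the label in the training domain, then diverge the moment deployment breaks that correlation:</p>

```python
def accuracy(predict, data):
    """Fraction of (features, label) pairs the predictor gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)

# Each example is ((causal_feature, spurious_feature), label).
# In the training domain the two features always agree, so a model
# keying on either one looks identical on held-out data.
held_out = [((1, 1), 1), ((0, 0), 0)] * 50
# In deployment the spurious correlation breaks.
deployed = [((1, 0), 1), ((0, 1), 0)] * 50

model_a = lambda x: x[0]   # learned the causal feature
model_b = lambda x: x[1]   # learned the spurious shortcut

print(accuracy(model_a, held_out), accuracy(model_b, held_out))  # 1.0 1.0
print(accuracy(model_a, deployed), accuracy(model_b, deployed))  # 1.0 0.0
```

<p>No amount of held-out evaluation in the training domain can tell these two models apart; only an evaluation that reproduces deployment conditions can.</p>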

<h3>Data cascades</h3>

<p><a href="https://dl.acm.org/doi/10.1145/3411764.3445518" target="_blank" rel="noopener">Sambasivan et al. (2021)</a>, in a study published at CHI based on interviews with 53 AI practitioners, identified "data cascades" — compounding negative downstream effects from data quality issues that are invisible during development but devastating in production.</p>

<p><strong>92% of practitioners had experienced data cascades.</strong> The root cause: conventional AI/ML practices systematically undervalue data quality work in favor of model work. As one practitioner put it (giving the paper its title): "Everyone wants to do the model work, not the data work."</p>

<h3>Hidden technical debt</h3>

<p><a href="https://papers.neurips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf" target="_blank" rel="noopener">Sculley et al. (2015)</a>, in one of the most cited ML systems papers, observed that only a tiny fraction of real-world ML systems consists of the ML model code itself. The surrounding infrastructure — data pipelines, feature stores, serving systems, monitoring, configuration management — is vast and prone to what they call "hidden technical debt."</p>

<p>This debt includes boundary erosion, entanglement, hidden feedback loops, and undeclared consumers. None of these failure modes appear in the sandbox. All of them appear in production.</p>

<h2>The organizational gap</h2>

<p>The sandbox-to-production gap is not purely technical. It is also organizational.</p>

<p><a href="https://dl.acm.org/doi/abs/10.1145/3510003.3510209" target="_blank" rel="noopener">Nahar et al. (2022)</a>, in a study published at ICSE based on interviews with 45 practitioners from 28 organizations, found that the transition breaks down at the boundary between data science and software engineering — miscommunication about model assumptions, inadequate documentation, and misaligned practices.</p>

<p>A <a href="https://www.brookings.edu/articles/the-last-mile-problem-in-ai/" target="_blank" rel="noopener">2024 analysis from the Brookings Institution</a> by Fleming, Li, and Thompson quantifies a specific aspect of this gap: while 80% of computer vision tasks are technically automatable, <strong>only 23% are economically viable</strong> when customization costs are included.</p>

<p><a href="https://hbr.org/2026/03/the-last-mile-problem-slowing-ai-transformation" target="_blank" rel="noopener">Lakhani, Spataro, and Stave (2026)</a>, writing in the <em>Harvard Business Review</em>, identify seven organizational frictions that prevent AI from scaling beyond pilots. Among them: organizations become "pilot-rich but transformation-poor"; legacy workflows ("process debt") prevent AI from operating reliably; and tribal knowledge hoarding creates invisible dependencies.</p>

<h2>The evaluation problem at the root</h2>

<p>These causes — underspecification, data cascades, technical debt, organizational friction — share a common root: <strong>the evaluation environment does not match the deployment environment.</strong></p>

<p>When we test models on clean benchmarks with well-formed inputs, we learn whether the model has a capability. We do not learn whether it can exercise that capability in the conditions where it will actually operate.</p>

<p>When we evaluate agents in sandboxes with synthetic data, we learn whether the agent can complete a task in controlled conditions. We do not learn whether it can handle the messy Slack threads, the ambiguous CRM entries, the documents that don't follow templates, the organizational structures that affect who can access what information.</p>

<h2>What closing the gap requires</h2>

<p>The solution is not better models. The current generation of models is already capable of far more than production deployment rates suggest. The solution is better data infrastructure:</p>

<ul>
<li><strong>Sourcing data from real companies</strong>, not generating it synthetically. Real organizational data has properties — inconsistency, ambiguity, implicit context, access control structures — that synthetic data cannot replicate.</li>
<li><strong>Anonymizing without simplifying.</strong> The anonymization process must preserve the structural complexity of the original data. An anonymized digital twin should be just as messy and interconnected as the original.</li>
<li><strong>Building dynamic environments, not static test sets.</strong> Agents need to be evaluated in environments where they take actions and observe consequences, not just answer questions.</li>
<li><strong>Calibrating for the frontier.</strong> Evaluation environments that can be solved by current models are not useful for long. Tasks need continuous calibration against the latest models.</li>
</ul>
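<p>The anonymization requirement hinges on consistency: the same entity must map to the same pseudonym everywhere, or the cross-references between documents dissolve. A minimal sketch, assuming entities have already been detected (a real pipeline needs entity recognition, human review, and far more care; all names below are synthetic):</p>

```python
import hashlib

class Pseudonymizer:
    """Replace known entities with stable pseudonyms across documents.

    Deterministic hashing keeps the mapping consistent: the same entity
    in a Slack thread and in a contract maps to the same token, so the
    link between the two documents survives anonymization.
    """
    def __init__(self, salt: bytes):
        self.salt = salt

    def pseudonym(self, entity: str, kind: str) -> str:
        digest = hashlib.sha256(self.salt + entity.encode()).hexdigest()[:6]
        return f"{kind}-{digest}"

    def scrub(self, text: str, entities: dict) -> str:
        # entities maps surface form -> kind, e.g. {"Acme Corp": "Org"}
        for surface, kind in entities.items():
            text = text.replace(surface, self.pseudonym(surface, kind))
        return text

p = Pseudonymizer(salt=b"per-dataset-secret")
entities = {"Acme Corp": "Org", "Dana Smith": "Person"}
slack = p.scrub("Dana Smith flagged the Acme Corp renewal in #sales.", entities)
doc = p.scrub("Renewal terms for Acme Corp, owner: Dana Smith.", entities)

# The same pseudonym appears in both documents, preserving the link.
acme = p.pseudonym("Acme Corp", "Org")
print(acme in slack and acme in doc)                 # True
print("Acme Corp" in slack or "Dana Smith" in doc)   # False
```

<p>Because the mapping is deterministic, the Slack message and the contract still point at the same organization after scrubbing: the structure survives even though the names do not.</p>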

<h2>The cost of the gap</h2>

<p>Every AI project that succeeds in the sandbox but fails in production represents wasted engineering time, wasted compute, and — most importantly — wasted organizational trust. When a team demos an impressive AI system that then fails to deliver, the organization does not just lose a project. It loses confidence in the next project.</p>

<p>The S&amp;P Global finding — that the proportion of organizations citing positive impact from GenAI fell across every objective assessed between 2024 and 2025 — is not a sign that AI does not work. It is a sign that the gap between what AI can do in controlled conditions and what it delivers in production is eroding the confidence needed to invest in deployment.</p>

<p>Closing this gap is not a research problem in the traditional sense. The models are good enough. The architectures are capable enough. What is missing is the data infrastructure that connects model capability to deployment reality — evaluation environments built from the real world, not from our idealized version of it.</p>

<hr />

<h2>References</h2>

<ul>
<li>D'Amour, A., et al. (2022). Underspecification Presents Challenges for Credibility in Modern ML. <em>JMLR</em>, 23, 1-61.</li>
<li>Fleming, M., Li, W., &amp; Thompson, N. C. (2024). The Last Mile Problem in AI. <em>Brookings Institution</em>.</li>
<li>Gartner. (2024). Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After PoC.</li>
<li>Lakhani, K. R., Spataro, J., &amp; Stave, J. (2026). The "Last Mile" Problem Slowing AI Transformation. <em>HBR</em>.</li>
<li>McKinsey &amp; Company. (2025). The State of AI: Global Survey.</li>
<li>Nahar, N., et al. (2022). Collaboration Challenges in Building ML-Enabled Systems. <em>ICSE '22</em>.</li>
<li>NTT DATA. (2024). Between 70-85% of GenAI Deployments Are Failing to Meet Their Desired ROI.</li>
<li>Paleyes, A., Urma, R.-G., &amp; Lawrence, N. D. (2022). Challenges in Deploying Machine Learning. <em>ACM Computing Surveys</em>.</li>
<li>Ryseff, J., De Bruhl, B. F., &amp; Newberry, S. J. (2024). Root Causes of Failure for AI Projects. <em>RAND Corporation</em>.</li>
<li>S&amp;P Global. (2025). AI Experiences Rapid Adoption, But with Mixed Outcomes.</li>
<li>Sambasivan, N., et al. (2021). Data Cascades in High-Stakes AI. <em>CHI '21</em>.</li>
<li>Sculley, D., et al. (2015). Hidden Technical Debt in ML Systems. <em>NeurIPS 2015</em>.</li>
</ul>
]]></content:encoded>
      <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>