From sandbox to production: the gap nobody talks about
Here is a pattern that plays out across the AI industry, over and over:
A team builds an AI system. It works impressively in the development environment. They demo it to stakeholders. Everyone is excited. They push it toward production. It fails — not catastrophically, not in a way that makes headlines, but persistently, in small ways that erode trust and eventually lead the project to be quietly shelved.
The industry has a name for this now: the sandbox-to-production gap. But we have been remarkably slow to treat it as the structural problem it is.
The numbers are worse than you think
The failure rate of AI projects in production is not a secret, but the scale is still surprising.
A 2024 RAND Corporation study, based on interviews with 65 data scientists and engineers, estimates that more than 80% of AI projects fail, roughly twice the failure rate of comparable non-AI IT projects.
In 2024, Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by end of 2025, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value.
By 2025, S&P Global's Voice of the Enterprise survey made that prediction look optimistic: 42% of companies had abandoned the majority of their AI initiatives before reaching production — up from 17% the year before.
NTT DATA's 2024 global survey of 2,300+ leaders across 34 countries found that 70-85% of GenAI deployments fail to meet their desired ROI, despite 83% of organizations claiming a well-defined GenAI strategy.
McKinsey's 2025 State of AI survey confirms the pattern from the other direction: while 78% of organizations now use AI in at least one business function, only about 5.6% report that more than 5% of their organization's EBIT is attributable to AI. Adoption is nearly universal. Value creation is not.
Why demos work and production doesn't
The sandbox-to-production gap is not random. It has specific, identifiable causes — and most of them are about data, not models.
Underspecification
D'Amour et al. (2022), in a landmark paper from Google published in the Journal of Machine Learning Research, introduced the concept of "underspecification": ML pipelines return many predictors with equivalently strong held-out performance in the training domain, but these predictors can behave very differently in deployment.
This means that good test-set performance does not guarantee good deployment performance — not because the model is bad, but because the evaluation does not constrain the model's behavior enough to ensure it generalizes correctly. The authors demonstrated this across computer vision, NLP, medical imaging, clinical risk prediction, and genomics.
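A toy illustration (not from the paper) makes the mechanism concrete. Below, two linear classifiers are trained on data where the causal feature has an exact duplicate. Because the training data cannot distinguish the two features, gradient descent preserves whatever preference the initialization encodes: both predictors look identical on held-out data from the training domain, then diverge sharply once the duplicate decorrelates at deployment. All names and the setup are invented for this sketch.

```python
import numpy as np

def train_logreg(X, y, w_init, steps=200, lr=0.05):
    """Plain gradient descent on logistic loss from a given init."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y > 0.5)))

rng = np.random.default_rng(0)
# Training domain: feature 0 is causal; feature 1 is an exact copy of it.
x1 = rng.normal(size=4000)
X_train = np.column_stack([x1, x1])
y = (x1 > 0).astype(float)

# Two equally plausible predictors, differing only in initialization.
# With duplicated features, the data never forces a choice between them.
w_a = train_logreg(X_train, y, [+5.0, -5.0])  # leans on the causal feature
w_b = train_logreg(X_train, y, [-5.0, +5.0])  # leans on the copy

acc_a_train = accuracy(w_a, X_train, y)
acc_b_train = accuracy(w_b, X_train, y)

# Deployment domain: the "copy" is now unrelated noise.
x1_new = rng.normal(size=4000)
X_deploy = np.column_stack([x1_new, rng.normal(size=4000)])
y_new = (x1_new > 0).astype(float)

acc_a_deploy = accuracy(w_a, X_deploy, y_new)
acc_b_deploy = accuracy(w_b, X_deploy, y_new)

print(f"training domain: {acc_a_train:.2f} vs {acc_b_train:.2f}")
print(f"deployment:      {acc_a_deploy:.2f} vs {acc_b_deploy:.2f}")
```

Both models pass the same held-out evaluation; only deployment reveals that one of them learned the wrong thing. No amount of additional in-distribution testing would have told them apart.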
Data cascades
Sambasivan et al. (2021), in a study published at CHI based on interviews with 53 AI practitioners, identified "data cascades" — compounding negative downstream effects from data quality issues that are invisible during development but devastating in production.
In that sample, 92% of practitioners had experienced data cascades. The root cause: conventional AI/ML practices systematically undervalue data quality work in favor of model work. As one practitioner put it (giving the paper its title): "Everyone wants to do the model work, not the data work."
Hidden technical debt
Sculley et al. (2015), in one of the most cited ML systems papers, observed that only a tiny fraction of real-world ML systems consists of the ML model code itself. The surrounding infrastructure — data pipelines, feature stores, serving systems, monitoring, configuration management — is vast and prone to what they call "hidden technical debt."
This debt includes boundary erosion, entanglement, hidden feedback loops, and undeclared consumers. None of these failure modes appear in the sandbox. All of them appear in production.
The organizational gap
The sandbox-to-production gap is not purely technical. It is also organizational.
Nahar et al. (2022), in a study published at ICSE based on interviews with 45 practitioners from 28 organizations, found that the transition breaks down at the boundary between data science and software engineering — miscommunication about model assumptions, inadequate documentation, and misaligned practices.
A 2024 analysis from the Brookings Institution by Fleming, Li, and Thompson quantifies a specific aspect of this gap: while 80% of computer vision tasks are technically automatable, only 23% are economically viable when customization costs are included.
Lakhani, Spataro, and Stave (2026), writing in the Harvard Business Review, identify seven organizational frictions that prevent AI from scaling beyond pilots. Among them: organizations become "pilot-rich but transformation-poor"; legacy workflows ("process debt") prevent AI from operating reliably; and tribal knowledge hoarding creates invisible dependencies.
The evaluation problem at the root
These causes — underspecification, data cascades, technical debt, organizational friction — share a common root: the evaluation environment does not match the deployment environment.
When we test models on clean benchmarks with well-formed inputs, we learn whether the model has a capability. We do not learn whether it can exercise that capability in the conditions where it will actually operate.
When we evaluate agents in sandboxes with synthetic data, we learn whether the agent can complete a task in controlled conditions. We do not learn whether it can handle the messy Slack threads, the ambiguous CRM entries, the documents that don't follow templates, the organizational structures that affect who can access what information.
What closing the gap requires
The solution is not better models. The current generation of models is already capable of far more than production deployment rates suggest. The solution is better data infrastructure:
- Sourcing data from real companies, not generating it synthetically. Real organizational data has properties — inconsistency, ambiguity, implicit context, access control structures — that synthetic data cannot replicate.
- Anonymizing without simplifying. The anonymization process must preserve the structural complexity of the original data. An anonymized digital twin should be just as messy and interconnected as the original.
- Building dynamic environments, not static test sets. Agents need to be evaluated in environments where they take actions and observe consequences, not just answer questions.
- Calibrating for the frontier. Evaluation environments that can be solved by current models are not useful for long. Tasks need continuous calibration against the latest models.
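The difference between a static test set and a dynamic environment can be made concrete with a toy sketch. Everything below is invented for illustration: a minimal environment where the agent's actions mutate state, and where the score depends on the trajectory taken, not just the final answer. A static benchmark could never penalize the first agent, because both agents end in the same "ticket closed" state.

```python
from dataclasses import dataclass, field

@dataclass
class TicketEnv:
    """Toy environment: the agent must close a support ticket, but
    closing it without first reading the thread counts as a failure."""
    thread_read: bool = False
    closed: bool = False
    log: list = field(default_factory=list)

    def step(self, action: str) -> str:
        self.log.append(action)
        if action == "read_thread":
            self.thread_read = True
            return "thread: customer already restarted; error persists"
        if action == "close_ticket":
            self.closed = True
            # The consequence of acting without context is only visible
            # here, in the environment's state, not in the final answer.
            return "closed (informed)" if self.thread_read else "closed (blind)"
        return "unknown action"

def evaluate(agent_actions):
    """Score a whole trajectory, not a single answer."""
    env = TicketEnv()
    observations = [env.step(a) for a in agent_actions]
    success = env.closed and env.thread_read
    return success, observations

blind_ok, blind_obs = evaluate(["close_ticket"])
informed_ok, informed_obs = evaluate(["read_thread", "close_ticket"])
print(blind_ok, informed_ok)  # the static end state is identical; the scores are not
```

Real evaluation environments are vastly richer than this, of course, but the structural point carries over: the harness must observe what the agent does along the way, because that is where production failures live.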
The cost of the gap
Every AI project that succeeds in the sandbox but fails in production represents wasted engineering time, wasted compute, and, most costly of all, lost organizational trust. When a team demos an impressive AI system that then fails to deliver, the organization does not just lose a project. It loses confidence in the next project.
The S&P Global finding — that the proportion of organizations citing positive impact from GenAI fell across every objective assessed between 2024 and 2025 — is not a sign that AI does not work. It is a sign that the gap between what AI can do in controlled conditions and what it delivers in production is eroding the confidence needed to invest in deployment.
Closing this gap is not a research problem in the traditional sense. The models are good enough. The architectures are capable enough. What is missing is the data infrastructure that connects model capability to deployment reality — evaluation environments built from the real world, not from our idealized version of it.
References
- D'Amour, A., et al. (2022). Underspecification Presents Challenges for Credibility in Modern ML. JMLR, 23, 1-61.
- Fleming, M., Li, W., & Thompson, N. C. (2024). The Last Mile Problem in AI. Brookings Institution.
- Gartner. (2024). Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After PoC.
- Lakhani, K. R., Spataro, J., & Stave, J. (2026). The "Last Mile" Problem Slowing AI Transformation. HBR.
- McKinsey & Company. (2025). The State of AI: Global Survey.
- Nahar, N., et al. (2022). Collaboration Challenges in Building ML-Enabled Systems. ICSE '22.
- NTT DATA. (2024). Between 70-85% of GenAI Deployments Are Failing to Meet Their Desired ROI.
- Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2022). Challenges in Deploying Machine Learning. ACM Computing Surveys.
- Ryseff, J., De Bruhl, B. F., & Newberry, S. J. (2024). Root Causes of Failure for AI Projects. RAND Corporation.
- S&P Global. (2025). AI Experiences Rapid Adoption, But with Mixed Outcomes.
- Sambasivan, N., et al. (2021). Data Cascades in High-Stakes AI. CHI '21.
- Sculley, D., et al. (2015). Hidden Technical Debt in ML Systems. NeurIPS 2015.