The data problem in agentic AI
The AI industry is in the middle of a paradigm shift. The focus is moving from models that answer questions to agents that complete tasks — systems that read documents, send messages, update databases, navigate tools, and execute multi-step workflows autonomously.
But there is a problem that the excitement around agents tends to obscure: we do not have the data to train them.
The data infrastructure that powered the LLM revolution — web scrapes, books, code repositories — is fundamentally inadequate for agents. Agents need something different: multi-step interaction trajectories, tool-use demonstrations, multi-modal workflow data, and environments where they can learn through trial and error. This data barely exists, and building it is harder than most people realize.
Agents are not chatbots
The distinction matters because it determines what training data looks like.
A chatbot takes a prompt and returns a response. Its training data is input-output pairs. The internet is full of this data. It is comparatively easy to collect, clean, and scale.
An agent takes a goal and executes a sequence of actions across multiple tools and systems to achieve it. Its training data needs to capture trajectories — chains of reasoning, tool invocations, observations, corrections, and completions that unfold over many steps. This kind of data is vanishingly rare in the wild.
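Concretely, one such trajectory record might look like the following sketch. The schema is hypothetical — field names, tools, and the task are illustrative, not drawn from any specific dataset:

```python
# A hypothetical schema for one agent trajectory: a goal followed by
# alternating reasoning, tool calls, and observations, ending in an
# outcome label. All names and tools here are illustrative.
trajectory = {
    "goal": "Refund order #1042 and notify the customer",
    "steps": [
        {"thought": "Look up the order to confirm it is refundable.",
         "action": {"tool": "orders.get", "args": {"order_id": 1042}},
         "observation": {"status": "delivered", "refundable": True}},
        {"thought": "It is refundable, so issue the refund.",
         "action": {"tool": "payments.refund", "args": {"order_id": 1042}},
         "observation": {"ok": True}},
        {"thought": "Refund issued; send the confirmation email.",
         "action": {"tool": "email.send",
                    "args": {"to": "customer@example.com",
                             "subject": "Your refund is on its way"}},
         "observation": {"sent": True}},
    ],
    "outcome": "success",
}
```

Unlike a prompt-response pair, the supervision signal here spans every step: the intermediate reasoning, the choice of tool, the arguments passed to it, and the eventual outcome. Almost none of the web captures this.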
Yao et al. (2023) formalized this distinction with ReAct, an architecture that interleaves reasoning traces with action steps. The key insight: agents require fundamentally different data than standard LLMs. They need trajectories of interleaved thought and action, not just input-output pairs.
Schick et al. (2023) demonstrated a related point with Toolformer: LLMs can learn to use tools, but the training data must be augmented with API call annotations at positions where tool use helps predict future tokens. Standard text corpora contain no tool-use demonstrations. The data must be synthetically generated or manually curated — and neither approach scales cleanly.
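A minimal sketch of the Toolformer idea: an API-call annotation is inserted into otherwise ordinary text, and kept only if it helps the model predict what follows. The bracket syntax loosely follows the paper's notation, and the loss values below are purely illustrative:

```python
def keep_call(loss_without_call: float, loss_with_call: float,
              tau: float = 0.1) -> bool:
    """Toolformer's self-supervised filter, simplified: keep an inserted
    API-call annotation only if conditioning on the call's result lowers
    the language-modeling loss on the following tokens by at least tau."""
    return loss_without_call - loss_with_call >= tau

# Plain corpus text carries no tool-use signal; the augmented version
# embeds a call at the position where its result aids next-token
# prediction (bracket syntax loosely follows the paper):
plain = "The Eiffel Tower was completed in 1889."
augmented = ("The Eiffel Tower was completed in "
             "[QA('When was the Eiffel Tower completed?') -> 1889] 1889.")

# Illustrative losses: the call reduced loss by 0.7, so it is retained.
kept = keep_call(loss_without_call=2.1, loss_with_call=1.4)
```

The filter is what makes the generation step tractable: most candidate call insertions do not help prediction and are discarded, which is also why the approach yields limited data per unit of compute.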
The benchmark evidence
Recent agent benchmarks have made the data gap quantitatively visible.
WebArena (Zhou et al., 2023), published at ICLR 2024, evaluates agents on realistic web tasks using fully functional self-hosted websites. The best GPT-4-based agent achieved only 14.41% end-to-end task success, compared to 78.24% for humans: a more than fivefold gap on tasks that most knowledge workers perform routinely.
OSWorld (Xie et al., 2024), presented at NeurIPS, extends this to real computer environments across Ubuntu, Windows, and macOS. The best model achieves 12.24% success versus 72.36% for humans. The primary failure modes: GUI grounding and operational knowledge — capabilities that require visual, multi-modal training data that barely exists.
VisualWebArena (Koh et al., 2024), published at ACL, finds that the best vision-language model agents achieve only 16.4% success versus 88.7% for humans on visually grounded web tasks. For tasks like these, text-only agent training data is fundamentally insufficient.
AgentBench (Liu et al., 2023), accepted at ICLR 2024, evaluated 29 LLMs across 8 diverse environments and found a significant performance gap between top commercial models and open-source alternatives. The authors identify poor long-term reasoning, decision-making, and instruction-following as the main obstacles — all capabilities that require multi-step interaction data to develop.
The pattern is consistent: models that perform well on static benchmarks struggle dramatically when placed in realistic, multi-step, multi-tool environments. The models are not incapable. They are undertrained — specifically, they lack the right kind of training data.
The reliability problem
Even when agents succeed on a task, they do not succeed reliably.
Yao et al. (2024) introduced tau-bench, a benchmark for tool-agent-user interaction in real-world customer service domains. Their findings: even state-of-the-art agents succeed on fewer than 50% of customer service tasks, and consistency drops to approximately 25% when the same task is repeated eight times.
For enterprise deployment, this is the critical metric. A system that works 50% of the time is not half as useful as one that works 100% of the time — it is essentially unusable for any workflow where errors have consequences.
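tau-bench quantifies this with a pass^k-style metric: the probability that k attempts at the same task all succeed. A minimal estimator, reconstructed here as a sketch from the observed success count:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that k attempts at the same task all
    succeed, given c successes observed over n attempts — a sketch of
    the pass^k estimator used in tau-bench-style evaluation."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# An agent that succeeds on 5 of 8 repeated attempts at one task:
per_try = pass_hat_k(8, 5, 1)    # 0.625: respectable on a single try
all_eight = pass_hat_k(8, 5, 8)  # 0.0: never reliable across all eight
```

The asymmetry is the point: per-attempt success rates overstate usefulness, because reliability across repeated attempts decays much faster than the single-shot number suggests.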
Why synthetic data is not enough
The obvious response to the data scarcity problem is to generate synthetic training data — use a large model to produce agent trajectories, then fine-tune smaller models on those trajectories.
This approach works, up to a point. Chen et al. (2023) showed with FireAct that fine-tuning Llama2-7B with just 500 agent trajectories generated by GPT-4 leads to a 77% performance increase on HotpotQA. The data is clearly valuable. But it creates two problems.
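The distillation recipe behind results like FireAct can be sketched as flattening each teacher-generated trajectory into per-step supervised pairs: given everything so far, predict the next thought and action. This is a simplified illustration, not FireAct's exact data format:

```python
def trajectory_to_sft_examples(goal: str, steps: list) -> list:
    """Flatten a multi-step agent trajectory into (prompt, completion)
    pairs for supervised fine-tuning. Each step teaches the student
    model to emit the next thought + action given the context so far.
    A simplified sketch, not any specific system's format."""
    examples, context = [], f"Goal: {goal}\n"
    for thought, action, observation in steps:
        target = f"Thought: {thought}\nAction: {action}"
        examples.append({"prompt": context, "completion": target})
        context += f"{target}\nObservation: {observation}\n"
    return examples

# One teacher-generated trajectory yields one training pair per step:
examples = trajectory_to_sft_examples(
    goal="Find when the Eiffel Tower opened",
    steps=[
        ("I should search for the opening date.",
         "search('Eiffel Tower opening date')",
         "The tower opened on 31 March 1889."),
        ("The observation answers the question.",
         "finish('31 March 1889')",
         "Task complete."),
    ],
)
```

Note what this construction makes explicit: every completion the student learns from was authored by the teacher model, which is exactly why the quality ceiling described below exists.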
First, there is a quality ceiling. Synthetic trajectories inherit the limitations of the generating model. They do not contain patterns the generating model has never seen — and the messy, ambiguous, context-dependent workflows of real companies are precisely what no model has been trained on.
Second, there is an entanglement problem. Chen et al. (2024) found with Agent-FLAN that current agent training corpora entangle format-following with agent reasoning, causing distribution shift. Worse, improving agent abilities through naive fine-tuning introduces hallucinations as a side effect.
Zeng et al. (2023), with AgentTuning, demonstrated that the quality and diversity of agent trajectory data is the bottleneck, not model architecture. Their AgentLM-70B became comparable to GPT-3.5-turbo on unseen agent tasks, confirming that the right training data can close capability gaps that no amount of architectural innovation addresses.
What agent data actually needs to look like
Anthropic's research team (2024), in their guide to building effective agents, makes the point that the practical bottleneck is not model architecture but "the quality of tool definitions, context engineering, and evaluation data." The data and evaluation layer — not the model layer — is the binding constraint.
A 2025 analysis by Wing Venture Capital frames this in market terms: RL environments are playing for AI agents the same role that EDA played for silicon design. Anthropic alone is estimated to be spending tens of millions annually on RL environments, with 3-5x growth expected into 2026.
This points to what agent training data actually needs:
- Real-world grounding. Training data sourced from actual company workflows — documents with real formatting inconsistencies, conversations with organizational context, project management tools with actual task dependencies.
- Multi-modal coherence. Documents, conversations, and tool data that are interconnected — the Slack message references the Jira ticket which references the Google Doc.
- Expert-level task calibration. Tasks designed to challenge the current frontier of model capability, continuously updated as models improve.
- RL environments, not static datasets. Agents learn through interaction — taking actions, observing outcomes, adjusting strategies. The shift from datasets to environments is as fundamental as the shift from chatbots to agents.
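The contract such environments expose can be sketched with a toy gym-style interface. Everything here — the class name, the triage task, the one-step episode — is hypothetical; real agent environments wrap actual tools and documents but present the same reset/step loop:

```python
import random

class TicketTriageEnv:
    """Toy RL environment sketch (hypothetical): the agent routes a
    support ticket to the right queue. The interface, not the task,
    is the point: agents learn by acting and observing outcomes."""

    QUEUES = ["billing", "bugs", "account"]

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.answer = None

    def reset(self) -> dict:
        """Start an episode; return the initial observation."""
        self.answer = self.rng.choice(self.QUEUES)
        return {"ticket": f"Customer issue mentioning {self.answer}"}

    def step(self, action: str):
        """Apply the agent's action; return (observation, reward, done)."""
        reward = 1.0 if action == self.answer else 0.0
        done = True  # one-shot episode, for brevity
        return {"ticket": None}, reward, done

# The training loop interacts instead of reading a static dataset:
env = TicketTriageEnv()
obs = env.reset()
action = "billing"  # a learned policy would choose this from obs
obs, reward, done = env.step(action)
```

A static dataset freezes one path through a task; an environment like this scores any path the agent tries, which is what makes trial-and-error learning possible at all.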
The infrastructure gap
The data problem in agentic AI is not a temporary inconvenience that will be solved by scaling existing approaches. It is a structural gap in the AI infrastructure stack. We have world-class models, increasingly capable architectures, and billions of dollars in deployment demand. What we lack is the data infrastructure to connect model capability to real-world performance.
Closing this gap requires building something new: pipelines that source real-world data through established partnerships, anonymize it while preserving structural fidelity, and transform it into RL environments where agents can train against the complexity they will actually face.
The model layer has had its revolution. The data layer is next.
References
- Anthropic. (2024). Building Effective Agents.
- Chen, B., et al. (2023). FireAct: Toward Language Agent Fine-tuning.
- Chen, Z., et al. (2024). Agent-FLAN: Designing Data and Methods of Effective Agent Tuning. ACL 2024.
- Koh, J. Y., et al. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. ACL 2024.
- Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. ICLR 2024.
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.
- Wing Venture Capital. (2025). RL Environments for Agentic AI.
- Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents. NeurIPS 2024.
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting. ICLR 2023.
- Yao, S., et al. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction.
- Zeng, A., et al. (2023). AgentTuning: Enabling Generalized Agent Abilities for LLMs.
- Zhou, S., et al. (2023). WebArena: A Realistic Web Environment. ICLR 2024.