Research
We publish what we learn
Our research covers evaluation methodology, dataset design, and the gap between benchmark performance and real-world capability.
Why synthetic benchmarks are holding AI back
The AI industry over-indexes on synthetic benchmarks that don't reflect real-world complexity. Models that ace leaderboards routinely fail in production.
March 18, 2026
10 min read
The data problem in agentic AI
AI agents need fundamentally different data than chatbots do: multi-step interaction trajectories, tool-use demonstrations, and multimodal workflow data. This data barely exists.
February 11, 2026
11 min read
From sandbox to production: the gap nobody talks about
80% of AI projects fail in production. The problem is not the models; it is the gap between evaluation conditions and deployment conditions.
January 22, 2026
12 min read