TL;DR: BentoLabs AI is the monitoring and learning layer for long-running agents. We detect when agents silently fail or drift from the user's goal, system prompt, or tool contracts, show affected users and root cause, and suggest the prompt, skill, or harness fix. Run 101 is measurably better than run 1.
Hello everyone, we're Abhinav and Kaushik, co-founders of BentoLabs AI.
We spent two years at Emergent (YC S24) building and operating production coding agents used by 5M+ users, helping it scale from 0 to $100M ARR and topping SWE-Bench twice. Along the way we realised that as more teams deploy agents, keeping them reliable, observable, debuggable and continuously improving in production becomes mission-critical. Most teams don't have a system for it. They have engineers.
Most production agents fail silently. You have 10,000 traces a day and zero visibility into reasoning drift until a support ticket pops up. That's the easier half! The harder problem is that nothing your agent figures out on one run carries into the next. An agent might solve a complex edge case on run 47, but because nothing carries forward, it burns your budget rediscovering the same fix on run 48.
This makes production teams spend hours reading logs in one tab just to manually patch prompts in another, on a loop that never closes.
BentoLabs AI: The monitoring and learning layer for long-running agents.
Monitoring: BentoLabs finds every failure instance across your production traces, classifies it, and tracks it before the support ticket pops up.
Learning: BentoLabs captures what your agent figures out on every run and makes sure it carries into the next one. Recurring failures get fixed once and stay fixed. Hard-won solutions become reusable. Run 101 starts from everything runs 1 through 100 learned, not from zero.
Terminal-Bench 2.0 (Internal Run): We validated our recursive learning engine on Terminal-Bench 2.0, one of the most demanding agentic-shell benchmarks in the field today. The official Claude Sonnet 4.5 baseline scores 42.2% pass@1; our agent scored 52.4% with the same agent, same model, and same budget. That is a +10.2 percentage-point lift, statistically significant (p < 0.05), with 13 tasks showing wins ≥ +20 pp and only 3 showing losses. (Deep Dive)
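For readers who want to check the arithmetic, here is a minimal sketch that computes the lift from the two pass@1 figures quoted above. The helper name `lift` is hypothetical and only illustrates the difference between an absolute lift in percentage points and a relative lift; it is not part of BentoLabs or Terminal-Bench tooling.

```python
def lift(baseline_pct: float, treated_pct: float) -> tuple[float, float]:
    """Return (absolute lift in percentage points, relative lift in percent).

    Hypothetical helper for illustration only.
    """
    absolute_pp = treated_pct - baseline_pct
    relative_pct = 100.0 * absolute_pp / baseline_pct
    return absolute_pp, relative_pct


# Figures quoted in the post: 42.2% baseline pass@1, 52.4% with the learning engine.
absolute_pp, relative_pct = lift(42.2, 52.4)
print(f"absolute lift: {absolute_pp:+.1f} pp")   # absolute lift: +10.2 pp
print(f"relative lift: {relative_pct:+.1f}%")    # relative lift: +24.2%
```

The post reports the absolute figure (+10.2 pp). The relative figure is derived here from the same two numbers and is not a separately reported result.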
ARC-AGI-3 (Internal Run): To see our engine in action, we took on ARC-AGI-3 (25 interactive puzzle games, the hardest agent benchmark we could find). Frontier models score 0.2–0.3% out of the box. With our engine, the agent cracked 3 games for the first time that it had never solved across ~30 prior runs. (Deep Dive)
Why we're building BentoLabs
For the last 2 years at Emergent, we were the engineers who were the fix. We spent thousands of hours staring at traces and watched teams burn their best engineers' time patching prompts, correcting tool definitions and experimenting with different skills, hoping something would stick. We are building BentoLabs AI to move past this repetitive cycle.
We're already working with teams at unicorn scale. Every conversation with a team running agents in production confirms the same thing: the problem is universal. BentoLabs gives them the operational leverage to scale their agent ecosystems without scaling human firefighting alongside them.
If your engineers are doing the logs-and-patches rotation or your agents keep hitting the same failures, let's have a chat. Or email us: abhinav@bentolabs.ai / kaushik@bentolabs.ai