Your agent isn't dumb. Your repair loop is missing.

Most teams diagnosing agent failures reach for the same lever: upgrade the model. Makes sense — the agent couldn't complete the task, the model is what did the thinking, so get a smarter model. IBM Research and RPI published a paper this week that challenges that instinct at the level of architecture, not just degree.

The paper is Evoflux (arXiv:2606.12674), and its claim is specific: the reason small models fail at tool-using agent tasks isn't model intelligence. It's that real tool environments are unforgiving — catalogs change, schemas break, intermediate outputs have dependencies — and most agent stacks have no mechanism to fix a plan after it starts going wrong. 1

The actual failure mode

IBM tested compact models (1.5B–4B parameters) on MCP-Bench, a benchmark built around MCP (Model Context Protocol, the industry-standard protocol for connecting AI agents to tools and data sources, now at 97 million downloads 2). The benchmark spans 28 live servers and 250 tools: biomed APIs, NASA data, Google Maps, academic search, crypto DEX, weather services. 3 The baseline result: roughly 3% of tool workflows executed successfully. Not 30%. Three.

The failure pattern wasn't "model doesn't know how to call a function." It was:

The model generates a plausible-looking plan that breaks when the first tool returns unexpected data
The plan references a tool that doesn't exist in the current catalog
Parameters pass syntax validation but fail the actual API schema check
A downstream step depends on output from an upstream step the model assumed would succeed
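
Three of these four failure modes can be caught before anything runs. The sketch below is a minimal illustration, not code from the paper: the catalog contents, the `Step` shape, and the problem labels are all hypothetical. It checks a plan for unknown tools, parameters that fail a simple schema, and dependencies on steps that have not been defined yet. The first failure mode, unexpected data coming back from a tool, only shows up at execution time, so a static check like this cannot catch it.

```python
from dataclasses import dataclass, field

# Hypothetical catalog: tool name -> required parameter names and types.
CATALOG = {
    "geocode": {"address": str},
    "weather": {"lat": float, "lon": float},
}

@dataclass
class Step:
    id: str
    tool: str
    args: dict
    depends_on: list = field(default_factory=list)

def check_plan(steps, catalog):
    """Return (step_id, problem) pairs found before execution."""
    problems = []
    seen = set()
    for step in steps:
        schema = catalog.get(step.tool)
        if schema is None:
            # The plan references a tool that is not in the current catalog.
            problems.append((step.id, "unknown_tool"))
        else:
            for name, typ in schema.items():
                if name not in step.args:
                    problems.append((step.id, f"missing_param:{name}"))
                elif not isinstance(step.args[name], typ):
                    problems.append((step.id, f"bad_type:{name}"))
        for dep in step.depends_on:
            # A dependency must point at a step that appears earlier in the plan.
            if dep not in seen:
                problems.append((step.id, f"unresolved_dependency:{dep}"))
        seen.add(step.id)
    return problems

plan = [
    Step("s1", "geocode", {"address": "Troy, NY"}),
    Step("s2", "weather", {"lat": "42.7", "lon": -73.7}, depends_on=["s1"]),
    Step("s3", "air_quality", {"city": "Troy"}, depends_on=["s4"]),
]
PROBLEMS = check_plan(plan, CATALOG)
```

Here `s2` passes a string where the schema wants a float, and `s3` names a tool that is not in the catalog and depends on a step that does not exist. A plan can look fine as text and still fail every one of these checks.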

Narwal Speaks, a practitioner with 30 years of Fortune 50 experience who analyzed the paper, put it plainly: "Small tool agents do not fail because they lack function-calling syntax. They fail because real tool catalogs change, schemas are unforgiving, intermediate outputs have dependencies, and plausible plans often collapse at execution time." 4

The instinctive fix, fine-tuning on tool-calling examples, doesn't work here. IBM tried it. With 177 training examples, supervised fine-tuning (SFT) got Llama-3.2-3B to about 5% execution success. Adding preference optimization (SFT+DPO) on the same data dropped results back toward zero-shot levels. The trained Qwen3.5-4B checkpoint produced nearly zero executable workflows. 1

The authors' conclusion: a few hundred teacher traces can teach a model workflow format. They cannot teach it how to recover when the real environment doesn't cooperate.

What Evoflux actually does

Instead of training, Evoflux adds an evolutionary repair loop at inference time. When a model generates a workflow, Evoflux doesn't just execute it — it runs a search process that:

Generates a population of workflow candidates
Tries to execute each one against the real tool environment
Scores results across six dimensions (including task completion, tool selection, parameter accuracy, dependency handling, and parallel efficiency)
Makes targeted edits to the strongest candidates — swapping tools, adjusting parameters, inserting validators, reordering steps
Runs the cycle again, adaptively increasing exploration when stuck
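
The shape of that loop can be sketched in a few lines. This is a toy under stated assumptions, not the Evoflux implementation: the catalog, the seed workflow, the single-number score, and names like `repair` and `mutate` are all invented for illustration. The "environment" is a dictionary lookup, and the score only counts steps that execute, where the real scorer is multi-dimensional and would also judge whether the task was completed.

```python
import random

# Hypothetical live catalog: tool name -> required parameter names.
CATALOG = {
    "search_papers": {"query"},
    "fetch_abstract": {"paper_id"},
    "summarize": {"text"},
}
TOOLS = sorted(CATALOG)
PARAMS = sorted({p for names in CATALOG.values() for p in names})

def step_ok(step):
    """Toy execution: a step runs if its tool exists and its parameters match."""
    tool, params = step
    return CATALOG.get(tool) == set(params)

def score(workflow):
    """Fraction of steps that execute (stand-in for a multi-dimensional score)."""
    return sum(step_ok(s) for s in workflow) / len(workflow)

def mutate(workflow, rng, n_edits):
    """Apply up to n_edits targeted edits, touching only steps that fail."""
    child = list(workflow)
    for _ in range(n_edits):
        failing = [i for i, s in enumerate(child) if not step_ok(s)]
        if not failing:
            break
        i = rng.choice(failing)
        tool, params = child[i]
        if rng.random() < 0.5:
            tool = rng.choice(TOOLS)            # swap the tool
        else:
            params = (rng.choice(PARAMS),)      # adjust the parameter
        child[i] = (tool, params)
    return child

def repair(seed_workflow, generations=200, pop_size=8, keep=2, seed=0):
    rng = random.Random(seed)
    population = [seed_workflow] + [
        mutate(seed_workflow, rng, 1) for _ in range(pop_size - 1)
    ]
    best, best_score, n_edits = seed_workflow, score(seed_workflow), 1
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        top = score(ranked[0])
        if top > best_score:
            best, best_score, n_edits = ranked[0], top, 1
        else:
            n_edits = min(n_edits + 1, 4)       # stuck: explore more per child
        if best_score == 1.0:
            break
        parents = ranked[:keep]                 # keep the strongest candidates
        population = parents + [
            mutate(rng.choice(parents), rng, n_edits)
            for _ in range(pop_size - keep)
        ]
    return best, best_score

# Seed plan with a stale tool name and a wrong parameter name.
SEED_WORKFLOW = [
    ("paper_search", ("query",)),
    ("fetch_abstract", ("id",)),
    ("summarize", ("text",)),
]
BEST, BEST_SCORE = repair(SEED_WORKFLOW)
```

The seed plan executes one step out of three. The loop keeps the two best candidates each generation, edits only the failing steps, and widens the number of edits per child when the best score stops improving. No weights change anywhere; all of the improvement comes from executing, scoring, and editing.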

The result on the same compact models: Llama-3.2-3B goes from ~3% to ~17% execution success. Qwen3.5-4B goes from ~3% to ~24%. 1 Still not solved, but roughly a 6–8× lift over the baseline without touching model weights, and without needing data that doesn't exist.
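
The arithmetic behind that lift, using the approximate percentages quoted above (the variable names are just for illustration):

```python
# Approximate execution-success rates (percent) as quoted in the text.
BASELINE = 3.0
WITH_REPAIR = {"Llama-3.2-3B": 17.0, "Qwen3.5-4B": 24.0}

LIFTS = {model: rate / BASELINE for model, rate in WITH_REPAIR.items()}
STILL_FAILING = {model: 100.0 - rate for model, rate in WITH_REPAIR.items()}

for model in WITH_REPAIR:
    print(f"{model}: {LIFTS[model]:.1f}x lift, "
          f"{STILL_FAILING[model]:.0f}% of workflows still fail")
```

Since the inputs are rounded, the ratios are rough too: about 5.7× for Llama-3.2-3B and 8× for Qwen3.5-4B. The second column is the sobering one, with roughly three out of four workflows still failing even after repair.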

GitHub repository: IBM/Evoflux (https://github.com/IBM/Evoflux)

The key design property: when the tool catalog changes, the repair loop adapts automatically. A fine-tuned model trained on yesterday's schema breaks silently. A repair loop that executes against the live catalog just... finds the new path.
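
To make the idea of repairing against the live catalog concrete, here is one possible repair edit, sketched under loud assumptions. Fuzzy name matching is a technique swapped in for illustration and is not claimed to be what Evoflux does; the tool names and the `repair_tool_names` helper are hypothetical. The only library call is `difflib.get_close_matches` from the Python standard library.

```python
import difflib

def repair_tool_names(plan, live_catalog, cutoff=0.6):
    """Swap tool names missing from the live catalog for the closest live name.

    Returns (repaired_plan, unresolved_names). The input plan is not modified.
    """
    repaired, unresolved = [], []
    for step in plan:
        name = step["tool"]
        if name not in live_catalog:
            matches = difflib.get_close_matches(name, live_catalog, n=1, cutoff=cutoff)
            if matches:
                step = {**step, "tool": matches[0]}   # copy with the live name
            else:
                unresolved.append(name)               # nothing close enough
        repaired.append(step)
    return repaired, unresolved

# Hypothetical live catalog after a rename; the plan was written against the old one.
LIVE_CATALOG = ["weather_forecast_v2", "geocode_address", "search_papers"]
STALE_PLAN = [
    {"tool": "get_weather_forecast", "args": {"city": "Troy"}},
    {"tool": "dex_quote", "args": {"pair": "ETH/USDC"}},
]
REPAIRED, UNRESOLVED = repair_tool_names(STALE_PLAN, LIVE_CATALOG)
```

The renamed weather tool gets picked up from the live catalog, while `dex_quote` has no close match and is reported as unresolved rather than guessed at. In a full repair loop a candidate like this would still need to be executed and scored, because a name match says nothing about whether the parameters still fit.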

The broader signal

Evoflux isn't an isolated paper. In this week's Hugging Face daily paper batches, 19 of 44 papers on June 12 and 18 of 44 on June 13 fell in the agents/search/tool-use category. 5 The field is converging on execution reliability as the hard problem, not model capability.

On the same day Evoflux dropped, a separate team from Imperial College London published Pythagoras-Prover, showing that a 4B model with the right training strategy can beat DeepSeek-Prover-V2-671B on formal math benchmarks: 86.1% vs. 82.4% pass@32 on MiniF2F-Test, with roughly 168× fewer parameters (671B vs. 4B). 6 The two papers are independent, but they point at the same structural claim: scale is not the only variable. Inference-time compute and training methodology are the levers that are moving now.

Microsoft ships product-level validation of this thesis too. MagenticLite, released in May, is a complete agentic system built entirely around small models: a 14B orchestrator and a 9B computer-use model that reaches state-of-the-art on Online-Mind2Web. 7 Their design principle: agentic capability depends on tool orchestration and action, not model knowledge alone. That's not a research claim — it's a shipping product from a company with every incentive to use bigger models if bigger actually meant better.

And the community is running its own tests. A Reddit benchmark (r/LocalLLM) put 26 local models through an 8-level tool-calling reliability gauntlet covering format compliance, tool selection, multi-step chaining, error recovery, and long context stability. Only 14 of 26 survived. The finding that resonated most: "Capability leaderboards tell you a model is 'smart,' but they say nothing about whether it can survive a tool calling loop without breaking the JSON, calling the wrong tool, hallucinating an ID, or dropping the role halfway through." 8

3 PM actions

1. Audit your agent failure mode before your next model upgrade. Before the next "let's try GPT-5 instead" conversation, document whether the current failures are planning failures (wrong strategy) or execution failures (right plan, broken environment). Execution failures — bad JSON, wrong tool name, broken parameter chain, hallucinated dependency — are Evoflux's target, and they're solvable without a model upgrade. Planning failures need a smarter model. Know which you have.

2. Add MCP-Bench to your evaluation stack. Evoflux is open-source under Apache 2.0 and available now at github.com/IBM/Evoflux. The benchmark it runs on — MCP-Bench with 28 live servers and 250 tools — is at github.com/Accenture/mcp-bench. If your team runs agent workflows against real APIs, the benchmark lets you measure execution reliability directly rather than inferring it from accuracy scores on chat benchmarks that correlate with almost nothing in production tool-calling loops.

3. Re-examine the "train vs. search" decision for your agent stack. The Evoflux authors' finding is a useful decision rule: if you have hundreds of teacher traces (not thousands), inference-time search is likely to beat fine-tuning. In the paper's experiments, fine-tuning on 177 examples barely moved execution success, and adding DPO on the same data erased even that gain. The better investment in that regime isn't more training data. It's a repair loop that executes against your actual tool catalog. Narwal Speaks called this the right frame for the whole category: "Most agent roadmaps over-invest in training traces and under-invest in repair loops." 4 That's a lever your team can pull today, with the tools and models you already have.
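
For the audit in action 1, a first pass can be as simple as tagging each failed run and counting. The sketch below assumes a hypothetical log format and hypothetical tag spellings; the execution tags mirror the examples listed in action 1, and anything that fits neither class is kept separate instead of being forced into one.

```python
from collections import Counter

# Hypothetical tags; the execution set mirrors the examples in action 1.
EXECUTION_TAGS = {
    "bad_json",
    "wrong_tool_name",
    "broken_parameter_chain",
    "hallucinated_dependency",
}
PLANNING_TAGS = {"wrong_strategy"}

def classify(tag):
    """Map a failure tag to 'execution', 'planning', or 'unclassified'."""
    if tag in EXECUTION_TAGS:
        return "execution"
    if tag in PLANNING_TAGS:
        return "planning"
    return "unclassified"

def audit(failure_log):
    """Count failures per class from records shaped like {'run': ..., 'tag': ...}."""
    return Counter(classify(record["tag"]) for record in failure_log)

# Illustrative records, not real data.
FAILURE_LOG = [
    {"run": 1, "tag": "bad_json"},
    {"run": 2, "tag": "wrong_tool_name"},
    {"run": 3, "tag": "wrong_strategy"},
    {"run": 4, "tag": "broken_parameter_chain"},
    {"run": 5, "tag": "timeout"},
]
SUMMARY = audit(FAILURE_LOG)
```

If the execution bucket dominates, a repair loop is the more promising investment. If the planning bucket dominates, the model-upgrade conversation is the right one to have.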

References