The Harness Is the Product Now | Abhishek Shankar's Blog

For two years the reflex when an agent failed was to reach for a bigger model. The agent couldn't plan, so you swapped GPT-4 for the next tier, paid four times the token rate, and the failure went away. That move worked because raw capability was the binding constraint. It isn't anymore.

What nobody building agents wants to admit out loud is that the model has quietly become the cheap part. The expensive part, the part that actually decides whether your agent finishes the task, is the thin layer of scaffolding wrapped around it: the control loop, the tool router, the retry logic, the prompt that tells a 7B model exactly what to do next. I call that layer the harness. The harness is where the work moved.

The convergence already happened

I named Model Convergence Pressure on the Map as raw capability flattening while the gains migrate to topology, rubrics, memory, and scaffolding. The evidence has since gotten blunt.

In December a team published Confucius Code Agent and ran the cleanest version of the experiment I have seen. Claude 4.5 Sonnet, the weaker model, paired with their scaffold, resolved 52.7% of the benchmark. Claude 4.5 Opus, the stronger model, paired with Anthropic's own scaffold, resolved 52.0%. The weaker brain with the better harness won. Hold the model fixed and the same scaffold pushed Claude 4 Sonnet to a 74.6% resolve rate on SWE-Bench-Verified, above the strongest open-source system on identical hardware.

Read those numbers twice. The difference between a frontier model and the tier below it, on a real task, was erased by the wrapper. Not narrowed. Erased, and then reversed.

Go smaller and it gets stranger. AWS researchers fine-tuned a 350-million-parameter model, OPT-350M, on tool-calling data and hit a 77.55% pass rate on ToolBench. The baselines it beat included ChatGPT with chain-of-thought at 26%. A model five hundred times smaller, doing one thing, beat the general-purpose giant at that one thing by fifty points. An NVIDIA position paper in June made the structural version of the claim: on benchmarks that measure agentic utility rather than trivia recall, small models routinely outperform large ones, and the only real barriers left are inertia and marketing budgets.

What a thin harness actually is

The instinct, once you accept this, is to build a thick harness. Resist it. The teams that win build the thinnest possible layer that does four things and nothing else.

It routes. Most steps in an agent loop are not hard. Classifying an email, extracting a field, deciding which of three tools to call: a 7B model does these for a fraction of a cent. A thin harness sends the easy steps to the cheap model and escalates only the genuinely hard ones. ToolOrchestra and the MaAS line of work show orchestration beating monolithic systems on both accuracy and cost, which means the routing layer is not a tax you pay for going cheap. It is where the performance comes from.

It constrains. A cheap model fails when you hand it an open-ended instruction and pray. It succeeds when the harness narrows the decision to a typed choice. Give it a schema, not a paragraph. The OPT-350M result is not a story about a clever model. It is a story about a task compressed until a tiny model could not get it wrong.

It loops with verification. The L0 system took Qwen2.5-7B from 30% to 80% on SimpleQA using nothing but a read-eval-print loop and verifiable rewards. Same model, same weights. The lift came entirely from letting the model act, checking the result, and feeding the check back. The harness supplies the checking the cheap model cannot supply for itself.

It stays disposable. The harness is not the asset; the spec it enforces is, and the harness exists only to enforce that spec. When the next cheap model ships, and it ships every quarter, you should be able to swap the weights and keep the scaffold. A harness welded to one model's quirks is a liability with a short shelf life.

Where the moat moved

This reframes the whole build-versus-buy question for anyone shipping agents on a budget. The expensive frontier model is a commodity you rent by the token, and its advantage over the cheap tier is a wasting asset. The harness is the thing you own. It encodes your task decomposition, your tool definitions, your verification rubric, your routing policy. None of that transfers to a competitor when they rent the same model you do.

The companies that overpay are the ones still treating the model as the product and the harness as glue code an intern can write between other tickets. They have the ratio backwards. The model is the glue, rented and interchangeable. The harness is the product.

Watch which side of that line your spend falls on. If your agent bill is mostly frontier tokens and your scaffold is a prompt template someone wrote in an afternoon, you are subsidizing a capability gap that the harness was supposed to close. Build the harness, shrink the model, and the gap closes for free.