The Three Pillars Autonomous Research Keeps Mis-Building

The autonomous-research-agent genre stopped being a curiosity about eighteen months ago. AIDE, the agentic tree-search system built by Weco, is now the default scaffolding researchers strap on top of frontier models to run Kaggle competitions for them; on OpenAI's MLE-Bench, the combination of o1-preview and AIDE earns medals on 16.9% of competitions, four times the medal rate of the next best autonomous agent. Sakana's AI Scientist v2, built on the same tree-search foundation, generated the first paper accepted to Agents4Science 2025 — a venue that explicitly requires the first author to be an AI system, which is itself a piece of theater the field will want to revisit. Google's MLE-STAR, packaged in the open-source Agent Development Kit, hits medals in 63% of MLE-Bench Lite competitions by combining web search for initial solutions with ablation-driven targeted refinement. Microsoft's R&D-Agent, built for quantitative finance and data science workflows, splits the agent into research and development roles routed to different models. The category has its own benchmark, its own corner of arXiv, its own taxonomy of failure modes.

The interesting question is not whether autonomous research agents are real. It is what they are converging on, and what they are getting wrong.

Six pillars and the convergence happening behind them

Read enough of these systems and the shape becomes obvious. Every one of them — the named labs' projects, the open-source flagships, the long tail of grad-student repos pulling AIDE off the shelf — solves the same six problems in different costumes.

The cycle. The search policy. The memory architecture. Cost discipline. Verification. The execution sandbox.

These are the six pillars an autonomous research agent has to stand on. The genre is converging on them whether or not anyone has called them that, and the convergence itself is the unremarkable news. The second-order story is the one that matters: three of the six are operational pillars, three are epistemic, and the field has gotten the operational ones broadly right while building the epistemic ones in roughly the same wrong shape.

The operational three — the cycle, cost, and sandbox — are the parts of the system that fail loudly. A broken loop hangs the agent. A blown budget hits a credit card. An unsandboxed bug corrupts a filesystem. These failures are visible, and the field has accordingly built reasonable versions of each: tiered LLM routing, zero-cost log-based monitoring during training, preflight checks, NaN-and-divergence kill signals, increasing containerization. The 24/7 viability problem the first wave of these systems struggled with is largely solved.

The epistemic three — search, memory, verification — are the parts of the system that fail quietly. A bad search policy looks like the agent is not smart enough. A weak memory looks like a need for more compute. A missing judge looks like a lucky result. The failure modes are invisible at the level of an individual run. They only emerge as patterns across many runs, and only to a reader who is looking for them.

These are the three pillars the field is mis-building.

Search policy belongs in code, not in the prompt

The default move in most autonomous research systems is to ask the language model what to try next. The planner takes the project brief, the recent history of experiments, perhaps some retrieved past attempts, and produces a fresh configuration. The next experiment is whatever the LLM decided would be interesting.

This is the wrong design.

The reason it is the wrong design is that search policy is a stateful problem, and language models do not have state. They have context windows. Across cycles, the policy the planner is supposedly implementing — when to backtrack, when to exploit, when to explore — exists only as text in a prompt that gets re-sent every iteration. "If you have plateaued for three runs, backtrack to the best known configuration" reads to the human author as a rule. To the model, it reads as a suggestion that may or may not be salient depending on what else is in context.

The systems that win on the leaderboards have already moved past this. AIDE — and through it, AI Scientist, and through it, every successor that has adopted the tree-search scaffolding — organizes the entire history of experiment attempts into a tree. Each node is a candidate solution. Each edge is a refinement: the LLM was asked to improve a specific parent and produced a child. The choice of which parent to refine next is not the LLM's. It is a hard-coded search algorithm — typically a UCB-style bandit or a related best-first scheme — operating on the tree as data.

The LLM, in this design, does what it is actually competent at: take a goal, take a starting point, produce an implementation. The search policy lives in code and is therefore deterministic, inspectable, and reproducible. The model proposes patches; the algorithm decides which to keep.
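To make "the search policy lives in code" concrete, here is a minimal sketch of parent selection over an experiment tree. It assumes a UCB1-style rule; the `Node` schema, the exploration constant `c`, and the tie-breaking rule are illustrative choices of mine, not the internals of AIDE or any other named system.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One candidate solution in the experiment tree (hypothetical schema)."""
    node_id: int
    parent_id: Optional[int]
    score: float      # validation metric, higher is better
    visits: int = 0   # how many times this node was chosen as a parent


def select_parent(nodes: list[Node], c: float = 1.0) -> Node:
    """Pick the next node to refine with a UCB1-style rule.

    The exploitation term is the node's own score; the exploration bonus
    shrinks as the node is refined more often. No LLM is consulted here.
    """
    total = sum(n.visits for n in nodes) + 1

    def ucb(n: Node) -> float:
        bonus = c * math.sqrt(math.log(total + 1) / (n.visits + 1))
        return n.score + bonus

    # Ties break on the lower node_id so the choice is reproducible.
    return max(nodes, key=lambda n: (ucb(n), -n.node_id))


def record_child(nodes: list[Node], parent: Node, child_score: float) -> Node:
    """Attach the child the LLM produced and count the visit to the parent."""
    parent.visits += 1
    child = Node(node_id=len(nodes), parent_id=parent.node_id, score=child_score)
    nodes.append(child)
    return child
```

The LLM only ever sees the node that `select_parent` returns. Whether to backtrack, exploit, or explore is decided by arithmetic over the tree, so the same tree always yields the same choice.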

MLE-STAR takes the move one step further. Once you have a tree, the obvious next refinement is to make the refinement targeted. Rather than ask the LLM to improve a solution wholesale, MLE-STAR runs ablations to determine which code block in the current best solution is the load-bearing one, then asks the LLM to refine only that block. The whole-config rewrites earlier systems produced — and that the LLM is in fact pretty good at producing — turn out to be too wide a change to ablate cleanly. Targeted refinement of single components, guided by which components moved the metric, is the move that produced 63% medal rates on MLE-Bench Lite.
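The ablation step can be sketched in a few lines. This is an illustration of the idea, not MLE-STAR's actual code: the function name, the dictionary-of-blocks representation, and the `baselines` argument (a trivial stand-in for each block) are all assumptions made for the example.

```python
from typing import Callable


def pick_block_to_refine(
    blocks: dict[str, str],
    evaluate: Callable[[dict[str, str]], float],
    baselines: dict[str, str],
) -> tuple[str, dict[str, float]]:
    """Ablate each block in turn and return the one whose removal hurts most.

    blocks    -- named code blocks of the current best solution
    evaluate  -- runs a pipeline assembled from blocks, returns the metric
    baselines -- a trivial replacement for each block (e.g. "no scaling")
    """
    full_score = evaluate(blocks)
    impact: dict[str, float] = {}
    for name in blocks:
        ablated = dict(blocks)
        ablated[name] = baselines[name]      # swap in the trivial version
        impact[name] = full_score - evaluate(ablated)
    # The block with the largest drop is the load-bearing one.
    target = max(impact, key=lambda k: impact[k])
    return target, impact
```

The returned `target` is what the LLM is asked to refine. One evaluation per block is the price of knowing where the metric actually comes from.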

R&D-Agent makes a different but related move. It splits the agent into a research role and a development role and routes them to different models — o3 for ideation, GPT-4.1 for implementation. The split is structural: the goal is to keep the two roles' priors from contaminating each other. The same logic applies inside the search policy itself. A model that proposes the next experiment in the same forward pass that judges the last one is too entangled to trust on either.

The pattern across these wins is consistent. Pieces of agent behavior originally handled by "ask the LLM to think about it" get extracted into code, into separate roles, or into separate model instances with different priors. Anything that does not need to be in the LLM's working memory should not be.

This is also where Lossfunk's January 2026 case study of four autonomous-research attempts becomes relevant. Three of the four attempts failed during implementation or evaluation. The failure modes documented were not exotic. They were depressingly standard: bias toward training-data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks. The paper's first design principle for more robust AI-scientist systems amounts to "multi-turn agentic task design works better than zero-shot generation" — which, translated, means putting structure outside the model rather than inside the prompt.

Moving search policy into code does not eliminate the LLM. It puts the LLM in the position where it is competent. Given a parent configuration, a focus dimension, an established lesson set, and a clear refinement objective, modern frontier models produce sensible children. Given an unbounded "what should we try next, with these forty pages of context" prompt, they produce the kind of drift Lossfunk documented.

The check to run on any autonomous research system is mechanical. Open the planner's system prompt. If you see instructions of the form "if you observe X, consider doing Y," the search policy is in the prompt — which is to say, it is a hope. If the planner can only respond to a target the system handed it — parent node, refinement focus, constraints — the search policy is in the code. The former is the default. The latter is what should be the default.

Verification is the pillar that will eat the field

If the search-policy mistake is endemic but tractable, the verification mistake is the one that will determine whether autonomous research as a category survives its first round of public failure.

Most autonomous research agents verify by checking that the training process exited with status code zero. The slightly more sophisticated ones add NaN detection, divergence kill signals based on rising loss, and stall detection when the log stops growing. These are useful operational signals. They are not verification.

Verification, properly understood, is the question: do we believe the metrics this run claims?

This is a different question from "did the run finish." A run can finish cleanly and report 92% accuracy that does not survive a re-evaluation on a held-out test set. A run can converge to a low loss because of a bug in the data loader that lets the label leak into the input. A run can claim to have improved on the baseline because the final epoch happened to be the lucky one and the prior ten epochs of loss curve were flat or rising. Each of these is a success by the operational definition. None is a success by any definition that should be allowed to compound.

The Lossfunk paper names this failure mode explicitly and lists it among the six most important they observed. They call it "overexcitement that declares success despite obvious failures." The agent reads its own evaluation pipeline's output, sees the numbers, declares the experiment a win, and propagates that win as the parent of further refinement attempts. Call this a phantom success: a run the system records as a win without anyone, human or model, ever verifying that what it reported was real. The subtree that grows out of a phantom success is itself entirely phantom.

This is a structural problem, not a model-capability problem. Make the model better at reading logs and it still has the same defect: the entity that proposed the experiment and the entity that grades it are the same entity in the same conversational context. The grader's priors are aligned with the proposer's. The model knows what good accuracy *should* look like for a ResNet on CIFAR — knows from training data — and pattern-matches to that knowledge when the numbers happen to fall in the expected range.

What is needed is an independent judge. Different prompt. Ideally different model. Explicit adversarial framing: assume the agent is lying, and find the evidence. Given the run's config, the claimed metrics, the monitor's reported status, and the raw tail of the log, return a structured verdict: do you trust the metrics, with what confidence, what are the specific concerns, and what is the recommended next action.

The verdict has to be empowered to do something. If the judge does not trust the metrics, the run's status downgrades from "success" to "suspect." Downstream queries — the search policy's "what is the best node we have," the memory store's "what configurations have worked" — filter suspect nodes out. The system continues to operate, but it no longer builds on the unverified result.
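A minimal sketch of the verdict and the downgrade, under stated assumptions: the field names, the three status strings, and the `min_confidence` threshold are hypothetical, and the judge's LLM call itself is left out. Only the plumbing that gives the verdict teeth is shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    """Structured output the judge is asked to return (hypothetical schema)."""
    trust_metrics: bool
    confidence: float                          # 0.0 to 1.0
    concerns: list[str] = field(default_factory=list)
    next_action: str = "accept"                # e.g. "accept", "rerun_eval"


@dataclass
class Run:
    run_id: str
    status: str                                # "success", "suspect", "failed"
    metric: float


def apply_verdict(run: Run, verdict: Verdict, min_confidence: float = 0.6) -> Run:
    """Downgrade a claimed success when the judge does not trust it."""
    distrusted = (not verdict.trust_metrics) or verdict.confidence < min_confidence
    if run.status == "success" and distrusted:
        run.status = "suspect"
    return run


def best_trusted(runs: list[Run]) -> Optional[Run]:
    """The search policy's 'best node' query: suspect runs are filtered out."""
    trusted = [r for r in runs if r.status == "success"]
    return max(trusted, key=lambda r: r.metric, default=None)
```

A run reporting 0.92 that the judge flags for a suspected label leak stays in the ledger, but `best_trusted` returns the 0.81 run the judge cleared.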

This is the meta-pattern the Map calls Verification Renaissance, applied at the agent layer rather than the model layer. In the model context, the renaissance is about SMT solvers and zkML and mechanistic interpretability probes as oversight backends for individual model behaviors. In the agent context, the same logic applies: trust requires a separable oracle. The proposer cannot also be the grader. The grader, ideally, comes from a different training distribution and brings different priors.

The structural defensibility of an entire autonomous research stack hinges on this layer being present and load-bearing. Without it, the failure mode is not "the system performs worse." The failure mode is "the system publishes a result that does not replicate." When that happens — and it will happen, because the volume of experiments these systems run guarantees that some unverified phantom will eventually graduate into a public claim — the field's permission to operate retracts. Reviewers stop trusting the methodology. Conference programs stop accepting papers written by these systems. The category enters a credibility winter that will be referenced in retrospect as the moment the field had to start over with adult supervision.

The work to avoid that winter is not subtle. An independent judge is a few hundred lines of code, one frontier-model API call per cycle, and a willingness to let it veto. The cost in tokens is negligible. The cost in agent confidence is real: building a system that can downgrade its own success requires resisting the instinct to claim wins. That instinct is precisely the problem. The judge is regulation in advance, an internal compliance mechanism the system builds for itself before the external compliance arrives.

The other thing the judge does, almost as a side effect, is constrain the search subtree. A suspect node is not invisible — it stays in the tree as a recorded attempt — but it is no longer a candidate parent for refinement. The agent does not get to compound an unverified result. Whatever it tries next has to start from a node the judge cleared. This is the difference between an autonomous research stack that accumulates compounding insight and one that accumulates compounding error. The compounding goes both ways.

The check to run on any autonomous research system is mechanical, the same way it was for search policy. Find the verification step. If the same agent that ran the experiment also grades it, in the same context, with the same prompt, there is no judge, only self-assessment. If a separate component does the grading, with a different prompt, adversarial framing, and the authority to mark a successful run as suspect, the judge is present. The systems that survive the credibility winter will be the ones that built this layer before they needed it.

Memory should generalize, not just persist

The third pillar the genre is mis-building is memory.

The default implementation, across most public systems, is a JSON file. Every experiment appends a record: configuration, metrics, status, a one-line verdict. The planner, on its next cycle, gets a slice of recent records as context. Sometimes the system also runs TF-IDF or vector retrieval to surface the most relevant past attempts. This is described as "memory." It is, in practice, logs.

A log is a description. A memory is a generalization. The difference between "run 7 used learning rate 0.1 and got 71% accuracy" and "learning rate above 0.05 has consistently hurt on this task across four of five runs" is the difference between data and a model of the data. The first lives in storage. The second changes what the agent decides to do next.
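The step from records to a lesson can be shown on a toy ledger. Everything here is illustrative: the ledger values are invented apart from run 7, the comparison against the median of the low-learning-rate runs is one simple choice among many, and in a real system an LLM extractor would propose the hypothesis rather than having it hard-coded.

```python
from statistics import median
from typing import Optional

# Toy ledger. Only run 7 echoes the example in the text; the rest is made up.
LEDGER = [
    {"run_id": 1, "lr": 0.01, "acc": 0.78},
    {"run_id": 2, "lr": 0.03, "acc": 0.80},
    {"run_id": 3, "lr": 0.05, "acc": 0.79},
    {"run_id": 4, "lr": 0.10, "acc": 0.70},
    {"run_id": 5, "lr": 0.08, "acc": 0.74},
    {"run_id": 6, "lr": 0.20, "acc": 0.65},
    {"run_id": 7, "lr": 0.10, "acc": 0.71},
    {"run_id": 8, "lr": 0.06, "acc": 0.81},
]


def lr_lesson(records: list[dict], threshold: float = 0.05,
              min_support: float = 0.75) -> Optional[dict]:
    """Turn raw records into a generalization with provenance, or None."""
    low = [r for r in records if r["lr"] <= threshold]
    high = [r for r in records if r["lr"] > threshold]
    if not low or not high:
        return None  # nothing to compare against
    reference = median(r["acc"] for r in low)
    worse = [r for r in high if r["acc"] < reference]
    if len(worse) / len(high) < min_support:
        return None  # evidence too thin to call it a lesson
    return {
        "claim": f"learning rate above {threshold} hurt in "
                 f"{len(worse)} of {len(high)} runs",
        "support": [r["run_id"] for r in worse],
    }
```

On this ledger `lr_lesson(LEDGER)` yields the claim "learning rate above 0.05 hurt in 4 of 5 runs" with runs 4, 5, 6 and 7 as support. The ledger holds eight descriptions; the return value is the one thing that changes the next proposal.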

Almost no public autonomous research system builds the second layer. There are exceptions — the MARS paper, released earlier this year, describes what it calls "comparative reflective memory" of debug lessons, which is the same idea under a different name; ML-Master 2.0 introduces a hierarchical cognitive cache that separates scripts, facts, and strategies into different layers — but the broader field treats memory as append-only history. The agent re-derives the same conclusions every cycle from raw records because there is no layer between raw records and the planner's context.

The reason almost nobody builds this is the same reason the verification layer gets skipped: it is an offline step. It costs an LLM call every N cycles. It does not fire on the critical path of a single experiment. The cost-conscious thing to do, when you are watching the meter, is to remove anything that does not directly contribute to the next run finishing. Lessons extraction looks like overhead.

It is not overhead. It is the difference between an agent that runs two hundred experiments and an agent that learns across two hundred experiments. The first is a Kaggle grinder. The second is a researcher.

A working lessons layer has three properties. It is periodic — re-derived from the ledger every few cycles, not appended to. It is constrained — a finite number of lessons, with provenance pointing to the specific run ids that support each one. And it is opinionated — the extractor is told what a lesson looks like, with an explicit contrast between description and generalization, so the output is not just "summarized history."

The structural payoff is that the planner's context, in the next cycle, includes a separate block — something like ESTABLISHED LESSONS. Not buried inside a list of recent runs. Not interleaved with verdicts. A separate, named layer the planner is told is more durable than the raw history. Lessons that survive multiple extraction rounds become "established"; the ones with thin support stay "tentative." The agent's own beliefs are versioned.
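One way to wire the three properties and the versioning together is sketched below. The thresholds for "established" (two extraction rounds, three supporting runs), the cap of ten lessons, and matching lessons by exact claim text are simplifying assumptions of mine; a real extractor would need a fuzzier notion of "the same lesson".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    claim: str
    run_ids: tuple[str, ...]   # provenance: runs that support the claim
    rounds_survived: int = 1   # extraction rounds in which the claim appeared

    @property
    def status(self) -> str:
        strong = self.rounds_survived >= 2 and len(self.run_ids) >= 3
        return "established" if strong else "tentative"


def merge_round(previous: list[Lesson], fresh: list[Lesson],
                max_lessons: int = 10) -> list[Lesson]:
    """Replace the lesson set with this round's extraction.

    Lessons absent from `fresh` are dropped rather than carried forward,
    so the set is re-derived each round instead of appended to.
    """
    seen = {lesson.claim: lesson for lesson in previous}
    merged = []
    for lesson in fresh:
        old = seen.get(lesson.claim)
        rounds = old.rounds_survived + 1 if old else 1
        merged.append(Lesson(lesson.claim, lesson.run_ids, rounds))
    merged.sort(key=lambda x: (-x.rounds_survived, -len(x.run_ids), x.claim))
    return merged[:max_lessons]


def render_block(lessons: list[Lesson]) -> str:
    """Build the separate, named block that goes into the planner's context."""
    lines = ["ESTABLISHED LESSONS"]
    for lesson in lessons:
        if lesson.status == "established":
            lines.append(f"- {lesson.claim} (runs: {', '.join(lesson.run_ids)})")
    return "\n".join(lines)
```

Tentative lessons are kept in the set but left out of the rendered block here; a fuller version might give them their own, explicitly weaker, section.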

This is the layer that lets a 24/7 autonomous system actually compound. A planner that sees, every cycle, "lr above 0.05 hurts on this task" as an established lesson with four supporting run ids stops proposing learning rates above 0.05. The proposal space contracts as evidence accumulates. The search policy's effective branching factor shrinks. The agent gets faster at converging because it stops re-exploring known dead ends.

Without this layer, the same fifteen runs that taught the lessons stay in the ledger as descriptions. The planner sees them as fifteen separate data points and re-derives the conclusion — sometimes — depending on which recent runs ended up in its retrieval window. The compounding happens by accident rather than by structure.

The check to run on any autonomous research system is, again, mechanical. Open whatever the system calls its memory. If you see run records — configurations, metrics, verdicts — that is logs. If you see causal claims — statements about what works on this task, with specific run ids cited as supporting evidence — that is memory. Almost every public system fails this check. The few that pass tend to bury the lessons layer inside a complicated multi-tier architecture that obscures the simple structural move: the extractor is a periodic offline pass, and the lessons it writes are a separate layer in the planner's prompt.

The pillars the field has right are the operational ones

It is worth being clear about which three pillars the field has gotten right, and why.

The cycle — the loop that goes propose, implement, run, evaluate, reflect — is shared across every system in this category. The shape varies: a propose/execute/reflect three-phase loop in some, a research-and-development split in others, an iterative draft-debug-improve cycle in the AIDE lineage. The cycle is the part that makes the system an agent rather than a code generator. Without it, you have an autocomplete; with it, you have something that closes the feedback loop back into proposal. Every system in the genre solves this, because not solving it produces a non-agent that nobody mistakes for the category.

Cost discipline is the second pillar the field has converged on. Tiered LLM routing — frontier models for hard reasoning, cheap models for routine summarization, occasionally local models for trivial reformatting — is now table stakes. Zero-cost monitoring during training, in which the agent uses process checks and log reads rather than periodic LLM polls, was a real innovation in earlier work and is now standard. Anthropic-style prompt caching is becoming default for any system that sends the same system prompt repeatedly. Budget meters that enforce wallclock, GPU-hour, and dollar caps are universal. The "running an agent 24/7 will bankrupt me" problem the first wave of these systems struggled with is largely behind us.
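For readers who have not seen one, a budget meter and a tier router are small. The sketch below is generic: the tier names are placeholders rather than real model identifiers, and the three caps simply mirror the wallclock, GPU-hour, and dollar limits mentioned above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Placeholder model names; substitute whatever your provider offers.
TIERS = {
    "reasoning": "frontier-model",
    "summarize": "cheap-model",
    "reformat": "local-model",
}


def route(task_kind: str) -> str:
    """Map a task kind to a model tier, defaulting to the cheap tier."""
    return TIERS.get(task_kind, TIERS["summarize"])


@dataclass
class BudgetMeter:
    max_dollars: float
    max_gpu_hours: float
    max_wallclock_s: float
    dollars: float = 0.0
    gpu_hours: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, dollars: float = 0.0, gpu_hours: float = 0.0) -> None:
        """Record spend after an LLM call or a training run."""
        self.dollars += dollars
        self.gpu_hours += gpu_hours

    def exceeded(self, now: Optional[float] = None) -> list[str]:
        """Return the names of every cap that has been crossed."""
        now = time.monotonic() if now is None else now
        hit = []
        if self.dollars > self.max_dollars:
            hit.append("dollars")
        if self.gpu_hours > self.max_gpu_hours:
            hit.append("gpu_hours")
        if now - self.started > self.max_wallclock_s:
            hit.append("wallclock")
        return hit
```

The coordinator would check `exceeded()` before launching each cycle and stop when the list is non-empty.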

The sandbox is the third. The standard kit is preflight checks before launching expensive runs, NaN and divergence kill signals that terminate doomed training early, disk and resource sanity checks, and, increasingly, containerization for systems running unattended. The published guidance is widely understood and increasingly implemented: AI Scientist's own README explicitly warns to containerize because LLM-generated code is going to do something unsafe eventually.
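The kill signals are cheap to compute from the training log alone, with no LLM call. The sketch below assumes log lines that contain something like `loss: 0.52` or `loss=nan`; the regular expression, the window of three consecutive rises, and the ten-minute stall limit are illustrative defaults, not values taken from any particular system.

```python
import math
import re
from typing import Optional

# Matches "loss: 0.52", "loss=1e-3", "loss nan" (assumed log format).
LOSS_RE = re.compile(
    r"loss[=:\s]+([-+]?(?:\d+\.?\d*(?:[eE][-+]?\d+)?|nan|inf))",
    re.IGNORECASE,
)


def parse_losses(log_text: str) -> list[float]:
    """Extract every reported loss value from the raw log, in order."""
    return [float(m) for m in LOSS_RE.findall(log_text)]


def kill_signal(losses: list[float], window: int = 3,
                stalled_s: float = 0.0,
                stall_limit_s: float = 600.0) -> Optional[str]:
    """Return a reason to terminate the run, or None to let it continue."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "nan"
    recent = losses[-(window + 1):]
    rising = all(b > a for a, b in zip(recent, recent[1:]))
    if len(recent) == window + 1 and rising:
        return "divergence"   # loss rose `window` times in a row
    if stalled_s > stall_limit_s:
        return "stall"        # the log has stopped growing
    return None
```

These signals say whether a run is worth finishing. They say nothing about whether the metrics it eventually reports are believable.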

These three pillars are operational, in the sense that their failure modes are visible at the level of a single run. A broken loop hangs the agent and the operator notices. A blown budget hits the credit card and the finance person notices. A bug that escapes the sandbox corrupts the filesystem and somebody pages somebody else. Operational pillars get fixed because the failures get noticed.

The three pillars the field is mis-building — search policy, verification, memory — are epistemic. Their failure modes are not visible at the level of a single run. A bad search policy produces an autonomous session that runs to completion and yields a result that happens to be worse than it could have been. A missing judge produces a session that yields a result that does not replicate. A logs-only memory produces a session that re-explores known dead ends and looks slow rather than incompetent.

The pattern is consistent across agent systems generally, not just this category. Operational pillars get fixed first. Epistemic pillars get fixed last. Same dynamic, different domains. The autonomous research category is the current illustration.

An exhibit, not an answer

This essay has an artifact attached, which is part of why I wrote it.

The system I built while working through these three pillars is open source, Apache 2.0, in a layout that should drop into most existing training repos. It implements a tree-search coordinator that picks parent nodes via UCB with ablation-focus rotation. It implements an adversarial judge that runs after every successful experiment, inspects the claimed metrics against the raw log, and downgrades success to suspect when the metrics do not pass scrutiny. It implements a hierarchical memory architecture with a lessons extractor that fires every five cycles and produces causal generalizations with run-id provenance. The three pillars I have argued the field is mis-building, wired together in roughly the shape I think they should be wired.

The framework also implements the three pillars the field has gotten right — a clean propose-implement-run-evaluate cycle, tiered LLM routing with real cost tracking, preflight gates and kill-signal-aware monitoring. None of that part is original. It is competent operational scaffolding of the kind that has become standard in the genre, included because the structural argument requires the operational pillars to be present for the epistemic ones to land cleanly. A judge layer on top of a system that cannot even keep its own subprocess alive is not a judge layer. It is a press release.

The choice to release the code is not the news peg of this piece. The structural claim is the piece. The code is what one structural claim looks like when implemented competently — not as a product, not as a benchmark contender, but as an exhibit. Drop it into a project, point it at a training script, swap the LLM provider, and the wiring becomes legible. The bits I have argued for in the abstract become bits you can read in `core/tree.py` and `core/search.py` and `core/judge.py` and `core/memory.py`.

Plenty is still missing from the implementation, and it is worth naming what. There is no web-bootstrap step of the kind MLE-STAR uses to seed initial solutions from current SOTA on the open web; the framework starts from the language model's priors, and so inherits the training-data-defaults failure mode the Lossfunk paper documented. There is no real containerization; the engineer runs subprocesses in the project directory, which is fine for a controlled environment and unsafe for true 24/7 unattended operation. Prompt caching is not wired into the Anthropic client, which leaves cost on the table. Token counts are tracked only at the tier level, not per record. The coordinator is sequential; multi-GPU parallel workers would require an asyncio rewrite.

These are operational gaps. They matter for using the framework in production and not at all for the structural argument. The point of the exhibit is the three things the field is mis-building, not the additional operational polish the field has already figured out.

The framework will get those operational gaps fixed, eventually, and so will the broader field. Web bootstrap, sandboxing, prompt caching, parallel workers — these are all standard moves that someone will copy from a more polished open-source system within months. The harder thing to copy is the structural commitment to keeping search policy in code, judges adversarial, and memory generalized. That commitment shows up in the architecture, not in any single feature. A system can implement all the operational polish in the world and still have its planner deciding what to try next based on a vibe.

The exhibit is here so that the structural commitment is concretely demonstrated, not just argued. Read the architecture. Disagree with the specific implementation choices. But notice that the choices are present and named — and notice how much of the public field's code does not even have the named choices to disagree with.

Three checks before trusting any of these systems

Three mechanical checks to run on any autonomous research system before deciding whether to deploy it, fork it, or take its results seriously.

The search-policy check. Open the planner's system prompt. Read it. If you see English-language rules of the form "if you have plateaued for three runs, consider backtracking to the best known configuration" or "prefer exploration when consecutive failures exceed two" — the search policy is in the prompt. It is, in other words, a hope. The model may follow these rules; the model may not. Across long sessions, with growing context, with the accumulated drift the Lossfunk paper documented, the model will not follow them as consistently as a hard-coded algorithm would. If, instead, the planner is given a specific target — refine this parent node, focused on this dimension — and the choice of parent and dimension happens in a non-LLM component, the search policy is real. This separates compounding systems from drift-prone ones.

The verification check. Find the verification step. Trace which component grades the experiment. If it is the same component that ran the experiment, in the same context, with the same prompt — there is no verification, only self-assessment. If it is a separate component, with a different prompt, adversarial framing, and the authority to downgrade a successful run to a suspect one — verification is present. This separates systems that will eventually publish irreproducible results from systems that will catch the phantom successes before they propagate.

The memory check. Open whatever the system calls its memory. If you see run records — configurations, metrics, verdicts — laid out as a chronological sequence, that is logs. If you see causal claims — generalizations about what works on this task, with specific run ids cited as supporting evidence — that is memory. This separates systems that re-derive the same conclusions every cycle from systems that compound their conclusions across cycles.
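The first of these checks can even be roughed out as a script. What follows is a crude heuristic, not a validated detector: the keyword list and the sentence splitting are guesses of mine, and a clean result does not prove the policy lives in code. It only flags English-language rules of the kind quoted above so a human knows where to look.

```python
import re

# Conditional phrasing that tends to mark a search policy written as prose.
POLICY_RULE = re.compile(
    r"\b(?:if|when)\b[^.\n]*\b(?:consider|backtrack|explore|exploit|prefer)\w*"
    r"|\bprefer\b[^.\n]*\bwhen\b",
    re.IGNORECASE,
)


def policy_rules_in_prompt(prompt: str) -> list[str]:
    """Return sentences that look like search-policy rules written in English."""
    sentences = re.split(r"(?<=[.!?])\s+|\n+", prompt)
    return [s.strip() for s in sentences if POLICY_RULE.search(s)]
```

Run it over the planner's system prompt. Every sentence it returns is a candidate rule that the model is being trusted to remember, rather than a decision the code makes.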

Each of these checks takes under a minute, assuming the codebase is readable. They will reject the majority of public autonomous research systems available today. That is not a criticism of those systems; it is a description of the state of the genre. The pillars I have described are not difficult to implement. They are simply not yet defaults.

Where this goes

Autonomous research agents are converging on a standard architecture. The operational pillars (cycle, cost, sandbox) will become invisible plumbing within a year. The differentiation is happening in the three pillars the field has built in roughly the same wrong shape, and it will determine which systems survive the first public failure of an autonomously generated result that does not replicate.

In three years the stack will look standard. The systems that survive will be the ones that moved their search into code, kept their judges adversarial, and let their memory generalize across runs. The autonomous research category will mature the moment those three become as routine as code review. Right now, most systems are running merges without anyone reading the diff.

The code

I have shared my code as Deep Researcher, under Apache 2.0.

I'm releasing it because the structural argument above lands better with a working exhibit than without. The license is permissive on purpose — fork it, vendor it into your own stack, build a product on top. I haven't patented anything in it and won't.

A note on the kind of collaboration I'd actually find useful, because "PRs welcome!" is the most useless sentence in open source.

Easy yes: the operational gaps named in the README's "what's missing" section. Anthropic prompt caching, web-bootstrap of initial solutions, real Docker sandboxing, parallel workers via asyncio, per-record token counts wired from `resp.usage`. These are standard moves the field has already figured out and I left them as future work because the piece is about the three epistemic pillars, not the operational polish. A PR that closes any of them gets a same-day review.

Talk first: changes to `core/tree.py`, `core/search.py`, `core/judge.py`, or `core/memory.py`. These are the load-bearing structural commitments — if you think the search policy should be MCTS instead of UCB, or that the judge should run on a separately fine-tuned verifier, I want that conversation in an issue before a PR. Not because I'll reject the idea — I might love it — but because the structural choices are the part of the framework that shouldn't change without argument.

Genuinely would love this: someone running it against MLE-Bench. I haven't, because I don't have the harness set up. Whoever does it first gets co-authorship on the inevitable follow-up post.