The Averaging Tax — Why Class Conditioning Isn't a Feature

There is a way to read the literature on class-conditional flow matching that makes it sound like a minor convenience. You train the flow, feed a label embedding into the network, and get controllable generation. The label is described as a control knob. The conditioning is described as added functionality. The unconditional model is treated as the base case and conditional generation as the augmented case: a richer interface on top of an otherwise complete object.

That reading misses what is actually going on.

The unconditional flow model is not a complete object. It is a system trying to solve a regression problem whose minimizer at every point is a mean over incompatible targets — a mean that, at points where multiple modes route through the same intermediate state, is not a velocity any real trajectory uses. Class conditioning doesn't add a feature to a working system. It removes a contradiction in a system that, without it, is paying a tax on every parameter update.

I want to name that tax — the averaging tax — and use it to argue something larger. The same mathematical structure that makes class conditioning load-bearing for diffusion and flow models is the structure that explains why the action in generative AI has moved away from the base model and toward the conditioning architecture. Scaffolding, RLHF, retrieval, agentic loops, memory, persona — these are not capability additions either. They are interventions that reduce the averaging tax on systems that would otherwise be doing the same thing flow models do at multimodal intermediate points: averaging directions that no real path takes. The labs and operators winning right now are the ones who have, often without naming it, internalized this. The ones still trying to win the model-capability race are losing for reasons they haven't articulated.

The squared-error trap

Flow matching, in its modern form, learns a velocity field. At every point in space and every moment along the noise-to-data interpolation, the model predicts a velocity: an arrow telling a particle which way to move to get from noise toward data. Training is a regression: the model's predicted velocity is compared against the velocity that a particular training trajectory has at that location, and the difference is penalized with a squared error. That formulation, due to Lipman and collaborators in 2022 and rapidly adopted across the diffusion community, is now one of the dominant training paradigms for state-of-the-art generative models in pixel and latent space.

This is a precise enough description that we can ask: what does the model converge to, in the limit of infinite capacity and infinite data?

The minimizer of a squared-error loss is always a conditional mean. The optimal predictor at any input is the expectation of the target conditional on that input. For flow matching, the predictor's input is the position-time pair (x_t, t), and the target is the velocity v that the trajectory through x_t at time t wants to use. The optimal velocity field, then, is E[v | x_t, t]: the average velocity across all training trajectories that pass through x_t at time t.
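The conditional-mean claim follows from a short decomposition. What follows is a sketch under the usual assumption that v has finite second moments; the symbols f (any candidate predictor) and m (the conditional mean) are names introduced here for illustration.

```latex
\begin{aligned}
m(x_t, t) &:= \mathbb{E}[\, v \mid x_t, t \,], \\
\mathbb{E}\,\lVert v - f(x_t, t) \rVert^2
  &= \mathbb{E}\,\lVert v - m(x_t, t) \rVert^2
   + \mathbb{E}\,\lVert m(x_t, t) - f(x_t, t) \rVert^2
   + 2\,\mathbb{E}\big[ (v - m)^\top (m - f) \big], \\
\mathbb{E}\big[ (v - m)^\top (m - f) \big]
  &= \mathbb{E}\Big[ \mathbb{E}[\, v - m \mid x_t, t \,]^\top (m - f) \Big] = 0 .
\end{aligned}
```

The first term does not depend on f and the cross term vanishes, so the loss is smallest exactly when f equals m. The first term is the part of the loss that no predictor can remove: the spread of the targets around their own conditional mean.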

This is fine when only one kind of trajectory passes through any given (x_t, t). At early times near the noise side of the interpolation, x_t is almost pure noise and almost every data point has some probability of routing through it, but the averaged direction is still coherent: x_t carries almost no information about its endpoint, so the optimal velocity simply points from x_t toward the mean of the data. At late times near the data side, each x_t is close to a single data point and only one direction matters. The averaging problem lives in the middle.

In the middle, x_t can sit near the intersection of paths from genuinely different modes. A blob that is halfway between noise and a 3 is also halfway between noise and an 8 — and the trajectories that produce these two end states pass through neighborhoods that overlap heavily. At a point where both routes are plausible, the optimal predictor is the mean of two directions: one toward the 3, one toward the 8. The mean does not point at a 3. It does not point at an 8. It points at a region of velocity space that no real trajectory uses.

This is the averaging tax. It is not a bug in the loss. It is the loss working correctly. The squared-error objective faithfully delivers the conditional mean of the target, and at multimodal points the conditional mean is a compromise. The model is not confused. It is giving the mathematically optimal answer to a question that has, embedded in it, a contradiction the question doesn't acknowledge.

The visible symptoms of the averaging tax are familiar to anyone who has trained a diffusion or flow model. Sampling produces blurry intermediates that resolve slowly toward sharpness. Mode coverage suffers — generated samples cluster near the centers of high-density regions and avoid the spaces between modes. Long sampling chains help, because they let the velocity field's small biases compound into something close to a trajectory, but they don't fix the underlying issue. The velocity field at multimodal points is, by construction, not a trajectory.

The cleanest way to see the tax in isolation is to construct a deliberately pathological case. Take a two-mode dataset in two dimensions: one Gaussian at (1, 1), one at (−1, −1), both with small variance. Run the standard flow-matching procedure: interpolate each data point linearly with a sample of pure noise, and train a network to predict the velocity at every intermediate point along the interpolation. At the origin at time t = 0.5, two families of trajectories pass through: one heading northeast toward (1, 1), one heading southwest toward (−1, −1). By symmetry they are equally likely, so the optimal velocity field at the origin, by the conditional-mean formula, is exactly the zero vector. The trained network, after enough updates, will produce velocities arbitrarily close to zero at that point. A sample sitting at the origin gets no push toward either mode, and samples that pass near it during inference crawl until they drift far enough to one side for the field to commit. The model has learned the correct answer to the wrong question, and the wrong question leaves the sampler without a direction at exactly that point.
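For this toy the optimal field can be written in closed form, which makes the cancellation easy to check numerically. The sketch below assumes standard normal noise, isotropic modes with standard deviation 0.1 (one reading of "small variance"), equal mode weights, and the linear interpolation x_t = (1 − t)·x0 + t·x1; the function names are illustrative, not taken from any library.

```python
import numpy as np

# Assumptions for illustration: standard normal noise, isotropic modes with
# standard deviation 0.1, equal mode weights, x_t = (1 - t) * x0 + t * x1.
MODES = np.array([[1.0, 1.0], [-1.0, -1.0]])
DATA_STD = 0.1


def mode_velocity(x, t, mode):
    """E[v | x_t = x, t, mode] for v = x1 - x0, via Gaussian conditioning."""
    var_t = (1 - t) ** 2 + (t * DATA_STD) ** 2     # variance of x_t within one mode
    gain = (t * DATA_STD ** 2 - (1 - t)) / var_t   # Cov(v, x_t) / Var(x_t)
    return mode + gain * (x - t * mode)


def mode_posterior(x, t):
    """P(mode | x_t = x, t) under equal prior weights."""
    var_t = (1 - t) ** 2 + (t * DATA_STD) ** 2
    log_w = -np.sum((x - t * MODES) ** 2, axis=1) / (2 * var_t)
    w = np.exp(log_w - log_w.max())                # subtract max for stability
    return w / w.sum()


def marginal_velocity(x, t):
    """E[v | x_t = x, t]: per-mode velocities averaged by posterior weight."""
    per_mode = np.stack([mode_velocity(x, t, m) for m in MODES])
    return mode_posterior(x, t) @ per_mode


origin = np.zeros(2)
v_origin = marginal_velocity(origin, 0.5)           # the two modes cancel
v_northeast = mode_velocity(origin, 0.5, MODES[0])  # what mode (1, 1) alone wants
v_nearby = marginal_velocity(np.array([0.05, 0.05]), 0.5)

print("marginal at origin:", v_origin)
print("mode (1, 1) alone: ", v_northeast)
print("marginal nearby:   ", v_nearby)
```

Under these assumptions each mode on its own wants a velocity of roughly (2, 2) or (−2, −2) at the origin, while their equally weighted average is zero. A point slightly off the origin gets a nonzero push toward the nearer mode, but one much weaker than either mode would ask for by itself.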

Why scale doesn't fix it

A reasonable first response to the averaging tax is to assume it gets paid down by scale. More parameters in the network, more samples during training, longer noise schedules — surely something in the regime where everything gets larger eventually washes the conflict out.

It does not.

The conditional mean is a property of the target, not the model. No matter how expressive the function class, the loss is minimized when the function output equals E[v | x_t, t]. Adding parameters does not change what that expectation is. It only changes how close the model can get to it. An infinite-capacity network trained on infinite data converges to the exact averaging behavior we see in undertrained small models — just more precisely.
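A small numerical illustration of the same point, using polynomial degree as a stand-in for network capacity. This is a sketch under invented assumptions: one-dimensional inputs, targets that are +1 or −1 with equal probability regardless of the input, and a least-squares Chebyshev fit in place of gradient training.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1.0, 1.0, size=n)

# Targets are +1 or -1 with equal probability, independent of x, so the
# conditional mean E[y | x] is 0 everywhere even though no target is near 0.
y = rng.choice([-1.0, 1.0], size=n) + 0.05 * rng.standard_normal(n)

grid = np.linspace(-0.9, 0.9, 50)
max_abs_prediction = {}
for degree in (1, 4, 12):  # degree stands in for model capacity
    fit = np.polynomial.Chebyshev.fit(x, y, degree)  # least-squares fit
    max_abs_prediction[degree] = float(np.max(np.abs(fit(grid))))

for degree, value in max_abs_prediction.items():
    print(f"degree {degree:2d}: largest |prediction| on grid = {value:.3f}")
```

Every degree lands near zero, the conditional mean, rather than near either target value. More capacity changes how closely the fit tracks that mean, not which function it is tracking.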

More data has the same character. The averaging tax is paid at multimodal intermediate points, and more data adds more multimodal intermediate points. The trajectories from a thousand classes of training images cross more often than the trajectories from ten classes, not less. The denser the data distribution, the more pervasive the intersections. ImageNet-scale generation is more averaging-prone than MNIST-scale generation, not less; conditioning on the class is what saves it.

Architectural changes (U-Net to transformer in the DiT line of work, pixel space to latent space in the Stable Diffusion family, x-prediction to v-prediction to flow-matching parameterizations across the entire field) change practical training dynamics but not the limit. They affect inductive biases, sample efficiency, the shape of training error as a function of compute. They don't change what the network converges to. Every diffusion or flow architecture trained with a squared-error objective converges, given enough capacity and enough data, to the conditional mean of its regression target: E[v | x_t, t] in the velocity parameterization. That expectation is what it is regardless of how you parameterize the network.

The way you can tell this isn't a model problem is that the averaging tax shows up most cleanly in the simplest models, in the most analytically tractable settings. The two-Gaussian toy above is the demonstration in pure form: you can write the optimal velocity field analytically, you can train any network of any size against the targets, and the learned field at the midpoint will still come out as zero velocity, leaving the sampler without a direction there. There is no model fix for this. The fix is structural: you need to condition the predictor on additional information that distinguishes the two modes.

The implication is sharper than the field's usual framing suggests. When the descendants of the original DDPM paper moved from unconditional to class-conditional generation and observed enormous quality jumps, they were not unlocking more capacity in their models. They were exempting their models from paying the averaging tax on the dimensions they could now condition out. The class-conditional ImageNet diffusion models were not larger than their unconditional siblings, nor trained differently. They were trained on the same data with the same architecture, but with c in the conditioning channel. The architectural difference was tiny: a few projection layers and embeddings. The quality difference was enormous. That gap is the averaging tax made visible. The unconditional model wasn't worse because it was less capable. It was worse because it was paying a tax the conditional model didn't have to pay.

What conditioning actually does

Class conditioning changes the predictor's input from (x_t, t) to (x_t, t, c). The optimal predictor changes accordingly: from E[v | x_t, t] to E[v | x_t, t, c]. The function family is the same. The training procedure is the same. What has changed is the set of trajectories the expectation averages over.

This is the substance of the entire technique. Conditioning narrows the set. At the multimodal intermediate point where the unconditional model averages over trajectories heading to a 3 and trajectories heading to an 8, the class-conditional model averages over only one. The "average over many" becomes "average over one class." For a well-separated class structure, the conditional set is often near-unimodal — every trajectory in it heads roughly the same direction. The conditional mean is a real direction.
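The narrowing can be seen directly in sampled training pairs, without training anything. The sketch below reuses the two-Gaussian toy under assumed settings (mode standard deviation 0.1, standard normal noise, t = 0.5, a radius of 0.1 to define "near the origin") and compares the average target velocity with and without grouping by class label.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, RADIUS = 400_000, 0.5, 0.1                  # assumed settings
MODES = np.array([[1.0, 1.0], [-1.0, -1.0]])      # class 0 and class 1

labels = rng.integers(0, 2, size=N)               # class label c per pair
x1 = MODES[labels] + 0.1 * rng.standard_normal((N, 2))   # data samples
x0 = rng.standard_normal((N, 2))                  # noise samples
xt = (1 - T) * x0 + T * x1                        # interpolated points at time T
v = x1 - x0                                       # regression targets

# Keep only the training pairs whose intermediate point lands near the origin.
near = np.linalg.norm(xt, axis=1) < RADIUS

unconditional_mean = v[near].mean(axis=0)         # estimates E[v | x_t ~ 0, t]
conditional_means = {
    c: v[near & (labels == c)].mean(axis=0)       # estimates E[v | x_t ~ 0, t, c]
    for c in (0, 1)
}

print("pairs near origin:", int(near.sum()))
print("unconditional mean:", unconditional_mean)
print("class 0 mean:      ", conditional_means[0])
print("class 1 mean:      ", conditional_means[1])
```

The unconditional average sits near zero, while each class-conditional average points firmly toward its own mode. Same samples, same targets; only the set being averaged over differs.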

The model has not learned anything new. The capacity is unchanged. The data is unchanged. The architectural footprint of the conditioning is tiny — a few embeddings, a few projection layers, maybe some FiLM modulation, or in the modern style a few additional tokens in a DiT's attention stack. The conditioning channel is a side input, not a structural enlargement.

But the regression problem the model is now solving is materially different. At every (x_t, t, c) triple, the target is sharper. The variance across targets at that input is lower. The squared-error loss can drive the predictor closer to a useful direction at every step of training. The gradients carry more signal because the signal isn't averaged into mush before the loss sees it: at every point in training, the gradient of the loss with respect to the model's parameters is more informative about what those parameters should be.
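The drop in target variance can be made precise with the law of total variance. This is a sketch rather than a statement about any particular model; z is shorthand introduced here for the pair (x_t, t).

```latex
\begin{aligned}
\underbrace{\mathbb{E}\big[\lVert v - \mathbb{E}[v \mid z] \rVert^2 \,\big|\, z\big]}_{\text{target spread without } c}
&=
\underbrace{\mathbb{E}\big[\lVert v - \mathbb{E}[v \mid z, c] \rVert^2 \,\big|\, z\big]}_{\text{target spread with } c}
+
\underbrace{\mathbb{E}\big[\lVert \mathbb{E}[v \mid z, c] - \mathbb{E}[v \mid z] \rVert^2 \,\big|\, z\big]}_{\text{disagreement between classes}} ,
\qquad z = (x_t, t).
\end{aligned}
```

The last term is non-negative, so averaged over c the conditional target is never noisier than the unconditional one. The reduction is exactly the amount by which the class-wise mean velocities disagree at z, which is largest at the multimodal intermediate points.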

This is why class-conditional models converge faster, generate sharper samples, achieve better FID at the same parameter count, and tolerate shorter sampling schedules. None of this is a different model behaving smarter. It is the same model facing a less contradictory target. The faster convergence is not better optimization; it is optimization of a better-posed problem. The sharper samples are not better generation; they are generation that is no longer averaged across modes the conditioning has removed. The shorter sampling chains work because the velocity field is more trajectory-like at every point, so the integrator can take larger steps without accumulating averaging-induced drift.

I want to dwell on this because the framing matters. There is a habit, in describing class conditioning, of treating it as a "control" mechanism — as if the model would otherwise generate fine but the user wants to ask for a specific class. The "guidance" literature reinforces this framing: classifier-free guidance is presented as a steering technique, a way to push a sample toward stronger class adherence at sampling time. All of this is true at the level of user-facing functionality. None of it captures what conditioning does to the training problem.

Conditioning is not a control knob bolted onto a working system. It is the math fix that makes the training problem well-posed at points where, without it, the optimal answer is a compromise that no real trajectory honors. Removing the conditioning channel doesn't strip a feature off a working model. It re-introduces a contradiction that the model then has to absorb into its weights as best it can — usually by producing the blurred, mode-averaged outputs that conditional models don't.

The "you don't add capability, you remove a contradiction" framing has the right thermodynamic shape. A model paying the averaging tax is burning capacity to fit an inconsistent target. Conditioning doesn't give the model more capacity. It un-burns the capacity that the contradiction was consuming. Same network, same compute, less waste.

This is also why the field's recent obsession with conditioning quality — better text encoders for text-to-image models, richer caption synthesis, structured conditioning via segmentation masks or depth maps, multi-stage conditioning in cascaded diffusion — produces gains that look disproportionate to the architectural changes involved. Each improvement to the conditioning channel is, at some multimodal intermediate point that the model used to average over, a partition of that point into separable conditional sub-sets. The total averaging burden on the network drops. The downstream quality rises. The base model didn't change. The conditioning got better at carving the problem.

The pattern generalizes

The framing I've just developed for class conditioning generalizes to most of the techniques that have produced the visible gains in generative AI over the last three years. None of them are capability additions. All of them are conditioning-set narrowings — different mechanisms for shrinking the set over which the model is implicitly averaging at the points where the averaging tax bites.

Start with RLHF. The standard description has it teaching the model to be helpful and harmless, to follow instructions, to align with human preferences. But the underlying mechanism, viewed through the conditioning lens, is more specific. A pre-trained language model is solving an averaging problem at every token: across the implicit distribution of continuations that could follow the current context, predict the next token. The continuations across that distribution are wildly heterogeneous. They include helpful answers and dismissive deflections, careful reasoning and confabulation, polite refusals and offensive runs. The cross-entropy loss at each token is minimized by an average of its own kind: a token distribution that mixes the next-token distributions of every continuation type, weighted by how often each type follows the context. The model isn't being "unhelpful" at a point where it produces unhelpful output. It is delivering the token-space analogue of the conditional mean: a mixture over a contradictory set of continuations.
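The token-space version of the argument can be checked on a tiny example. The sketch below uses an invented three-token vocabulary and two hypothetical continuation types with made-up frequencies; it shows that the cross-entropy-optimal prediction for the unconditioned context is the mixture, and that knowing the type lowers the achievable loss.

```python
import numpy as np


def cross_entropy(p, q):
    """Expected negative log-likelihood of predictor q when tokens follow p."""
    return float(-(p * np.log(q)).sum())


# Hypothetical next-token distributions over a 3-token vocabulary.
helpful = np.array([0.90, 0.05, 0.05])
dismissive = np.array([0.05, 0.90, 0.05])
type_weights = np.array([0.6, 0.4])  # assumed frequency of each continuation type

# What the training data looks like when the type is not in the context.
marginal = type_weights[0] * helpful + type_weights[1] * dismissive

loss_unconditioned = {
    "predict marginal": cross_entropy(marginal, marginal),
    "predict helpful": cross_entropy(marginal, helpful),
    "predict dismissive": cross_entropy(marginal, dismissive),
}

# Best achievable loss when the continuation type is part of the context.
loss_conditioned = (
    type_weights[0] * cross_entropy(helpful, helpful)
    + type_weights[1] * cross_entropy(dismissive, dismissive)
)

for name, value in loss_unconditioned.items():
    print(f"{name:20s} {value:.3f}")
print(f"{'conditioned on type':20s} {loss_conditioned:.3f}")
```

Without the type in the context, predicting the mixture beats predicting either pure behavior, so the mixture is what training rewards. With the type available, the best achievable loss drops further, because each conditional distribution is sharper than their blend.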

RLHF doesn't make the model smarter. It narrows the conditioning set. After RLHF, the implicit context the model is conditioning on at each token has been pushed toward "continuations that an instruction-tuned, preference-aligned model would generate." The conditional set has been re-shaped to exclude the deflections, the confabulations, the offensive runs. The model's outputs sharpen — not because it has new capabilities, but because the target distribution at every token has fewer modes competing for the loss's attention. The "alignment" framing buries this. The framing makes RLHF sound like value installation, when mechanistically it is closer to conditioning-set surgery.

The same logic applies to scaffolding. An agent scaffold — a system prompt, a tool registry, a memory, a planner, a checker — does not give the underlying model new capability. It conditions the model's output distribution at every step. ReAct-style scaffolding doesn't let the model do anything the model couldn't do; it just narrows what the model is implicitly averaging over when generating each token. The conditioning is now "the next action given the scratchpad and the prior trajectory and the tool outputs and the goal" instead of "the next token given the context." The set being averaged over is smaller. The result is a less averaged answer. The agentic frameworks producing the best demos right now — Claude Code, Cursor's agent mode, Devin, the various coding-agent harnesses being shipped weekly — are not running on dramatically better base models than the ones available a year ago. They are running on dramatically better conditioning architectures.

Prompt engineering, viewed this way, becomes legible as exactly what it is: a hand-tuning of the implicit conditioning set. A specific prompt — long, structured, with examples and constraints — narrows the conditional distribution the model is averaging over more than a short, generic prompt does. "Write a function that takes a list and returns the maximum" is a wider conditioning set than "Write a Python function in pure functional style with a type hint and a docstring that takes a list of integers and returns the maximum." The longer prompt isn't a richer instruction. It is a smaller conditioning set. The averaging tax at every token is lower. Few-shot prompting is the same move with a different mechanism: each example is a conditioning narrowing.

Chain-of-thought reasoning has the same structure. By forcing the model to emit intermediate tokens that condition subsequent ones, CoT decomposes a one-shot averaging problem into a sequence of narrower averaging problems. At each step, the conditioning set is tighter than it would have been for the final answer alone. The mathematical structure of the conditional expectation makes this transparent: the chain doesn't unlock latent reasoning capacity; it factors the joint distribution into a chain where each conditional has fewer modes competing for the cross-entropy. The o1 and o3 family's "thinking tokens" are an industrialization of this insight — paid-for, learned-via-RL, conditioning-set-narrowing tokens.
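One way to state the factoring formally, as a sketch: write q for the question, z for the intermediate reasoning tokens, and a for the final answer (names introduced here for illustration), with H denoting Shannon entropy and I mutual information.

```latex
\begin{aligned}
p(a \mid q) &= \sum_{z} p(z \mid q)\, p(a \mid q, z), \\
H(A \mid Q) &= H(A \mid Q, Z) + I(A ; Z \mid Q) \;\ge\; H(A \mid Q, Z).
\end{aligned}
```

On average, the answer distribution conditioned on the chain is no more spread out than the one-shot answer distribution, and the gap is the information the chain carries about the answer. The inequality says nothing about how hard z itself is to generate; that cost is moved into the chain rather than removed.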

The same view explains why retrieval-augmented generation works as well as it does despite producing only a modest improvement in "capability" by traditional benchmarks. RAG does not teach the model anything. It augments the conditioning channel with retrieved documents. At every token, the implicit average is now over continuations consistent with the retrieved evidence, not over the much wider set of continuations consistent with the bare query. The averaging tax drops. Hallucinations drop. The model's outputs sharpen — not because it has new knowledge, but because the conditioning set is smaller.

Persona conditioning. Style conditioning. Tool-restricted generation. Constrained decoding. Each of these is a different mechanism for shrinking the set the model averages over. Each delivers gains that look like capability but are actually contradiction-removals.

The pattern deserves one name across every domain it appears in: the conditioning fix. It is the same class of intervention, expressed in a different substrate each time. Flow models pay the averaging tax in pixel space; language models pay it in token space; agents pay it in action space; multi-agent systems pay it across role boundaries. The fix in every case is structurally the same: narrow the conditioning set, reduce the averaging tax, watch the system sharpen without anyone having touched the underlying capacity. The intervention's surface form varies; the underlying mathematical move does not.

Where the gains live now

The agent-native landscape is in the middle of a quiet transition that this framing makes legible. Raw model capability — measured by parameter count, training compute, or benchmark scores on the canonical suite — is converging. The frontier labs are clustering. The gap between the best open-weight models (Llama, Mistral, Qwen, DeepSeek) and the best closed models is closing. The rate of new capabilities-per-quarter is no longer where the interesting curve is.

This is one of the eleven meta-patterns the Map has named: Model Convergence Pressure. The base layer is flattening. The interesting variance is moving up the stack.

What I've been describing in this piece is the mathematical explanation for why it's moving up the stack. The base model is paying an averaging tax on everything it can't condition out. As base models approach the limits of what unconditional training can deliver — and there is good reason to believe several major labs are within an order of magnitude of those limits on the dimensions that matter — the marginal returns on more parameters and more data collapse. The marginal returns on better conditioning structure do not. The ceiling on a base model is the optimal conditional mean given its training conditioning. The ceiling on a conditioning architecture is wherever the next narrowing of the conditional set lives.

Concretely, the gains the frontier labs are now extracting are conditioning-layer gains. The shift in OpenAI's product trajectory from "release a more capable base model" to "release a better-scaffolded reasoning model" reflects this. The o1 and o3 models are not larger models in the way GPT-4 was larger than GPT-3. They are differently conditioned models: conditioned on multi-step chains of reasoning during both training and inference, with the conditioning structure designed to reduce the averaging tax on the reasoning problem specifically. The benchmark deltas are real. The parameter deltas are modest. The change lives in the conditioning structure.

Anthropic's investments in tool use, computer use, agentic harnesses — these are conditioning architecture investments. The model behind them is incrementally better than its predecessor. The harness around it is qualitatively new. The user-facing capability — Claude that can drive a browser, that can write and run code in the same session, that can chain calls to dozens of tools — is mostly the product of harness, not weights. The Claude Code product, in particular, is a conditioning-architecture argument: take a strong base model, surround it with a scaffolding stack that narrows the conditioning set at every step of a coding task, and the user experience is qualitatively different from running the same model behind a bare chat interface. Same model, different conditioning, different category of product.

DeepMind's agent-native projects, Cohere's enterprise scaffolding, Meta's agentic SDK work — same pattern. Even Google's Gemini lineup, which one would expect to lean hardest on raw scale, is increasingly leaning into "configured" modes: Deep Research, Code Assist, Workspace integration. Each is a conditioning architecture wrapped around a base model. The base model is largely the same across these surfaces. The behavior is not, because the conditioning is not.

The market consequence of this is that the strategic moat for agent-native companies is no longer "having a slightly better base model" — that moat is shrinking with every new release. The moat is the conditioning architecture: which scaffolding, which memory, which retrieval, which agentic loop, which RLHF rubric, which persona, which tool wiring. The variance across products at the same model capability is now huge, and it lives in the conditioning, not the model. Compare two coding agents built on the same underlying foundation model. The difference in their utility is not the model. It is the conditioning architecture — the prompts, the tool registry, the memory primitives, the error-recovery harness, the verification loop, the long-context strategy. All of that is conditioning. All of it is averaging-tax reduction.

The Model Convergence Pressure pattern compresses this into one phrase: gains shift from model capability to topology, rubrics, memory, scaffolding. The mathematical structure of the averaging tax explains why. The base model isn't broken — but it is, at every point where its conditioning is too wide, paying a tax it can't escape with more weights. The conditioning is the only side the tax can be reduced from.

This has consequences for who wins. The largest labs have the best models. The labs and operators with the best conditioning architectures have the best products. These are not the same set, and the second set has been getting more interesting than the first set over the last twelve months. Anthropic's Claude Code, OpenAI's Operator, the agentic harnesses that companies like Cognition, Adept, Sierra, and Cursor are building — these are conditioning-architecture products. Their value is not the raw model. It is the architecture around the model that narrows what the model is averaging over at each step. The market is starting to price this in. The recent valuation pattern — agentic products at multiples that would be hard to justify on a "thin wrapper on a base model" thesis — is the market recognizing that the wrapper is the product.

For builders, this means the strategic question is not "which model" but "what conditioning structure." For investors, this means the durable companies are the ones with proprietary conditioning architectures, not the ones with API access to whichever base model. For regulators, this means the agent identity questions, the autonomy levels, the audit trails — these all need to be specified in terms of the conditioning structure, not the underlying weights, because the weights are increasingly fungible and the structure is what determines behavior.

The averaging tax is invisible until you frame it. Once framed, it explains an enormous amount of what is happening in the field.

What to watch for

The piece's claim is that conditioning architecture is now the binding constraint on agent-native performance, and that this is true for the same mathematical reason class conditioning is load-bearing for flow models. Three forward-looking implications follow.

First, expect a wave of conditioning primitives that don't yet have settled names. Memory schemas designed to keep the agent's effective conditioning set narrow without explicit retrieval. Trajectory-consistency conditions that enforce coherence across long agentic chains. Workflow priors that condition the model on the expected shape of the next several steps. Verification harnesses that condition each step on the validity of the previous step. Persona schemas that condition the model on a stable identity across long sessions. These will look, at first, like minor agentic-system features. They are the same kind of intervention class as RLHF and chain-of-thought — interventions that reduce the averaging tax on a previously-unconditioned dimension. The agentic frameworks that ship clean primitives for these will compound advantages over the ones that don't. LangGraph, DSPy, Inspect, the various structured-generation libraries — these are early instances of the pattern; the mature instances haven't shipped yet.

Second, expect the discipline of "fine-tuning" to be subsumed into the discipline of conditioning architecture design. Fine-tuning is one method, among many, for narrowing the conditioning set. It is not the most expressive method, it is not the most flexible, and it ties the conditioning permanently to weights. For most agent-native applications, the binding intervention is not "fine-tune the model on this data" but "construct a conditioning architecture that elicits the desired behavior at inference time." We are watching the center of gravity move from fine-tuning labs to conditioning-architecture teams. The job title "prompt engineer" is the embryonic form of this; the mature form will have a different name — conditioning architect is my candidate, though I expect the field to land on something else — and a much larger surface area. It will include scaffolding design, retrieval architecture, memory schema design, agentic harness construction, persona engineering, and the verification-loop design that the Verification Renaissance meta-pattern named.

Third, expect the regulatory framing of agent systems to discover the conditioning architecture as the load-bearing object. Right now, AI regulation tends to focus on model capability — what the model can do, what training data it saw, what the parameter count is, where the compute came from. This will age badly. Two systems with the same base model and radically different conditioning architectures will exhibit radically different behavior, and the regulators that haven't internalized this will keep writing rules for the wrong layer. The agent identity work — bearer tokens, AP2, x402, the work I argued elsewhere is the protocol-level frontier of 2026 — is one early instance of regulation reaching the right layer. The conditioning architecture is the next layer the regulators need to reach. The insurance underwriters will get there first; they always do, because they pay for the failure modes directly. The Munich Re, Lloyd's, and Allianz facilities pricing agent identity hygiene today will be pricing conditioning architecture hygiene within eighteen months — measuring the narrowness and verifiability of the conditioning structure as a coverage variable, the same way they currently measure identity primitives.

The averaging tax framing is, in this sense, a frame I expect to become operationally useful — not just a clever way to read flow matching papers, but a way to look at any generative system and ask: where is this system averaging? What is in the conditioning set that shouldn't be? What is outside the conditioning set that should be? The answers to those questions will tell you where the next gain is hiding. For a trading agent, the questions become: which market regimes is it averaging across that should be conditioned out, which counterparty identities, which time horizons. For a coding agent: which codebases, which language conventions, which deployment contexts. For a customer-service agent: which prior conversations, which user histories, which product variants. Every productive agentic application has a list of dimensions the underlying model is currently averaging over and that the conditioning architecture has not yet split apart. Each of those splits is a future gain.

The argument behind class-conditional flow matching (that you don't fix a multimodal averaging problem by making the network bigger, you fix it by making the conditioning set smaller) is the same argument that is going to win the agent-native decade. The frontier labs are gradually internalizing it; you can see the shift in their public roadmaps and their hiring patterns. The market is starting to price it; you can see the shift in which kinds of companies are getting funded at which multiples. The regulators are nowhere close, but the standards bodies and the underwriters are. The agentic stack is being rebuilt around the recognition that the model isn't where the bottleneck lives. The conditioning architecture is.

Class conditioning is the simplest possible instance of the pattern. A label, an embedding, a conditioning channel. One small piece of information that changes which set the loss is averaging over and, by changing the set, changes everything downstream: convergence, sharpness, mode coverage, sample quality. The agent-native stack is full of dimensions on which the same move is available, and most of them haven't been named yet. The averaging tax framing is a way to find them. The conditioning architecture is the layer where they get implemented. The labs and operators who name them first will own the next phase. The ones who keep trying to win the model-capability race will discover, slowly and then suddenly, that the race ended without an announcement.

The model isn't where the bottleneck lives. The conditioning architecture is.