How Subquadratic Won by Giving Up on Replacing Transformers
Subquadratic architectures won by surrendering. They stopped trying to be transformers and became the substrate transformers run on top of — in a 3:1 ratio that is starting to look uncannily empirical.
The dominant story about subquadratic architectures for the last five years has been that they were coming for the transformer. Mamba was going to do to softmax attention what softmax did to the LSTM. Linear attention would erase the n² wall. State-space models, RWKV, RetNet — somewhere on the Pareto frontier was the heir apparent, and the only question was which lab would crown it.
That story was wrong. Not because the architectures don't work — they do, better than they have any right to, and the gap to softmax on standard benchmarks is now measured in fractions of a point. The story was wrong because it framed the contest as replacement. The actual outcome is something stranger and more interesting: subquadratic architectures won by surrendering. They stopped trying to be transformers. They became the substrate that transformers run on top of, in a ratio that is starting to look uncannily empirical.
The ratio is 3:1. Three linear-time layers for every one full-attention layer, give or take. Kimi Linear lands there. Qwen3-Next lands there. Tsinghua's HALO/HypeNet lands there. ByteDance's systematic study from September 2025 explicitly recommends a ratio between 3:1 and 6:1 across the architectures it evaluates. This is not a coordinated decision — these labs are competing — and it is not a matter of architectural taste. It is the answer to an empirical question nobody asked clearly until 2025: what fraction of the tokens an agent emits actually needs full softmax recall?
The giveaway is Mamba-3, published March 16, 2026 by Albert Gu, Tri Dao, and seven collaborators at CMU and Cartesia. Read the abstract carefully. It does not claim to beat transformers on quality. It claims to push the performance-efficiency Pareto frontier. The first sentence of its motivation is about test-time compute scaling, not about modeling. The architects of the most-cited subquadratic family explicitly call their newest model "inference-first." That is what surrender sounds like in academic register — and it is what victory looks like in a market increasingly governed by inference cost.
The five-year project to dethrone the transformer was the wrong project. The right project, which fell out of the wreckage of the first one, is to build the inference tier of the agent economy. The hybrid 3:1 architecture is that tier. The math underneath it is finally honest about what it is. And the labs that figure out the next move — per-token routing instead of per-layer routing — will own the substrate that every agent in 2027 runs on.
The five-year wrong question
The taxonomy of subquadratic alternatives is long enough to be its own subfield. Linear attention with kernel feature maps (Katharopoulos, 2020). Performers with random Fourier features (Choromanski, 2020). Linformer's low-rank projection of the key-value matrices (Wang, 2020). Reformer and its locality-sensitive hashes (Kitaev, 2020). Longformer and BigBird with their sliding-window-plus-global tricks. Sparse Transformers with O(n√n) factorizations (Child, 2019). S4 and the structured state-space line (Gu, 2022). RWKV. RetNet. Hyena. Mamba (Gu and Dao, December 2023). Mamba-2 with the structured state-space duality (Dao and Gu, May 2024). DeltaNet. Gated DeltaNet. xLSTM. Kimi Delta Attention. The list is closer to fifty entries than to ten.
For five years, almost every one of them lost the same way. They were trained on the same datasets transformers were trained on. They were benchmarked on the same tasks transformers had been optimized for — chat, completion, recall-heavy QA over fixed contexts. And in those settings, they performed worse by exactly the margin that matters: small enough to be defensible in a paper, large enough to be career-ending in a frontier lab. A 2.3% gap on HellaSwag is publishable. It is also a no-go for the team picking what to ship.
This was not, as the literature often framed it, a "capability gap" inherent to fixed-state recurrence. It was a selection effect. The benchmarks measured what transformers were good at, because transformers had defined the benchmarks. Recall-intensive QA over a fixed context — the canonical needle-in-a-haystack test — is precisely the workload softmax attention was designed for. Asking a fixed-state recurrent model to compete on that test is like asking a stream processor to win a database benchmark. It is not the wrong design; it is the wrong test.
The first serious paper to articulate this in a generalizable way was Han et al. (2024), which proved that linear attention is not injective — different queries can produce identical attention weights, causing specific failure modes on recall-intensive tasks. That is a real limitation. But the framing in 2020 through 2024 was that this limitation made linear attention fundamentally weaker. The framing that emerged in 2025 was different and sharper: it makes linear attention fundamentally different, and the question of which kind of compute to use depends on what the workload actually requires.
Jiaoda Li and Ryan Cotterell published a paper on May 1, 2026 that made this concrete in formal terms. They characterize the expressivity of local attention and show that fixed-precision transformers with global attention correspond to a specific fragment of linear temporal logic — one with a single past operator. Local attention corresponds to a different fragment of the same logic, one with a second temporal operator. The two are not weaker and stronger; they are different. And the punchline of the paper, validated on formal language recognition and natural language modeling, is that hybrid global-local transformers outperform their global-only counterparts. The math finally caught up to what every lab building production models had already discovered empirically by mid-2025.
The five-year wrong question was "can subquadratic beat softmax?" The right question, which the field stumbled into rather than asked, is "what fraction of tokens does softmax need to handle, and what fraction can be handled by something cheaper?" The first question has no clean answer. The second has an answer that converges across independent labs to roughly one in four.
The 3:1 ratio is a measurement
Kimi Linear, released by Moonshot AI in October 2025, interleaves Kimi Delta Attention with full attention layers in a uniform 3:1 ratio. The paper reports that this hybrid structure reduces memory and KV-cache usage by up to 75% during long-sequence generation while preserving global information flow through the full-attention layers. Note the language: preserving, not approximating. The full-attention layers are still doing the work softmax was always good at — and the linear layers are doing the work that didn't need softmax in the first place.
Qwen3-Next, released by Alibaba in late 2025, uses Gated DeltaNet plus full attention in approximately the same ratio. The architecture builds on Mamba-2-style gating combined with DeltaNet's rank-one corrective updates, and the full-attention layers use Multi-Head Latent Attention to compress the KV space. ByteDance's "Hybrid Linear Attention Done Right" paper (Chen, Thai, Zhou, Zhang, Shen, Wang, Xiao, Han, Liu — January 2026, arXiv:2601.22156) systematically evaluates hybrid ratios across architectures and concludes that the sweet spot is full attention in 25% of layers: exactly the 3:1 ratio. The ByteDance team's earlier systematic study from September 2025 across HGRN-2 and Gated DeltaNet recommends a ratio between 3:1 and 6:1 to achieve transformer-level recall efficiently.
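To fix intuition about what a uniform 3:1 interleave looks like, and why it implies roughly the 75% reduction in growing KV cache that Kimi Linear reports, here is a minimal sketch. The function names and the uniform-interleave policy are mine, not any lab's published configuration:

```python
# Minimal sketch of a uniform 3:1 hybrid layer stack. Illustrative only;
# function names and the interleave policy are hypothetical.

def hybrid_layer_pattern(n_layers: int, full_every: int = 4) -> list[str]:
    """Every `full_every`-th layer is full softmax attention; the rest
    are linear-time layers (a gated-delta variant, an SSM, etc.)."""
    return [
        "full" if (i + 1) % full_every == 0 else "linear"
        for i in range(n_layers)
    ]

def growing_kv_fraction(pattern: list[str]) -> float:
    """Only full-attention layers keep a KV cache that grows with sequence
    length; linear layers carry a fixed-size state instead."""
    return pattern.count("full") / len(pattern)

pattern = hybrid_layer_pattern(48)
print(pattern[:4])                   # ['linear', 'linear', 'linear', 'full']
print(growing_kv_fraction(pattern))  # 0.25 -> ~75% less growing KV cache
```

The 75% figure falls straight out of the layer count: if only one layer in four keeps a sequence-length cache, the cache shrinks by three quarters before any further compression is applied.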
This is the kind of empirical convergence that means something. These labs are competing. They use different linear-attention substrates — KDA, Gated DeltaNet, HGRN-2. They use different full-attention variants — MLA, standard MHA, sliding-window attention. They use different training recipes, different datasets, different compute budgets. The thing they all converge on is the ratio. That convergence is not architectural. It is a measurement of the workload.
The workload they are all implicitly measuring is not chat. It is agentic. An agent emits orders of magnitude more tokens per task than a human chat user does — long planning traces, exhaustive tool-use scratchpads, dense reasoning chains, multi-step reflection loops, retrieved documents shipped through as context. Most of those tokens are intermediate — they exist to support a downstream computation, and they will be summarized, discarded, or compressed within a few hundred tokens. They do not need to be available to a recall query 200,000 tokens later. They need to flow through the substrate cheaply.
A fraction of those tokens do need to be available later. The user's original request. The constraint the agent extracted in step 1. The intermediate result the agent is now trying to reconcile with another intermediate result. The retrieved document the agent is currently citing. These tokens need full softmax recall — anywhere in the context, against any query, exactly, not approximately. An agent that gets the wrong line from the retrieved document is producing a hallucination. An agent that misremembers the original constraint is producing a regression.
The 3:1 ratio says: roughly three quarters of an agent's tokens are flow-state, and roughly one quarter need full softmax. Different agentic workloads will move that ratio — a code-review agent might run at 6:1 because most of its tokens are scaffolding around a few load-bearing recall points; a long-context QA agent might run at 2:1 because retrieval needs are denser per token. But across the workloads currently being shipped, the ratio is stable enough that independent labs land on it without coordinating.
The Ling Team's Ring-linear-2.0 release in October 2025 quantified what this buys. The hybrid 104B-parameter model, with 6.1B active parameters, delivers one tenth the inference cost of a 32B dense transformer baseline. Compared to its own dense predecessor in the Ring series, the cost dropped by more than half. That number is not a marginal efficiency improvement. It is the kind of order-of-magnitude shift that decides which architecture wins the next generation of agentic deployments — not because of capability, but because of unit economics.
The instinct of the 2020-2024 literature was that hybrid was a compromise — a way to give up some efficiency to recover some capability. The view emerging in 2025-2026 is the inverse. Hybrid is not a compromise. It is the correct shape for an agent workload, and the all-attention architecture was the actual compromise — a one-size-fits-all design that paid full softmax cost for every token regardless of whether the token needed it.
The math stopped pretending
For most of the post-Mamba era, the various subquadratic architectures looked like a zoo. Linear attention with various kernels. State-space models with various structured matrices. Recurrent variants with various gating schemes. Delta-rule updates with various retention mechanisms. Each paper introduced a slightly different recurrence and motivated it with a slightly different intuition. The field was rich in operators and poor in unification.
The unification arrived through a frame called Test-Time Regression. The crispest articulation is in Preconditioned DeltaNet (Lahoti and collaborators, April 2026, arXiv:2604.21100), which states it plainly: every modern recurrent operator can be interpreted as performing online least-squares regression — at test time, on the live KV stream — that learns a linear or nonlinear map from keys to values. The differences between operators reduce to three axes. First, the parameterization of the memory module — a single matrix, a structured matrix, a small MLP. Second, the loss the recurrence is implicitly optimizing — squared error, normalized squared error, regularized variants. Third, the optimizer doing the optimization — vanilla SGD, SGD with momentum-equivalent retention, SGD with weight-decay-equivalent decay, preconditioned SGD with curvature information.
That sentence is doing a lot of work. It says the entire zoo of subquadratic operators is a parameter space of online-learning algorithms, with the KV stream as the training data. Linear attention is plain SGD on a linear regression. DeltaNet adds a learning rate. Gated DeltaNet adds a data-dependent retention gate, the analog of weight decay. Mamba-2's selectivity is a data-dependent rescaling of the gradient. Titans, the Google paper from late 2024 that scaled to 2M-token context, makes the regressor a small neural network and updates it via what the authors call "surprise" — which is itself the gradient of the inner loss. TTT-MLP does the same with multiple optimization steps. Preconditioned DeltaNet adds second-order information — Newton's method instead of gradient descent for the inner optimization.
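Written out, the simplest cases of the correspondence are compact. What follows is a condensed sketch in standard notation, with normalizations, feature maps, and per-paper details omitted; S_t is the matrix-valued memory, k_t and v_t the current key and value:

```latex
% Inner recurrences as online-learning updates (details vary per paper).

% Linear attention: pure outer-product accumulation of key-value pairs.
S_t = S_{t-1} + v_t k_t^{\top}

% Delta rule (DeltaNet): one SGD step with step size \beta_t on the
% online regression loss  \ell_t(S) = \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^2.
S_t = S_{t-1} - \beta_t \,(S_{t-1} k_t - v_t)\, k_t^{\top}

% Gated variant: a data-dependent retention gate \alpha_t on the old state,
% the inner-optimizer analog of decay on the accumulated regressor.
S_t = \alpha_t S_{t-1} - \beta_t \,(\alpha_t S_{t-1} k_t - v_t)\, k_t^{\top}
```

Differentiating the loss with respect to S gives exactly (S k_t − v_t) k_t^⊤, so the delta rule is not SGD-like; it is SGD.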
This is not a metaphor. The equations line up exactly. And once they line up, two implications follow.
The first implication: architecture is no longer a discrete choice between Mamba and DeltaNet and linear attention. It is a continuous design space over which loss, which optimizer, which parameterization of the regressor. Future architectures will be points in this space, not new species. The Mamba-3 paper is consistent with this view. Its three contributions are a better discretization scheme — trapezoidal rather than Euler, a numerical-methods upgrade that moves from a first-order to a second-order approximation of the underlying continuous-time dynamics — a complex-valued state transition that bridges to data-dependent rotary embeddings, and a MIMO formulation that raises arithmetic intensity for inference. None of these is a fundamentally new operator. They are upgrades to an existing online-learning recipe.
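To see what the discretization upgrade buys, take the scalar case of the underlying ODE, h'(t) = a h(t) + b x(t). This is a generic numerical-analysis illustration; Mamba-3's actual parameterization may differ in detail, and input-indexing conventions vary across SSM papers:

```latex
% Euler vs. trapezoidal for  h'(t) = a\,h(t) + b\,x(t),  step size \Delta_t.

% Forward Euler: first-order accurate, uses only one endpoint.
h_t = (1 + \Delta_t a)\, h_{t-1} + \Delta_t b\, x_t

% Trapezoidal rule: second-order accurate, averages both endpoints.
h_t = \frac{1 + \Delta_t a / 2}{1 - \Delta_t a / 2}\, h_{t-1}
    + \frac{\Delta_t}{1 - \Delta_t a / 2} \cdot \frac{b\,x_{t-1} + b\,x_t}{2}
```

The recurrence stays linear in h_{t-1}, so it is still a one-pass scan with the same hardware profile; only the coefficients get more accurate.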
The complex-RoPE bridge in Mamba-3 deserves a second look because it is the kind of move that is going to repeat across the field. Mamba-2 restricted the state transition matrix A to real-valued scalars in order to align with matmul accelerators. Real scalars can only encode exponential decay of the state — useful for forgetting, useless for tracking. Complex eigenvalues encode rotation, which is what the model needs for state-tracking tasks like parity, modular arithmetic, and permutation composition over S5 — exactly the tasks Merrill, Petty, and Sabharwal proved practical SSMs collapse on under finite precision. Mamba-3's contribution is to recover complex dynamics without losing matmul-friendliness, by showing that data-dependent rotary embeddings — the same RoPE that has been standard in transformers since 2022 — can be derived from the complex-valued SSM formulation. The transformer's positional encoding and the SSM's state transition turn out to be the same thing, viewed from different angles. This is the kind of unification that happens when a field's separate research traditions converge on a common underlying mathematics. Expect more such bridges, particularly between the RoPE family and the gated-delta family. They are not coincidences; they are the same operator showing up in different notations.
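The identity underneath the bridge is elementary, which is part of why it is convincing. A schematic version, not the paper's exact derivation:

```latex
% Complex eigenvalues are rotations; RoPE applies the same rotation.

% A unit-modulus complex transition eigenvalue rotates the state each step:
\lambda = e^{i\theta}, \qquad \lambda^{n} = e^{i n \theta}

% On a 2D real state, the identical operation is the rotation matrix RoPE
% applies to each pair of feature dimensions at position n:
R(n\theta) = \begin{pmatrix} \cos n\theta & -\sin n\theta \\
                             \sin n\theta & \cos n\theta \end{pmatrix}

% Fixed \theta per dimension pair: standard, data-independent RoPE.
% Input-dependent \theta_t: a selective complex-valued state transition.
```

Decay-only transitions (real eigenvalues with modulus below one) can forget but cannot count; rotations can track position on a cycle, which is what parity and modular arithmetic require.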
The second implication: everything we know about training neural networks is reusable for designing the memory update. If the inner recurrence is online SGD, then momentum, regularization, learning rate schedules, second-order methods, Adam-style adaptive step sizes — every optimization idea the field has accumulated over forty years can be ported into the architecture. Preconditioned DeltaNet does this explicitly. Gated DeltaNet did it implicitly when it imported Mamba-style retention. The design space is enormous and almost untouched. Expect Adam-as-memory-update papers within the next two quarters. Expect cosine-schedule analogs. Expect dropout-equivalent stochasticity. The architecture research community has not internalized this yet, but it will, and the production architectures of 2027 will read more like an optimization paper than a neural-network paper.
There is a corollary worth stating directly. The question "what kind of recurrence does the model use" is converging to the question "what kind of online learner is the model running over its own context." That is a much sharper question than the one the field was asking in 2022, and it is the question that makes architecture continuous rather than discrete.
Meanwhile, the theoretical lower bounds have hardened in the opposite direction. Shreya Gupta, Boyang Huang, Barna Saha, Yinzhan Xu, and Christopher Ye published "Subquadratic Algorithms and Hardness for Attention with Any Temperature" — accepted as an ICLR 2026 poster on April 25, 2026 — which extends the earlier Alman-Song result. The headline: under the Strong Exponential Time Hypothesis, exact softmax attention requires n^(2 − o(1)) time even when the head dimension is as small as logarithmic in n — equivalently, even when the sequence length is exponentially large in the head dimension. In plain English: at the head dimensions actual production models use, there is no truly subquadratic algorithm for exact softmax attention. Every paper claiming subquadratic attention is, mathematically, approximating.
That is a hard floor. It says the contest between approximate-subquadratic and exact-quadratic was never going to be settled by a smarter algorithm. The only way to get below n² is to give something up. The 3:1 hybrid architecture is one principled answer to what to give up: give up exact softmax for three quarters of the tokens, and keep it for the quarter that actually needs it. That answer falls out of the theory cleanly. It was hiding in the hardness results all along.
Inference economics did the selection
The most underrated sentence in the Mamba-3 abstract is the first one: "Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality." That sentence would not have been written in 2022. It would not have cleared a peer review pass in 2023. By March 2026, it is the standard motivation paragraph for any architecture paper that wants to be taken seriously by a frontier lab.
Inference economics is what selected for the architectures that survived. Through 2024, the bottleneck for frontier LLMs was training compute. Architectures that were cheaper at training but slightly worse at quality lost — because training compute was the budget, and quality was the deliverable. Through 2025, the bottleneck shifted. Test-time compute scaling — chain-of-thought reasoning, iterative refinement, agentic loops, deep-research workflows — emerged as a primary axis of capability. OpenAI's o-series, Anthropic's reasoning models, DeepSeek's R1, Google's Deep Think, the Qwen and Kimi reasoning lines — every frontier system shipped on the premise that you could spend ten or a hundred or a thousand times more inference compute per query and get measurably better answers. The dominant compute cost shifted from training runs to production inference.
That shift changed which architectures looked good. An architecture that is 5% worse on quality but 90% cheaper at inference was a disaster in 2023, when inference was a small fraction of total cost. It is a no-brainer in 2026, when an agent might spend $5 of inference per task and the lab's gross margin depends on whether that cost is $0.50 instead. The math flipped. Subquadratic architectures, which had spent five years being told they were almost good enough, suddenly found themselves on the right side of the unit economics.
The Ring-linear-2.0 numbers — one tenth the inference cost of a 32B dense baseline — are the cleanest example. They are not alone. Kimi Linear's 75% reduction in KV cache during long-sequence generation. Mamba-3 hitting Mamba-2's perplexity with half the state size, which means half the per-token memory bandwidth, which is the actual bottleneck for autoregressive decoding on modern accelerators. DeepSeek Sparse Attention shipping in V3.2-Exp in late 2025, with reported gains around 3x cheaper prefill and 7x cheaper decoding at 128K context. Token Sparse Attention (Jo, Kang, Song, Kim — ICML 2026, arXiv:2602.03216) delivering 3.23x attention speedup at 128K context with less than 1% accuracy degradation.
The lab-economics implication is that inference substrate has become a competitive moat. Anthropic, OpenAI, Google, DeepSeek, Moonshot, ByteDance, Alibaba — every frontier lab is now building its own inference architecture, with subquadratic substrates increasingly central. The capabilities are converging — every frontier model is more or less GPT-5.5-class on standard benchmarks, which is the Model Convergence Pressure frame from the Map — but the inference cost per agentic task can vary by an order of magnitude depending on architecture choices. That is the moat. The lab with the cheapest tokens at acceptable quality wins the agent economy, because the labs above them in capability have to charge more, and the labs below them in capability cannot afford the quality.
This is the deeper context for the site's Reasoning as Billing Axis frame. Test-time compute being a billing tier presumes that test-time compute can be served at margin. Without a subquadratic substrate, you cannot actually serve unlimited-reasoning tiers at margin — you can sell them at a loss as a market position, but the unit economics force rationing. With a subquadratic substrate, you can serve them as a real product. The architecture choice is the economics choice is the product choice.
There is a critique of this framing that pushes back on the "subquadratic is winning" narrative, articulated most sharply in a January 2026 LessWrong post. The critique notes that DeepSeek Sparse Attention does not actually reduce the KV cache size during decoding, because the 2048 tokens it attends to are different for every generated token and only known when that token is generated. So the inference savings are real but partial — you save on attention compute, but you still pay full memory for the KV cache. The critique is correct as far as it goes. It does not undermine the architectural shift; it sharpens what the shift is for. The first generation of subquadratic methods — Performer, Linformer, Reformer — hit walls in actual production because they optimized for FLOPs while ignoring memory bandwidth. The current generation — DSA, KDA, Mamba-3, NAtS-L, the Subquadratic Inc. SSA architecture — is being designed against the real production constraints, including KV cache memory bandwidth and per-batch throughput on commodity accelerators, not just FLOP count. The benchmark that selects them is no longer asymptotic complexity. It is the inference bill.
The case in front of us
The starkest illustration of where this is heading shipped on May 5, 2026. A startup called, unhelpfully for an essay like this one, Subquadratic announced a model with a 12-million-token context window that outperforms GPT-5.5 on the multi-reference retrieval benchmark MRCR v2 — 12M tokens, against a benchmark where the previous best score from a frontier lab was 74.0%, and where Claude Opus 4.7 sits at 32.2%. The company's CTO Alex Whedon describes the architecture as "Selective Sparse Attention" or SSA — a successor to DeepSeek's NSA and DSA line that addresses what is starting to look like the central bug of the current sparse-attention generation: the indexer that decides which keys to attend to is itself quadratic.
The indexer trap is worth understanding because it is the technical reason "subquadratic" has been a contested word for the last year. DeepSeek's NSA, which won the ACL 2025 best paper award, routes attention to a small subset of selected keys. The attention over those selected keys is genuinely sparse — but the lightning indexer that scores every query against every key to decide which keys to select is doing n × n work. The asymptotic complexity is dominated by the indexer, not by the attention it routes. So a paper can plausibly claim sparse attention while the underlying compute is still O(n²) — just with a smaller constant.
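The trap is easiest to see in code. Here is a deliberately naive NumPy sketch; the structure and names are mine, not DeepSeek's kernel:

```python
import numpy as np

# Schematic of the indexer trap. The attention over selected keys is
# genuinely sparse, but the indexer that feeds it scores every query
# against every key.

def sparse_attention_with_dense_indexer(Q, K, V, k=2048):
    n, d = Q.shape
    k = min(k, n - 1)
    # Indexer: n x n scores. This single line keeps the whole pipeline
    # O(n^2), just with a smaller constant than full softmax attention.
    scores = Q @ K.T                                     # O(n^2 * d)
    top_k = np.argpartition(-scores, k, axis=1)[:, :k]   # k keys per query
    out = np.empty_like(Q)
    for i in range(n):                                   # O(n * k * d): sparse
        s = Q[i] @ K[top_k[i]].T / np.sqrt(d)
        w = np.exp(s - s.max())                          # softmax over k keys
        out[i] = (w / w.sum()) @ V[top_k[i]]
    return out
```

Escaping the trap means making the selection step itself subquadratic without the selection quality collapsing.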
SSA's pitch — and I am taking the company's word for it pending technical writeups, with appropriate skepticism — is that it does what DSA tried to do without the indexer trap. The architectural details have not been fully published yet. The benchmark numbers, if they hold up under independent replication, are the kind that change the shape of the field: 12M tokens of context where retrieval actually works, on inference economics that are presumably better than running softmax attention over 12M tokens, which is currently infeasible on any commodity hardware regardless of cost.
The deeper signal is what the company chose to be. Subquadratic the architecture became Subquadratic the company. Not "advanced attention" or "next-gen LLM" — the literal complexity-class name became a brand. That is what happens when a technical concept stops being a research direction and starts being a product category. It is also what happens when capital starts pricing in the inference-substrate thesis. The agent economy needs a tier of compute that softmax attention cannot serve at margin, and the company that names itself after the floor of that tier is making a particular bet about which side of the unit-economics divide the next ten years sit on.
The case also illustrates a move adjacent to the site's Compliance as Differentiation frame, one I do not yet have a clean name for. Call it Substrate as Differentiation: the pattern where the underlying compute architecture becomes a competitive moat that capability convergence cannot easily erase. The capability of a 70B-parameter frontier model is fairly fungible across labs by mid-2026. The inference economics of running that 70B model on a 3:1 hybrid substrate versus a pure softmax substrate is not fungible. It is a 10x cost difference. That difference is what makes a sustainable agent business possible at margins below 70%. It is the same shape as Anthropic's bet on Claude Code's runtime economics, the same shape as OpenAI's bet on its inference stack, the same shape — in a different layer — as DeepSeek's bet on training efficiency. The lab that gets the cheapest tokens at acceptable quality wins.
Per-token routing is what comes next
The 3:1 hybrid ratio is layer-level routing. Decide, at architecture design time, that every fourth layer will be full attention and the rest will be subquadratic, and send every token through every layer. That is the current consensus. It is also a transitional consensus, because the next move is obvious and several papers are already chasing it.
The next move is per-token routing — deciding, at inference time, on a token-by-token basis, whether a given token in a given layer needs full softmax or can be served by a subquadratic operator. The intuition is straightforward: not every token needs the same treatment. A boilerplate connective word in the middle of a generated response does not need full recall against every prior token. A unique entity name that the agent will reference 50,000 tokens later does. The current layer-level hybrid wastes softmax cost on the connectives and underfunds it on the entity names. A learned per-token router would fix both.
NAtS-L — Neural Attention Search Linear, published February 4, 2026 by Difan Deng and collaborators at Leibniz Hannover — is the cleanest articulation. The paper applies both linear attention and softmax attention operations within the same layer on different tokens, with a learned routing mechanism that decides per-token which operator to use. Tokens that can be encoded into fixed-size hidden states are routed to linear attention. Tokens that contain information likely to be needed for long-term retrieval are routed to softmax. The architecture targets a specific failure mode of the layer-level approach: every token pays softmax cost in every fourth layer regardless of whether it needs to, and no token pays softmax cost in three out of four layers regardless of whether it would benefit from it.
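A minimal sketch of the routing idea follows, assuming a hard top-fraction gate; NAtS-L's actual mechanism is learned end-to-end and differs in detail:

```python
import numpy as np

# Per-token operator routing within a single layer (illustrative only).
# A learned scorer guesses which tokens carry information that will need
# exact recall later; only those get the expensive softmax path.

def route_tokens(hidden, router_w, softmax_budget=0.25):
    """hidden: (n, d) token states; router_w: (d,) learned scoring vector.
    Returns indices routed to full softmax vs. linear attention."""
    scores = hidden @ router_w            # (n,) "will this need recall?"
    k = max(1, int(softmax_budget * len(scores)))
    order = np.argsort(-scores)
    return order[:k], order[k:]           # anchor tokens, flow tokens

rng = np.random.default_rng(0)
anchors, flow = route_tokens(rng.normal(size=(1000, 64)),
                             rng.normal(size=64))
```

The layer-level 3:1 hybrid is the degenerate case of this design: the same 25% budget, but spent uniformly at design time rather than where the router thinks recall is actually needed.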
Token Sparse Attention attacks the same problem from a different angle — per-head per-token selection at each layer, with a reversible design that scatters the attention output back into the original sequence dimension so token relevance can be re-evaluated across layers. The system gets the 3.23x speedup I mentioned earlier with under 1% accuracy degradation. The paper's framing is mature in a specific way: the authors note that their approach is complementary to existing sparse-attention kernels and dense-attention implementations. They are not trying to win; they are trying to integrate. That is what a maturing field looks like.
What this implies for the next generation of agent infrastructure is concrete. The layer-level 3:1 ratio is a coarse approximation of the right answer. The right answer is a learned routing function that operates per-token, per-head, per-layer — and the routing function can itself be trained via the same Test-Time Regression principles that produce the underlying recurrences. The compute substrate is becoming continuous in two senses: the operator space is continuous (TTR unification) and the routing space is continuous (per-token rather than per-layer).
The lab that ships production-quality per-token routing first will own the substrate. That lab is not necessarily the lab that ships the best transformer or the best Mamba. It is the lab that figures out, at scale, stably, in a way that holds up across diverse agent workloads, which tokens need recall and which do not. That is a research problem. It is also an infrastructure problem — you need kernels that can switch operators per-token without losing throughput. It is also a pricing problem — per-token routing decisions translate directly into per-token compute cost, and the lab needs to bill in a way that reflects that without exposing the user to architecture-level concepts.
There is a second-order implication for the site's Model Convergence Pressure frame. The frame says that model raw capability is converging across labs and that gains are shifting to topology, rubrics, memory, and scaffolding. Per-token routing is exactly the kind of topology gain the frame predicts. The labs that diverge from the pack in 2026 and 2027 will diverge on substrate, not on base-model quality. The standard benchmarks will not show the divergence cleanly — every lab will still post GPT-5.5-class results — but the inference bills will. So will the unit economics of the products built on top.
The substrate the transformer was never designed for
The five-year project to dethrone the transformer was the wrong project, but it taught the field something valuable. It taught the field what an agentic workload actually looks like — not the workload transformers were designed for, but the workload the agent economy is now generating. Most tokens are flow. A minority of tokens are anchor. Treating them the same is what made softmax attention the bottleneck of the inference tier. Treating them differently is what subquadratic architectures are finally doing.
The 3:1 ratio is the empirical token budget of the agent economy in 2026. It will refine — to 6:1 for some workloads, to 2:1 for others, eventually to per-token routing that makes the ratio per-token rather than per-architecture. The math underneath has stopped pretending to be a zoo of operators and started being honest about what it is: online learning over the KV stream, with a continuous design space inherited from forty years of optimization research. The economics underneath has stopped pretending that training compute is the binding constraint. It is inference compute, served at agent-economy scale, billed per token, with margins set by substrate choice.
Subquadratic architectures did not kill the transformer. They found out what fraction of a transformer's job nobody actually needed.