You Can't Buy Sonnet
The $5,000 AI mini PC market sells against a model nobody is offering. A structural map of why the hybrid stack is the only architecture that survives — and what to actually do in May 2026.
The $5,000 AI mini PC market has a quiet structural problem. The buyer is shown spec sheets that imply Sonnet-class capability is one tier of unified memory away, one bandwidth notch up, one more parameter count, one more quantization step. The math always falls just short. It will always fall just short. The model the buyer actually wants isn't on the menu, and the model that is on the menu loses more than the benchmark gap suggests — particularly on the part of coding work that hardware reviews never measure.
This isn't a complaint about open-weight models. They've gotten remarkable. Qwen3-Coder-Next, an 80-billion-parameter mixture-of-experts with 3 billion active weights per token, hits 70.6 on SWE-bench Verified — within ten points of Claude Sonnet 4.6's 79.6 on a benchmark that didn't exist three years ago. The open-weight ecosystem in 2026 is producing models that, in isolation, are genuinely impressive.
What's broken is the buyer's mental model. The spec sheet doesn't tell the truth about what these machines can do. The category name itself — "AI mini PC" — covers two completely different products with the same numbers on the box. The benchmark that's used to compare local-to-cloud measures the wrong work. The arithmetic that gets the buyer to "$5K saves me API fees" doesn't hold up. And the operational reality of running a model on hardware you own is qualitatively different from running it via API, in ways that buyer guides systematically ignore.
The right way to think about a $3,000-to-$5,000 local AI purchase in 2026 is structural. The economics, the hardware, the software, the model class, and the workflow all need to line up. Most of the time they don't, and the buyer ends up with the wrong machine for the wrong workload at the wrong point in the price cycle. This is a structural map of how to think about every layer of that decision — and where each layer breaks.
The frame that resolves the confusion is hybrid. Local handles volume; cloud handles difficulty. Anyone running serious coding work local-only is doing it wrong — not because the local models are bad, but because the architectural assumption that you need one capability tier for all work is wrong. The right question to ask before spending money on hardware isn't can local match Sonnet. It can't, and it won't this year. The right question is what fraction of my coding work can move to local without quality loss. That number is real, it isn't small, and it isn't the same question hardware reviews are answering.
The hybrid frame also resolves the economic argument. Local-instead-of-cloud rarely beats hybrid-with-cloud on cost when the analysis includes depreciation, opportunity cost, and the fact that cloud models keep improving for free during the local hardware's useful life. The decoupling — a modest laptop for portability, a dedicated AI box for inference — is structurally cheaper and operationally easier than a top-spec MacBook Pro that tries to be both.
And the hybrid frame resolves the model question. The reason 128GB unified-memory boxes suddenly make sense in 2026, when a year ago they were a curious tradeoff, isn't that the hardware got better. It's that the model class changed. Mixture-of-experts architectures rewrote the rules of what "fits" on local hardware — and the right hardware for the right MoE looks different from the right hardware for dense models. The boxes are catching up to the architecture, not the other way around.
What follows is a layer-by-layer structural map of where local AI hardware for coding stands in May 2026, what the spec sheets get wrong, what the operational reality looks like, and what to actually do.
The category itself is misleading
"128GB mini PC" sounds like one product. It's actually two products with the same number on the box. The first is the unified-memory AI workstation class — AMD Strix Halo (the Ryzen AI Max+ 395) and Apple Silicon. The second is the Intel SO-DIMM class — Core Ultra 200S series boxes with 128GB of DDR5 memory in standard SO-DIMM slots. The number on the marketing material is identical. The use case differs by an order of magnitude.
Strix Halo runs LPDDR5X-8000 on a 256-bit memory bus, advertised at 256 GB/s theoretical bandwidth and measuring around 215 GB/s in practice — about 84% of the theoretical ceiling. The Apple M4 Max reaches 546 GB/s on its full-spec variant, or 410 GB/s on the binned one. The M3 Ultra clears 819 GB/s. The Intel SO-DIMM class, even at DDR5-7200 in dual-channel, lands around 115 GB/s. The CUDIMM configurations that push DDR5 to 10,000 MT/s can theoretically hit 160 GB/s but are rare in actual mini-PC builds.
This matters because the speed at which a local model generates tokens is bandwidth-bound. Generating one token from a dense transformer requires reading every active weight from memory, once, per token. The arithmetic is mechanical: peak tokens-per-second equals memory bandwidth divided by the size of the active weight set. A 42GB dense model on the Strix Halo's measured 215 GB/s bandwidth caps out at about 5.1 tokens per second. Real benchmarks hit 4.8. The number is essentially a physical constant, not a software optimization target.
Run the same arithmetic on an Intel 128GB box. The same 42GB model on ~115 GB/s memory bandwidth caps out at 2.7 tokens per second. Run it on Apple's M3 Ultra at 819 GB/s and the cap is 19.5 tokens per second. The "128GB" buyer in all three cases has 128 GB of memory, technically. Only one of those three buyers has a coherent local AI machine. The other two are buying memory capacity for a category — virtual machines, large datasets, browser-tab maximalism — that has nothing to do with AI inference.
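The arithmetic is simple enough to run yourself. A minimal sketch, using the bandwidth figures above and the same 42 GB dense-70B footprint; the ceilings shift with whatever your own box actually measures:

```python
# Bandwidth-bound ceiling on token generation: every active weight is read
# once per generated token, so peak tok/s = bandwidth / active weights.

def peak_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    return bandwidth_gb_s / active_weights_gb

boxes_gb_s = {
    "Strix Halo (measured)":   215,
    "Intel SO-DIMM DDR5-7200": 115,
    "Apple M3 Ultra":          819,
}
model_gb = 42  # dense 70B at roughly Q4

for name, bw in boxes_gb_s.items():
    print(f"{name:26s} {peak_tokens_per_sec(bw, model_gb):5.1f} tok/s ceiling")
# Measured numbers land a little under these ceilings (4.8 vs 5.1 on Strix
# Halo); software can approach the ceiling, never beat it.
```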
Most buyer confusion starts here. The marketing collapses the distinction because the number is identical and the categorical difference isn't visible on the spec sheet. The reviewer benchmarks the Intel box on its actual category — multi-VM workloads, parallel compilation, mixed productivity — and concludes it's a great mini PC. The reviewer benchmarks the Strix Halo on AI workloads and concludes it's a great mini PC. The buyer reads both reviews, sees "128GB" twice, and picks on price. The price-pick on a number that doesn't measure the workload is the most common mistake in this market.
RAM isn't one number, it's three
The deeper structural error is treating memory as a single quantity. For AI inference, three numbers matter, and they're independent. Capacity is whether the model and its KV cache fit at all. Bandwidth is how fast the chip can read weights to generate the next token. Location is whether memory is system RAM (slow, but cheap and plentiful), discrete VRAM (fast, expensive, capped at 32GB on consumer cards), or unified (medium-fast, medium-priced, large).
The priority order for local AI on a buy-decision is capacity, then bandwidth, then iGPU class, then NPU TOPS. Capacity is the binary first filter — can the model even load? Bandwidth determines how fast the model runs once loaded. iGPU class determines whether the compute side keeps up with the bandwidth side (and on Strix Halo, the Radeon 8060S is overprovisioned relative to bandwidth, which means bandwidth is reliably the bottleneck rather than compute). NPU TOPS is decoration, for reasons we'll get to in a moment.
The marketing collapses these on purpose. "128GB" is capacity. "256 GB/s" is bandwidth. "90 NPU TOPS combined!" is something else entirely. They appear on the spec sheet as if they're equally meaningful. Two of them describe real product capabilities; the third is a synthetic benchmark in a different unit that doesn't predict any actual workload. Treat them as one composite "memory spec" and you'll buy the wrong thing.
Unified memory is the magic word
A 24GB RTX 4090 can't load a 70B model at any quantization that preserves quality. The model doesn't fit, period. A 128GB Strix Halo or Apple Silicon box loads it comfortably. Discrete GPUs win on tokens-per-second when the model fits — 1,008 GB/s of GDDR6X bandwidth on the 4090, 1,792 GB/s on the GDDR7-equipped 5090 — but the capacity wall is the harder constraint at the high end. Unified memory wins not by being faster but by making the big-model question solvable at all.
This is the architectural pivot that 2026's local hardware reflects. For dense models in the 7B-30B range, where the model fits in 24GB or less, the RTX cards remain the right answer — the 5090's 1.79 TB/s of bandwidth gets you tokens-per-second that no unified-memory box can match. For dense models in the 70B class, unified memory is the only option in the consumer price range — and as we'll see, the right answer there is to not run dense 70B models at all, because mixture-of-experts now offers a better point on the curve.
The discrete-vs-unified choice is structural, not preference. If your target model fits in 24-32GB at acceptable quantization, discrete wins on speed. If your target model needs 40-100GB of memory at any quantization that preserves quality, discrete is unavailable in the consumer tier and unified is the only architecture that solves the problem. The "best of both worlds" hardware doesn't exist at consumer prices, and the marketing that implies it does is selling against an imagined product.
Inference speed is bandwidth divided by active weights
The single most useful piece of intuition for local AI hardware is this formula: peak tokens-per-second equals memory bandwidth divided by the size of the active weight set per token. It's the bandwidth-bound part of generation, and it sets a hard physical ceiling that no software optimization can break.
The implication is that two boxes with the same model loaded but different bandwidth produce predictably different throughputs in fixed ratio. Two boxes with the same bandwidth but different models — say, a dense 70B versus an 80B-A3B MoE — produce throughputs that differ by the ratio of active weight sets. A box that doesn't change but loads a different model changes its tokens-per-second by the same ratio.
This intuition resolves a lot of buyer confusion. The buyer who reads "Strix Halo runs 70B at 5 tokens per second" and concludes the hardware is slow is comparing to the wrong reference. The hardware is running at 94% of its bandwidth ceiling. The model is the bottleneck, not the hardware. Switch the same hardware to an MoE with one-tenth the active weight set and the throughput jumps by an order of magnitude. The "slow" hardware is suddenly competitive on a different model class — same chip, same memory, same drivers.
The formula also predicts what kind of hardware upgrades produce what kind of throughput gains. Doubling memory capacity (64GB to 128GB) doesn't change tokens-per-second at all; it only changes which models can load. Doubling memory bandwidth doubles tokens-per-second for any loaded model. Doubling iGPU compute does very little for token generation (the workload is bandwidth-bound) but helps prompt processing materially.
The spec sheet lies in three places
Beyond the category collapse, the spec sheets for local AI hardware lie in three specific places that hurt buyers. The first lie is about throughput. The second is about acceleration. The third is about sustained performance.
"Tokens per second" is two numbers, not one
The benchmark headline most reviewers cite is tokens-per-second. This is two numbers that get reported as one, and the difference matters.
Prompt processing — the speed at which the model ingests a context window before generating its first token — is compute-bound. Big discrete GPUs dominate here because the matrix multiplications saturate compute units that unified-memory iGPUs don't have. Token generation — the speed at which the model produces output tokens one at a time, autoregressively — is bandwidth-bound. Unified-memory boxes compete here because the bottleneck is memory access, not compute.
The practical consequence: Strix Halo ingests context roughly 5× slower than an Nvidia workstation card, but once generation starts, the two are closer in throughput. Pasting a 50-page PDF into a chat session feels noticeably sluggish on the local box; chatting with already-loaded context feels fine. The benchmark reviewer testing a 200-token prompt barely exercises prompt processing and sees numbers that look competitive with a discrete GPU. The actual user, working with a long codebase context, spends most of the wait in ingestion and feels the gap immediately. These are different machines for different workloads, sold under the same headline number.
The newest benchmark posts have started distinguishing these correctly — Strix Halo's recent Vulkan AMDVLK driver lands at 38.65 t/s for token generation on Q6_K_XL quantizations but manages only around 358 t/s on prompt processing, where it lags discrete cards badly. Both numbers matter. Most buyer guides report only the bigger one.
The user-facing implication is that the "tokens-per-second" you experience depends entirely on which phase you spend your time in. A coding workflow with mostly short prompts and longer generations (write me this function) is mostly generation-bound, and unified memory holds up. A workflow with long prompts and short generations (here are five files, answer this specific question) is mostly prompt-processing-bound, and unified memory feels noticeably worse than discrete. Most real coding workflows are a mix; the felt performance depends on the mix.
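To see how the mix sets the felt performance, here is a rough sketch of end-to-end latency for the two workflow shapes. The prompt-processing and generation rates are illustrative placeholders in the ballpark discussed above, not measurements of any particular box:

```python
# Felt latency = prompt ingestion time + generation time.
# pp = prompt-processing rate (compute-bound), tg = generation rate
# (bandwidth-bound). All rates below are illustrative, not benchmarks.

def latency_s(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s

workloads = {  # (prompt tokens, output tokens)
    "short prompt, long generation": (800, 600),
    "long prompt, short generation": (20_000, 300),
}
machines = {  # (pp tok/s, tg tok/s)
    "unified-memory box": (350, 40),
    "discrete GPU":       (2_500, 60),
}

for wname, (p, o) in workloads.items():
    for mname, (pp, tg) in machines.items():
        print(f"{wname} | {mname}: {latency_s(p, o, pp, tg):5.1f} s")
# The unified-memory box feels fine on the first shape and painfully slow
# on the second; a single "tokens per second" headline hides the split.
```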
NPU TOPS is theater
The NPU — the dedicated neural processing unit baked into modern CPUs — is the most misleading entry on any local AI spec sheet. Strix Halo advertises 90 combined AI TOPS, which is the sum of approximately 40 GPU TOPS and 50 NPU TOPS. The first 40 are usable. The second 50 are decoration. Every coding-relevant local inference runtime — LM Studio, Ollama, llama.cpp, vLLM — ignores the NPU entirely. Models route through the iGPU.
This isn't a temporary state of affairs. The NPU's architecture is optimized for INT8 inference of small models with short context — the workload Microsoft envisioned for Copilot+ PCs, where a 7B model with maybe 4K of context does a discrete inference per user request. NPU-optimized models cap at around 3,696 input tokens. A real coding-agent workload needs 21,000-plus. The NPU literally cannot serve the workload most local buyers think they're buying for.
Worse: on the same chip, the NPU is often slower than the GPU for AI workloads it was supposedly designed for, because the open-source runtimes haven't built the NPU acceleration paths and the vendor-specific paths are immature. Intel's IPEX-LLM project is the most serious attempt to bridge this — supporting Intel GPU, NPU, and CPU acceleration with hooks into llama.cpp and Ollama — but it's not the path the open-source local AI community actually runs on. The Copilot+ PC NPU sits at near-zero utilization on most local-AI users' machines.
If you're picking hardware based on TOPS, you're picking on a synthetic INT8 benchmark that doesn't predict real performance. The number that matters for token generation is iGPU memory bandwidth times utilization. The number that matters for prompt processing is iGPU compute throughput. NPU TOPS is uncorrelated with both. It's the entry on the spec sheet most likely to drive an irrelevant decision.
The cleanest evidence for this is the Copilot+ PC story. Microsoft launched the Copilot+ PC category in 2024 with NPUs as the marquee feature — 40+ TOPS, dedicated AI accelerator, the future of on-device AI. The reality two years later is that the actually-shipping open-source AI ecosystem runs on Windows without using any Microsoft AI hooks. Every Copilot+ PC's NPU sits idle while users hit the GPU. The hardware feature that defined the category is bypassed by the software the buyers actually use.
KV cache is the silent killer
Model weights are only half the memory math. The other half is the key-value cache — the per-token tensors that store the attention state of the running context. Every token in the context window — input plus generated output — adds another KV vector to the cache. The cache grows linearly with context length.
For Llama 3.1 70B with grouped-query attention (8 KV heads, 80 layers, 128 head dimension, BF16 precision), the per-token KV cache cost is approximately 0.31 MB. At 32K context, that's ~10 GB of cache. At 128K, it's ~40 GB. The model weights at Q4 are around 42 GB. Add the 10 GB KV cache at 32K and the working memory footprint is 52 GB. Add 40 GB at 128K and it's 82 GB. A 128GB box has only 46 GB of headroom at the 128K context the buyer was promised.
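The per-token figure falls straight out of the attention geometry. A sketch of the arithmetic, using the Llama 3.1 70B configuration just quoted:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per value.
# Defaults below are Llama 3.1 70B with grouped-query attention at BF16.

def kv_cache_gb(context_tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_val=2):
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val  # ~0.31 MB
    return context_tokens * per_token_bytes / 2**30

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens of context -> {kv_cache_gb(ctx):5.1f} GB of cache")

weights_gb = 42  # Q4 weights for the 70B
print(f"working set at 128K context: {weights_gb + kv_cache_gb(131_072):.0f} GB")
```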
Most buyers don't run the cache math. They look at the model weight footprint, compare to system memory, and conclude they have margin. They open a long conversation, the context fills, the cache grows, the model hits the page boundary and starts paging — or simply OOMs. The same rule of thumb applies across model families: budget at least 1.5× the model weights for comfortable headroom at typical contexts, and 2× for long contexts. A model that's 90% of your memory will OOM the moment you have a real conversation.
This is also why "running 70B" on a 64GB Mac mini is mostly fiction. The model technically loads, at aggressive quantization, but the cache and any working memory push the box into thrashing or OOM the moment context fills. The 128GB tier exists for a reason — not for the model itself but for the cache and the workspace around it.
The cache problem compounds the context problem. Open-weight models with claimed long-context support (Llama 3.1's 128K, Qwen3's 1M variants) generally have working context that falls off well before the claimed limit, but even within their working range, the cache cost can dominate the memory budget. A "1M context" model that can't fit 1M tokens of KV cache in your physical memory isn't actually a 1M context model on your hardware.
Software stack is a hierarchy, and Tier 4 is marketing
The buyer who sees "Linux supported" on a Strix Halo box and the buyer who sees "Linux supported" on an NVIDIA workstation are reading the same English words about completely different software realities. The local AI software stack has four tiers, and only the first two are usable without becoming a sysadmin.
Tier 1 is CUDA on NVIDIA hardware and MLX on Apple Silicon. Both work out of the box for every major inference runtime. Install LM Studio or Ollama or llama.cpp; load a model; it runs. The driver is mature, the inference kernels are optimized, the model formats are supported, the bug surface is small. This is the "just works" tier.
Tier 2 is ROCm on Linux for AMD GPUs (including Strix Halo's iGPU) and Vulkan as a vendor-agnostic backend. Both work, but with caveats. The setup is more involved. The model coverage is good but not complete. Performance is generally good but variable across driver versions. Recent Strix Halo benchmarks show Vulkan AMDVLK beating ROCm by 16% on token generation for some quantizations — meaning the choice of backend on the same hardware matters more than it should. This is the "works but you're maintaining it" tier.
Tier 3 is ROCm on Windows, any NPU acceleration path on any vendor, and any vendor-specific SDK that isn't CUDA or MLX. These exist; they're shipping; the documentation says they support inference. In practice, getting them working reliably is a science project. The bug reports on GitHub are full of "works on my machine" comments and version-specific recipes. This is the "you'll spend a weekend on it" tier.
Tier 4 is everything that appears on the box but doesn't have a maintained inference path. Most NPU TOPS marketing falls here. Most "AI acceleration" claims from chipmakers fall here. The TOPS exist on silicon; no runtime calls them. This is the "decoration" tier.
The actually-shipping open-source AI ecosystem in 2026 is built on Tier 1 and Tier 2. Buyers picking hardware based on Tier 3 capability (because the spec sheet promises it) are picking on a feature that won't be usable for the useful life of the hardware. Buyers picking based on Tier 4 capability (because the marketing emphasizes it) are picking on a feature that will never be usable for any inference workload anyone actually runs.
Thermal throttling is the spec-sheet lie that lives between the lines
The spec sheet lists a TDP. The TDP is the rated thermal envelope. The chassis the chip ships in either dissipates that envelope under sustained load or it doesn't. Most don't. A 45W chip in a small fanless or near-fanless mini PC chassis hits its thermal limit in minutes. The same chip in a larger chassis with better cooling holds peak under sustained load for an hour or more.
The performance gap between two chassis with the same chip can be 30% or more under sustained inference workloads. Reviewers benchmark for 60 to 90 seconds, capturing peak. Real coding-agent workloads run for hours of inference scattered throughout a workday. The number that matters is sustained TDP after ten minutes, not peak. That number doesn't appear on any spec sheet.
The implication for the buyer: chassis matters as much as chip. A Strix Halo board in a cramped, barely ventilated case throttles harder than the same board in a roomier chassis with real airflow. The difference between two boxes nominally specced identically can be the difference between 4 t/s sustained and 5 t/s sustained on the same workload. That's not "rounding error" performance; it's a 25% throughput swing that compounds across every coding session.
The thermal lie also interacts with the laptop-vs-desktop question. Laptops are thermally constrained by definition; sustained inference on a thin-and-light is a slow march toward throttling. A 128GB MacBook Pro running an agentic coding workload for an afternoon will throttle the chip and the user will see worse performance after the first hour than during the first minute. A desktop or mini-PC in a better-cooled chassis holds peak. The "AI workstation laptop" sells against a thermal reality it can't actually deliver.
MoE rewrote the rules in 2026, and the buyer's guides haven't caught up
The structural shift that made 128GB unified-memory boxes interesting in 2026 isn't hardware. It's a model architecture change. Dense transformers read every parameter for every token. Mixture-of-experts transformers read only a subset — the "active" experts routed at each layer — while keeping the rest dormant in memory. Same model size on disk; an order of magnitude less compute and bandwidth per token.
Qwen3-Coder-Next is the practical example. Total parameter count: 80 billion. Active parameters per token: 3 billion. Memory footprint at Q4: roughly 42 GB. Tokens-per-second on Strix Halo: ~52, compared to 4.8 for a dense 70B at the same memory footprint and the same hardware. Same box. Same memory bandwidth. Roughly ten times the throughput, because the active weight set the bandwidth has to feed per token is a small fraction of the total. The gain lands below the raw 27:1 parameter ratio, since shared weights, routing, and the KV cache still have to be read for every token, but it is an order of magnitude all the same.
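A back-of-envelope way to read those two numbers: divide measured bandwidth by observed throughput and you get the effective memory traffic per token, which is why the gain is roughly 10× rather than the full parameter ratio; shared weights, routing, and KV reads don't shrink with sparsity. The inputs below are the figures quoted above:

```python
# Effective memory traffic per generated token = bandwidth / observed throughput.
bandwidth_gb_s = 215  # measured Strix Halo bandwidth

observed_tok_s = {
    "dense 70B at Q4 (42 GB)":          4.8,
    "80B-A3B MoE at Q4 (42 GB total)": 52.0,
}

for model, tok_s in observed_tok_s.items():
    print(f"{model:34s} ~{bandwidth_gb_s / tok_s:4.1f} GB read per token")
# Dense: ~45 GB per token, essentially the whole model.
# MoE:   ~4 GB per token, i.e. the ~1.6 GB of active experts at Q4 plus the
#        shared weights, routing, and KV reads that don't shrink with sparsity.
```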
This is the change that retroactively justifies the unified-memory architecture. A year ago, a 128GB box at 256 GB/s was a curious tradeoff: enough memory to hold a 70B dense model, not enough bandwidth to run it at usable speed. The right answer in late 2024 was either a 24GB-32GB discrete GPU running a smaller dense model fast, or a multi-GPU server farm running large dense models with proper bandwidth. The mid-tier 128GB-at-256GB/s point was awkward — too slow for dense large models, oversized for dense small models.
MoE moved that point from awkward to optimal. The active weight set on a Qwen3-class MoE is small enough that 215 GB/s of measured bandwidth is genuinely fast. The total weight footprint is large enough that you actually need 128GB to hold it. The hardware specification that looked like a compromise in 2024 looks like a perfectly-targeted product in 2026, not because the hardware changed but because the model class did.
The implication for buyers is that the right hardware in May 2026 looks different from the right hardware in May 2025. The buyer's guides published last year that recommended discrete GPUs for everything because "bandwidth is king" are looking at the wrong reference workload. The buyer's guides that recommended Apple Silicon or Strix Halo because "you need the memory" were directionally right but couldn't articulate why — until MoE made the memory-bandwidth tradeoff coherent.
The model class is also where the open-source ecosystem is currently making the most progress. Most of the meaningful releases in the past six months have been MoE designs: Qwen3-Coder-Next at 80B-A3B, GLM-4.7 at 358B (with a much higher active count), DeepSeek-V3.2 at 671B. The frontier of practical open-weight coding has shifted decisively to MoE, and the open-source community has responded by optimizing the inference runtimes around MoE-specific routing rather than dense-attention throughput.
This is important because it changes the depreciation curve on the hardware. A 128GB unified-memory box bought in 2026 is bought into a model class that's likely to remain the dominant pattern through at least 2027 — the active-parameter / total-parameter ratio is still being explored, but the broad shape of the architecture is now stable enough that hardware optimized for it has more useful life than hardware optimized for the dense paradigm.
Quantization for code: smaller-higher-quant beats bigger-lower-quant
The other model-side mistake buyers make is on quantization. The intuition "more parameters is always better" leads to picking the largest model that fits at whatever quantization makes it fit. This is wrong for coding work in a specific, repeatable way.
A 32B model at Q6_K (about 6 bits per weight, near-lossless) is a better coding companion than a 70B model at Q2_K (about 2.5 bits per weight, heavily lossy). Both occupy similar memory. The smaller model at the higher precision preserves the structure of the original weights well; the larger model at the lower precision loses subtle distinctions that show up specifically in code generation. Q4_K_M is the inflection point — useful for general chat, marginal for code. Q5_K_M is the practical floor for serious coding. Q6_K or Q8_0 is the recommended target for production coding work.
Q4_K_M introduces subtle bugs that you'll spend an hour debugging — not catastrophic failures, but off-by-one errors, missed imports, slightly wrong function signatures, method names that look plausible but don't exist, real functions called with the wrong arguments. The kind of bugs that pass a casual review and fail at runtime. The quality drop from Q5_K_M to Q4_K_M is invisible on aggregated benchmark scores (the HumanEval Pass@1 stays roughly the same) and visible in the specific failure modes that matter for production code.
"I can run the biggest model" is often the wrong frame. "I can run the best-quality model that fits at a useful precision" is the right frame. A 64GB box running a 32B model at Q6_K is materially better for coding than a 128GB box running a 70B model at Q3_K_M, even though the headline parameter count is half. The intuition that drives buyers toward the larger model is the same intuition that makes them buy the wrong machine.
The quantization choice also interacts with the KV cache budget. Lower-quantized weights leave more memory for cache, which lets you run longer context. But the model quality drop usually costs more than the extra context window buys. The right point is the highest quantization that fits at the context length you actually use — for most coding, that's Q5_K_M or Q6_K at 32K context, with a 32B-class dense or 80B-class MoE model.
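A sketch of that selection rule: subtract the KV cache you need at your working context from the memory budget, then take the highest-precision quantization whose weights still fit. The bits-per-weight values are approximate GGUF averages and the cache figure is an illustrative placeholder, not a measurement:

```python
# Pick the highest-precision quantization that still fits next to the KV
# cache you actually need. Bits-per-weight values are approximate GGUF
# averages; overhead covers runtime buffers and the OS.

QUANT_BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def weights_gb(params_b: float, bpw: float) -> float:
    return params_b * 1e9 * bpw / 8 / 2**30

def pick_quant(params_b, budget_gb, kv_cache_gb, overhead_gb=4):
    for name, bpw in QUANT_BPW.items():  # ordered high -> low precision
        if weights_gb(params_b, bpw) + kv_cache_gb + overhead_gb <= budget_gb:
            return name
    return None  # nothing fits: reach for a smaller model, not a lower quant

# 32B dense model on a 64 GB box, ~8 GB of cache at the context you use:
print(pick_quant(params_b=32, budget_gb=64, kv_cache_gb=8))  # -> Q8_0
```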
What local actually can't do for coding
The performance question and the architecture question are settled. Local can run capable models, and the right hardware for the right model class delivers respectable throughput. The honest question is what local can't do — and the answer matters for the whole purchase rationale.
The benchmark gap is real, and the practical gap is wider
Claude Sonnet 4.6, released February 2026, scores 79.6 on SWE-bench Verified. Qwen3-Coder-Next, the best open-weight coding model in widespread use, scores 70.6 — within ten points on the synthetic benchmark. GLM-4.7 narrows the gap further at 74.2, but at 358 billion parameters it doesn't fit on any 128GB box anyone calls a "mini PC." DeepSeek-V3.2 at 671B lands at 70.2, in the same band as Qwen3-Coder-Next but at a parameter count that requires server-class hardware.
A ten-point gap on a synthetic benchmark looks closable on the next model release cycle. The practical gap on real coding work is much wider, and it doesn't appear on SWE-bench Verified.
SWE-bench Verified measures the ability to patch curated GitHub issues with a known scaffold — the issue is well-described, the relevant files are roughly bounded, the scaffold provides tool-calling structure, and the model has perhaps two or three tool calls' worth of work to do. This is a useful benchmark, but it's not the workload that defines "real agentic coding."
The workload that defines real agentic coding is what Claude Code does, or what the Anthropic SDK enables when wired into an agent harness. The agent receives an under-specified request like "the deploy pipeline is broken on staging, figure out why and fix it." It plans. It reads files. It runs commands. It interprets outputs. It hits dead ends. It backs up. It tries a different angle. The execution pattern is fifty or more tool calls, branching, recovering from failures, maintaining state about what's been tried, and continuing until the original objective is satisfied or impossible. Every viable local coding model lands at "Tier C or worse" on this work — not because they're bad at any single tool call but because they drift over fifty.
This is the part of model convergence pressure that hasn't actually converged. Raw capability on single-shot benchmarks is closing. Reliability across long agentic trajectories is not. The frontier-lab investment in post-training for tool-calling reliability — the work that distinguishes Sonnet 4.6 from the model two versions back even though their SWE-bench scores look similar — is largely invisible to the open-source ecosystem because the relevant training data and the relevant reward signals aren't public. The gap on agentic reliability is widening even as the gap on benchmarks closes.
For coding work that fits the single-shot mold — write me a function that does X, refactor this file, explain what this code does, generate tests for this module — local is genuinely competitive. For coding work that fits the agentic mold — figure out why this is broken, ship a fix that touches three services, debug across a real production stack — local is not in the same league, and the gap is structural.
This is the honest comparison, and it's the one most reviews miss because they test on the wrong workload. Reviewers who run the same single-shot prompt through Sonnet and Qwen3-Coder-Next and find the outputs comparable are not wrong about the prompt they ran. They're wrong about the workload that defines the buying decision. The single-shot prompt is the volume work; the multi-turn agentic trajectory is the difficulty work; the gap shows up in the second, not the first.
Context isn't capacity
The marketing claim that local models support long contexts — 128K, 256K, 1M, increasing every model release — is technically true and practically misleading. "Supporting" a context length means the model can be given that much input without erroring. It doesn't mean the model can use that much input effectively.
"Needle in haystack" evaluations measure whether a model can retrieve a specific fact placed somewhere in a long context. Open-weight models with claimed 128K+ contexts generally show clean retrieval through about 16K-32K tokens, then degrade rapidly. The relevant question for coding is whether the model can reason coherently about a codebase given as context, not just retrieve specific facts — and on that measure, open-weight long-context performance falls off even earlier.
Frontier closed models are better at this — they hold coherent reasoning over more tokens — but even they exhibit measurable degradation past 100K-200K. Anthropic's published results on Sonnet 4.6's 1M context support describe it as "supported," with the caveat that performance is best in the 200K range and the long-tail context is meaningfully degraded.
At whole-codebase scale, none of this matters. A 100,000-line codebase is roughly 1-3 million tokens of source. No model — closed or open — has a working 3M context window in the sense that it can reason about every line as if it were locally relevant. The architecture that handles large codebases is not long-context loading; it's retrieval-augmented generation against a local vector database. Claude Code works this way. Every serious coding-agent system works this way. The buyer who imagines that a 128K-context local model will "load my codebase" is imagining an architecture that doesn't exist at usable quality at any price point.
This connects back to the hardware question. The KV cache math we worked through earlier shows that even running a 70B model at 32K context costs ~10 GB of cache memory; pushing to 128K costs ~40 GB. The hardware that can technically hold the cache for a long context can rarely use it effectively because the model itself degrades. The capacity exists for the wrong reason.
The right architecture for code on local hardware is a small-to-medium model with a working 32K context window, sitting behind a local vector store that holds embeddings of the codebase. The model handles individual queries with the relevant snippets retrieved from the vector store. The vector store is the long-context substitute. The model's context window is for the working set of the current query, not the codebase as a whole. This is how Claude Code itself operates, and it's the only architecture that actually scales to codebases of consequential size.
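A minimal sketch of that shape, with embed() and ask_local_model() standing in for whatever local embedding model and inference endpoint you actually run; the chunking and the retrieval depth are placeholders, not a specific tool's API:

```python
# Retrieval-augmented coding queries: embed the codebase once, retrieve the
# few chunks relevant to each question, and hand the local model a small
# working context instead of the whole repository. embed() and
# ask_local_model() are placeholders for your embedding model and local
# inference endpoint; chunking the source files is left to the caller.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def build_index(chunks, embed):
    return [(chunk, embed(chunk)) for chunk in chunks]  # one-time cost, stored on disk

def answer(question, index, embed, ask_local_model, k=6):
    q = embed(question)
    top = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:k]
    context = "\n\n".join(chunk for chunk, _ in top)  # typically 8-16K tokens
    return ask_local_model(f"Context:\n{context}\n\nQuestion: {question}")
```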
The hardware implication of this architecture is interesting and underrated. The vector store needs storage, not memory. A local embeddings database for a 100K-line codebase fits comfortably in a few hundred megabytes; the embedding queries are fast on any modern CPU. The model that consumes the retrieved context needs only enough working context for the immediate query — typically 8K-16K tokens, well within any open-weight model's reliable range. The total local hardware requirement for a RAG-backed coding system is significantly smaller than the requirement for a "load the whole codebase" fantasy. A 64GB Mac mini with a local Qdrant or Chroma instance is a more capable coding system than a 128GB box trying to run a model with a 1M context window.
The economics work against local-instead-of-cloud
The financial case for local-instead-of-cloud almost never works for an individual coder unless privacy or compliance forces it. The case requires assumptions that buyers tend to make optimistically: that local will replace cloud rather than supplement it, that the hardware will hold its useful life across multiple model generations, that the buyer's usage rate justifies the capital outlay. Each of these assumptions is wrong in characteristic ways, and the failure mode of getting all three wrong is roughly $5,000 of partially-used hardware.
$5,000 buys years of API at heavy daily use
Sonnet 4.6 is priced at $3 per million input tokens and $15 per million output tokens. Heavy daily coding through Claude Code or an equivalent agentic client — the kind of usage where the developer is genuinely leaning on the model for the bulk of their day — runs roughly $100-150 per month at the high end for an individual. This is a maximum-use number; most developers running serious agentic workflows still come in below $100. The number rises with prompt caching disabled or with very long contexts dominating the input, and falls with batch processing or cached prompts.
At $100/month — call it the upper end of heavy individual use — $5,000 of API credit lasts 50 months. At $150/month — genuinely all-day agentic coding with no caching — it lasts 33 months. Three years of API spending at the high end matches the price of a top-tier local AI laptop, with the added benefit that the API models keep improving for free during that period and you can switch to whatever model is best without buying new hardware.
A $5,000 mini PC, by contrast, depreciates on a curve dictated by chip generations. A top-spec M-class MacBook Pro bought for $5,000 loses roughly 70% of its resale value over three years — not because the chip becomes unusable, but because the chip two generations later is meaningfully better and the price of a used machine reflects that. The Strix Halo box bought today is on a similar curve relative to the Strix Halo 2 or whatever AMD calls the next generation.
The financial comparison is asymmetric in another way. API spend you don't incur is simply money kept; hardware depreciation is locked in the moment you buy. The variance on the upside is also asymmetric — if you stop coding for a stretch, the API spend goes to zero; the hardware sits depreciating either way.
The conditions under which the math reverses are narrow. You'd need privacy or compliance pressure that prevents cloud use entirely. Or you'd need usage rates roughly an order of magnitude above heavy individual use — a team running shared local inference, an automated pipeline running 24/7, or a research workload that genuinely scales tokens-per-day past the individual ceiling. For a single developer doing daily coding, the API math wins in almost every realistic scenario.
The further wrinkle is prompt caching. Anthropic's prompt caching reduces cached input cost by 90%, which means a developer who reuses long system prompts or codebase context across many queries pays a small fraction of the headline input rate. Batch processing halves both input and output cost. Heavy use of either drops the realistic monthly bill another 30-60%. The local hardware has no equivalent of these discounts; you pay full inference cost on every token regardless of repetition.
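The month-by-month arithmetic is worth running against your own token volumes. A sketch using the list prices and caching discount described above; the 25M-input/5M-output month is an assumed heavy-use volume, not a measured one:

```python
# Monthly API cost versus a fixed hardware outlay, at the list prices above.
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 3.00, 15.00
CACHE_READ_DISCOUNT = 0.90  # cached input billed at roughly 10% of list

def monthly_cost(input_mtok, output_mtok, cached_fraction=0.0):
    cached = input_mtok * cached_fraction
    fresh = input_mtok - cached
    input_cost = fresh * INPUT_PER_MTOK + cached * INPUT_PER_MTOK * (1 - CACHE_READ_DISCOUNT)
    return input_cost + output_mtok * OUTPUT_PER_MTOK

# Assumed heavy individual month: 25M input tokens, 5M output tokens.
for cached in (0.0, 0.7):
    cost = monthly_cost(25, 5, cached_fraction=cached)
    print(f"{cached:.0%} cached input: ${cost:6.2f}/mo, "
          f"{5000 / cost:5.1f} months before $5,000 is spent")
```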
Bundling laptop and AI workstation is the depreciation trap
A specific version of this mistake deserves naming: bundling laptop functionality and AI workstation functionality into a single high-end laptop purchase. The top-spec MacBook Pro with M5 Max and 128GB RAM is the canonical example. The reasoning the buyer applies is "one device that does both"; the financial consequence is two depreciation curves applied to one asset.
Laptop depreciation is driven by chassis wear, battery cycles, port standard changes, screen aging, the cosmetic effects of carry, and the release of newer laptop chassis with better keyboards, screens, and thermal designs. A laptop bought today is competing against the laptop refresh cycle on the next release.
AI workstation depreciation is driven by chip generation — the M5 Max competes against the M6 Max two years later, on the basis of memory bandwidth, NPU performance, and TDP/efficiency. The chip generation cadence is separate from the laptop chassis cadence; both apply.
Bundle them into one asset and the depreciation compounds. Three years in, the laptop is showing its age — the keyboard is worn, the battery is at 75%, the chassis has the scratches of being carried, and the chip is two generations behind. The bundled $5,000 purchase is now worth roughly $1,500 on the secondary market. Same money spent decoupled — a $1,500 mid-spec MacBook Air for portability, a $2,200 Mac mini M4 Pro 64GB for AI inference, leaving $1,300 in pocket — depreciates on independent curves. Three years in, the MacBook Air is worth maybe $700, the Mac mini is worth maybe $1,000, the $1,300 has earned modest interest, and the total residual position is roughly double the bundled purchase's, while the operational experience over the three years was equivalent or better.
The decoupled play also makes either replacement independent. When the M5 chip cycle is replaced by M7 and the AI inference performance gap becomes meaningful, you replace the Mac mini and keep the MacBook Air. When the MacBook Air's keyboard finally fails, you replace the laptop and keep the Mac mini. Each piece replaced on its own depreciation curve, not on the worst-case envelope of the two.
The decoupling also addresses the thermal lie. The Mac mini in its actively-cooled desktop chassis sustains peak performance for hours; the MacBook Pro running the same workload throttles after the first hour. The "AI workstation laptop" sells against a sustained-performance reality it can't deliver. The Mac mini at half the price delivers more sustained inference throughput than the laptop at twice the price, because the chassis matches the workload.
"AI box at home" beats "AI workstation I sit at"
Once you treat the AI hardware as a network service you SSH into rather than a workstation you physically sit at, the requirement profile changes dramatically. The box doesn't need a keyboard or monitor or trackpad. It doesn't need to be portable. It doesn't need to be quiet enough to share desk space. It just needs to run, expose its inference endpoint over the local network, and stay on.
This unlocks a much better price-performance frontier. The chassis can prioritize cooling over aesthetics. The components don't need to include a screen. The thermal envelope can be wider. The boot drive can be cheaper. The wireless networking doesn't need to be top-tier. Most of the cost premium of a portable AI laptop is the portable part — and once you don't need it, you don't pay for it.
The operational implications are also better. The box runs while you're at lunch. It runs while you're asleep. It runs while you're working on a different machine. It runs while you're in a meeting. The inference jobs that take long enough to be inconvenient on a laptop (run a model over 50 files in batch, generate documentation for a module, evaluate a model on a test set) become things you fire off and check back on later. The "computer I use" and the "computer I infer on" stop competing for resources and attention.
This is how serious local AI users actually run things. The buyer guides aimed at consumers consistently miss it. Reviews compare laptops because reviewers test laptops. The right architecture isn't an AI laptop or even an AI desktop; it's an AI server, sitting somewhere on the home network, addressed by the laptop you actually use.
The network architecture matters here. The local AI box doesn't need to be on the same machine as your editor. It needs to be on the same network, with a known address. Tailscale or equivalent mesh networking makes this trivial — the box has a stable name like ai-box on your network, addressable from any device, no port forwarding or VPN configuration. The editor extension is configured once with that address; the laptop talks to the box transparently from anywhere. The setup is genuinely "configure once" if done deliberately.
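In practice the configure-once step is just pointing an OpenAI-compatible client at the box's address. A sketch of the shape, assuming the box runs a server that speaks the OpenAI-compatible API (llama.cpp's server, vLLM, and LM Studio all can); the hostname, port, and model name are placeholders for your own setup:

```python
# The laptop talks to the AI box like any other API endpoint; only the base
# URL differs from a cloud provider. Hostname, port, and model name below
# are placeholders for your own setup.
from openai import OpenAI

local = OpenAI(base_url="http://ai-box:8080/v1", api_key="not-needed-locally")

resp = local.chat.completions.create(
    model="qwen3-coder",  # whatever the local server has loaded
    messages=[{"role": "user", "content": "Write a pytest for parse_config()."}],
)
print(resp.choices[0].message.content)
```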
You're being asked to buy at the worst point in the price cycle
May 2026 is, by a meaningful margin, the worst point in several years to buy local AI hardware. Memory chip prices spiked starting in late 2025, driven by AI infrastructure demand consuming the production capacity that would otherwise serve the consumer market. DRAM contract prices increased roughly 90-95% quarter-over-quarter in early 2026. The consumer-facing consequences are visible across the product lineup.
Apple removed the 512GB Mac Studio configuration in March 2026, the 256GB in April-May, and the 128GB configuration in May. The M3 Ultra Mac Studio in May 2026 maxes at 96GB of unified memory, with delivery estimates of 10 weeks. Apple's M5 Mac Studio, originally rumored for mid-2026, has been pushed to October 2026 with constrained memory configurations expected at launch. The M5 Max will support up to 128GB; the M5 Ultra up to 256GB — both below the predecessor maxes.
The PC mini PC market has tracked similar dynamics. Strix Halo boxes that launched with promised pricing under $2,000 in late 2025 are retailing well above $3,000 by mid-2026. The discount channel that defined the segment in late 2025 has largely closed; new SKUs are launching at retail prices that reflect the memory cost spike rather than the silicon cost.
The buyer who waits has clear options. Memory supply is forecast to normalize through late 2026 and into 2027 as AI infrastructure capacity grows and DRAM production scales. The M5 Mac Studio cycle in October will offer the first real refresh of the Apple unified-memory lineup since the March supply crunch. The Strix Halo 2 generation — AMD's next iteration of the unified-memory chiplet design — is expected in mid-to-late 2026. Waiting is free, and the cost of buying now versus buying in 2027 is meaningful enough to justify deliberate patience.
The deferred-buy hedge is real money saved. A buyer who needs the hardware for compliance reasons today has no choice. A buyer who wants the hardware on a discretionary timeline saves roughly $1,500-2,500 by waiting six to twelve months, conditional on the supply curve normalizing as forecast.
The waiting strategy also reduces the risk of buying into a transitional architecture. Strix Halo 1 and the M3-generation Apple Silicon are both single-generation products in the unified-memory-for-AI category. The architecture is still being refined; the second generation will likely have meaningfully better memory bandwidth (Strix Halo 2 is rumored to push to 320 GB/s; the M5 Ultra to 1 TB/s+) and better software support as the AMD and Apple inference paths mature. Buying the first generation locks in the bandwidth ceiling that determines tokens-per-second; buying the second generation gets a higher ceiling for the same money in nominal terms.
The operational layer is where most local setups fail
Most analysis of local AI hardware stops at the purchase decision. The interesting question is what happens in the weeks after the box arrives. The failure mode that defines this market isn't financial and it isn't capability — it's operational. The buyer gets the hardware. The hardware works. The buyer doesn't use it. The box gathers dust.
The friction trap kills decoupled setups
The single biggest predictor of whether a local AI setup gets used over six months is the friction relative to the cloud alternative. Not the model quality. Not the throughput. Not the specs. Friction.
The friction calculation is concrete. Cloud: open browser tab, type prompt. Done. Local, naive setup: open terminal, SSH into the box, check the model is running, configure the editor extension for the local endpoint, paste the prompt, wait for response, parse the response. Each of these steps adds seconds of overhead and cognitive switching cost. In aggregate, they put the local path at maybe 30 seconds of overhead to start a task that takes the cloud path 5 seconds.
For a single task, this difference is irrelevant. For the hundredth task in a day, it's the difference between using the local box and not. Behavior tracks friction. The buyer who saved $0.30 of API cost by routing this query to the local box also spent 25 seconds of overhead — at an hourly cost of attention that dominates the savings by orders of magnitude. After a week of this, the buyer is back to using the cloud for everything and the box is decorative.
The fix is operational, not architectural. Tailscale (or equivalent zero-config network mesh) so the box is addressable from anywhere on the developer's machines without VPN configuration. An editor extension already configured to point at the local endpoint on installation. A keybinding that switches between local and cloud routing without a settings dialog. A single primary workflow set up on day one — code completion in the editor, agentic task in the terminal, whichever fits — with no manual steps between intent and execution.
This setup takes a half-day if done deliberately. It takes never if put off. The biggest predictor of whether the half-day gets invested is whether it happens in the first week of ownership, while the box is novel and the buyer has motivation. After that window, the box gathers dust faster than the operational setup gets built. This is the friction trap, and it's the single most common reason expensive local hardware purchases fail.
The friction trap is also why "I'll set it up later" is the most expensive decision in the segment. The hardware is bought; the value is gated on the operational setup; the operational setup happens on day-one momentum or it doesn't happen at all. The financial cost of buying the box is sunk on day zero. The operational cost of activating it is the only variable; getting it wrong wastes the entire investment.
Test the model via API before buying hardware to run it
The cleanest hedge against buying hardware that won't get used is to use the model first. Every viable local coding model is available through a hosted API for $0.30-2 per million tokens — well below the cost of running them locally on hardware you don't yet own. Qwen3-Coder-Next, GLM-4.7, DeepSeek-V3.2, the smaller Qwen and Llama variants — they're all on Fireworks, Together, OpenRouter, Lambda, or the model authors' own endpoints.
Spend two weeks doing your real coding work with the model you'd run locally. Not the chat interface; the actual workflow you'd use with the local box. Editor extension pointed at the API. Agentic harness using the API for tool calls. Real codebase, real tasks, real volume.
At the end of two weeks, you'll know whether you'd actually use that model locally. If you kept falling back to Sonnet for hard tasks while using Qwen3-Coder for completions, that's exactly the hybrid workflow you'd run on local hardware — and you've now validated it without buying anything. If you kept using Sonnet for everything because the model quality difference frustrated you on every non-trivial task, you'd do the same thing with local hardware except you'd be $5,000 poorer and have an unused box on the desk.
The total cost of this test is small: two weeks of API usage of a model that's cheaper than Sonnet, plus the time to wire the editor and harness pointed at the alternative endpoint. Call it $50 of API spend and four hours of setup. The expected information gained — do I actually want this on my hardware — saves an order of magnitude more money than the test costs.
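If you want a crude paper trail of which model you actually reached for during the test, a harness like this is enough; both providers sit behind the same OpenAI-compatible interface, and the base URLs, keys, and model IDs are placeholders for whichever hosts you pick:

```python
# A/B the same real task against the hosted open-weight model and the
# frontier model, and keep a tally of which answer you actually used.
import datetime
import json
from openai import OpenAI

clients = {
    "open_weight": (OpenAI(base_url="https://openrouter.ai/api/v1", api_key="..."),
                    "qwen/qwen3-coder"),           # placeholder model ID
    "frontier":    (OpenAI(base_url="https://api.frontier.example/v1", api_key="..."),
                    "sonnet-latest"),              # placeholder endpoint and ID
}

def run_task(prompt: str) -> dict:
    answers = {}
    for name, (client, model) in clients.items():
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        answers[name] = resp.choices[0].message.content
    return answers

def log_choice(prompt: str, winner: str, path: str = "ab_log.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps({"ts": datetime.datetime.now().isoformat(),
                            "prompt": prompt[:120], "winner": winner}) + "\n")
# Two weeks of entries in ab_log.jsonl is the purchase decision.
```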
Most buyers skip this test because the hardware decision is more exciting than the workflow validation. The hardware decision involves spec sheets and product photos and review videos. The workflow validation involves spending two weeks doing your actual job with a slightly different setup. The hardware decision wins on the dopamine axis even when the workflow validation wins on every other axis.
The result is a market where buyers consistently spend $3,000-5,000 on hardware to run models they haven't validated they'll use, while $50 of API spending would have told them the answer in advance. This is the most expensive avoidable mistake in the segment.
The hybrid stack is the only architecture that survives
The recommendation that emerges from layering all of this is structural, not contrarian: a hybrid stack where local hardware handles volume work and cloud APIs handle difficulty work is the only architecture that's robust to the actual constraints. Local-only is wrong because the capability gap is real and the operational friction is high. Cloud-only is wrong because it pays for completions and boilerplate that could move local at lower cost. The hybrid stack does both, and it's robust to whichever direction the gap moves.
Local handles volume, cloud handles difficulty
The partitioning is concrete. Volume work is tasks where the model needs to be good enough, latency matters, privacy might matter, and per-task cost adds up. Cloud work is tasks where the model needs to be excellent, multi-turn reliability matters, and the marginal cost of the task is small relative to the value.
Volume work that fits local cleanly: code completion in the editor — single-line and multi-line suggestions, where the model has perhaps 8K tokens of relevant context and the right answer is one of a small set of plausible completions. Single-file edits with clear specification — refactor this function, add error handling here, generate docstrings for these methods. Test generation for code you've already written — given the function, generate the unit tests that exercise its branches. Boilerplate generation — config files, API client stubs from a spec, repetitive structure. Documentation drafts — summaries of modules, comments inferred from code, README scaffolding. Local search and explanation — what does this function do, where is this called from (combined with grep-style search, the model adds intent inference to the structural output).
Cloud work that local can't reliably do: multi-file refactors that span unclear boundaries — "rename this concept across the codebase" where the concept is more than a single identifier. Architecture decisions — "where should this new service fit in the existing system." Debugging across services — "the deploy is failing, figure out why." Code review on real PRs — "look at this diff and tell me what I missed." Anything that requires 50+ tool calls without drift.
The partitioning is not perfectly clean. Some tasks slide between categories depending on the codebase and the developer. The cleanest mental model is: if the task takes you, the human, more than ten minutes of thinking to specify clearly, it's probably a cloud task. If you can specify it in a sentence and verify the result in another sentence, it's probably a local task.
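The heuristic is simple enough to encode in whatever sits between you and the two endpoints. A sketch of the routing layer, with the classification left deliberately crude; the signal list and thresholds are assumptions you would tune to your own workflow:

```python
# Route by task shape: short, well-specified, single-file work goes local;
# anything open-ended, multi-file, or multi-step goes to the frontier model.
# The signal list and thresholds are crude placeholders to tune per workflow.

CLOUD_SIGNALS = ("figure out why", "across the codebase", "architecture",
                 "review this", "debug", "across services")

def route(task: str, files_touched: int) -> str:
    open_ended = any(s in task.lower() for s in CLOUD_SIGNALS)
    if open_ended or files_touched > 2 or len(task) > 600:
        return "cloud"   # difficulty work: multi-turn reliability matters
    return "local"       # volume work: good enough, fast, cheap

assert route("generate docstrings for these methods", files_touched=1) == "local"
assert route("the deploy is failing on staging, figure out why", files_touched=0) == "cloud"
```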
The partitioning also adapts over time. As open-weight models improve, the line between volume and difficulty work shifts toward local. As frontier closed models capture more reliable agentic execution, the line shifts toward cloud. The architecture is stable; the partitioning is dynamic. The trap is treating the partitioning as fixed at purchase time — locking in the assumption that this task will always be local — and then either over-investing local when cloud got better, or stubbornly using local when cloud got cheaper.
Match capability to current models, not future ones
The fallacy that drives 128GB-tier purchases in May 2026 is the bet on future models. The buyer reasons: today's best local model fits in 64GB, but tomorrow's might need 128, and I want to be ready. This reasoning gets the timing wrong in both directions.
The 128GB tier as of May 2026 is a bet on models that don't yet exist at usable quality. The models that fit in 128GB but not 64GB at competitive quantization are large MoEs in the 100B+ total parameter range — and none of them are yet good enough to justify the hardware. Qwen3-Coder-Next fits comfortably in 64GB at Q5-Q6. GLM-4.7 doesn't fit in 128GB at quality quantization (it requires 256GB+). DeepSeek-V3.2 needs server-class hardware. The window between "fits in 64GB" and "needs 256GB+" doesn't contain the model the buyer is imagining.
The bet on future models is also a bet against cloud improving. If the open-source frontier produces a 100B-class MoE that's actually competitive with frontier closed models in 2027, that release will also be available via API at $0.30-2 per million tokens — and the cloud frontier will have moved up in parallel. The relative gap between local and cloud is more stable over time than the absolute capability of either.
The right hardware sizing in May 2026 is to match the model class you'll actually use today. For most coding workflows, that's a 32B dense or 80B-A3B MoE class model running at Q5-Q6 quantization. The hardware that fits this is in the 48-64GB tier, available at $2,000-2,500 for a Mac mini M4 Pro or comparable Strix Halo box. The 128GB upsell is a bet on a future workload that hasn't materialized at quality.
This implies a refresh strategy: buy what you need now, plan to upgrade when the model class actually requires more memory or different bandwidth. The current generation will hold three to four years of useful life on the workloads it was bought for. Trying to future-proof past that window with extra memory you don't yet need is buying optionality that depreciates without being exercised.
The deepest hedge: assume the gap stays
The structural mistake that defines the local-AI-maximalist mindset is the assumption that the gap between local and cloud will close. The cleaner stance is the opposite: assume the gap stays, and build a workflow that's robust to that assumption being true.
If the gap stays, the hybrid stack is the optimal architecture. Local handles the volume of straightforward tasks at lower cost and latency; cloud handles the long tail of complex tasks where the quality difference matters. Both layers improve over time; the partitioning shifts toward local as open models improve and toward cloud as frontier models capture more of the difficulty curve, but the architecture itself is stable.
If the gap closes — open-weight models reach effective parity with frontier closed models — the hybrid stack is still optimal. More work moves to local; cloud spend goes down; the local hardware investment pays off harder. The stack adapts.
If the gap widens — frontier models continue to lap open-weight ones on agentic reliability — the hybrid stack is still optimal. The cloud share grows; local handles a narrower band of true volume work; the architecture remains coherent.
The only architecture that doesn't survive every scenario is local-only. Local-only requires the gap to close enough that even the difficulty work moves down — and that hasn't happened, isn't happening on current trajectory, and probably won't happen in the timeframe the local hardware retains its useful life.
The buyer who builds a workflow that needs local to match cloud has bought into a single-scenario architecture that's at the mercy of one specific future. The buyer who builds a hybrid workflow has bought into an architecture that's robust to all three. The deepest hedge against the local-AI bet is to not need the bet to pay off.
The trap that's hardest to avoid here is sunk-cost stubbornness. The buyer who has spent $5,000 on local hardware feels — reasonably — that they should use it. Using it means defaulting to local for tasks where cloud would do better. The default-to-local-because-I-bought-the-box reflex produces subtly worse work over weeks and months. The honest move is to use the local hardware where it serves the workflow and use the cloud where it serves the workflow, regardless of which one carries the upfront cost. The architecture is the architecture; the hardware purchase is the hardware purchase; conflating them produces worse decisions over the long run.
What to do this quarter
The concrete recommendations are short.
Test the model through an API before buying hardware to run it. Two weeks of using Qwen3-Coder-Next or a comparable open-weight model on your real codebase, through a hosted endpoint at $0.30-2 per million tokens, tells you whether the hybrid workflow you'd build is one you'd actually run. If you keep falling back to Sonnet for difficult tasks, the local setup would have done the same — except with $5,000 already spent. Do this before buying.
Don't buy in May 2026. The supply environment is the worst in years. Apple has pulled the 256GB and 512GB Mac Studio configurations and the 128GB option for the M3 Ultra; the M5 refresh is delayed to October 2026 with constrained memory; Strix Halo boxes that launched promising sub-$2,000 are retailing above $3,000. Waiting six to nine months for the M5 Studio cycle or the Strix Halo 2 generation, both of which are likely to land into a slowly-improving DRAM supply environment, saves real money.
When you do buy, decouple. A modest laptop you actually carry — a 16GB MacBook Air, a refurbished business laptop, whatever fits — and a dedicated AI box you SSH into. The AI box does not need a keyboard or monitor or portability. The laptop does not need 128GB of unified memory. The bundle in a $5,000 high-spec laptop is the depreciation trap.
Set up the operational layer in the first week. Tailscale, an editor extension, a single primary workflow, a keybinding that routes between local and cloud. The friction calculation determines whether the box gets used; the friction calculation is set in the first week of ownership and rarely revisited. Get it right early.
Match the model class you'll actually use, not the one you imagine. A 64GB Mac mini M4 Pro at $2,200 runs every coding-relevant open-weight model that's worth running today at quality quantization. The $5,000 128GB upsell is a bet on models that don't yet exist at usable quality. Don't make that bet.
Pick the quantization that preserves code quality. Q5_K_M minimum for serious coding; Q6_K or Q8_0 if memory allows. Q4 introduces subtle bugs that pass casual review and fail at runtime. The quality drop is invisible on aggregate benchmarks and visible in the specific failure modes that matter for production code.
Build a hybrid stack that doesn't depend on local matching cloud. Local handles volume; cloud handles difficulty; the architecture survives every direction the gap moves. The trap that destroys this is sunk-cost stubbornness — defaulting to local because the hardware exists. The architecture is the architecture; route work to the layer that serves it best.
What you're paying for at the top of the local AI market in 2026 is the feeling of having the best thing you can put on your desk right now. That feeling is worth real money to some people. For those people, the right answer is yes — buy the top-spec MacBook Pro, accept the bundling tax, enjoy the device. The wrong way to evaluate that purchase is on capability grounds, because Sonnet-class capability isn't on the menu locally at any price you'd put in a mini PC, and it won't be in the useful life of the hardware you'd buy.
For everyone else, the hybrid stack doesn't need that feeling to work. The architecture is robust. The economics favor decoupling. The model class supports a smaller box than the buyer's instinct suggests. The supply environment rewards waiting. The operational layer matters more than the spec sheet. The most expensive mistake in the category is consistently the one that started with the spec sheet.
You can't buy Sonnet. You can build a stack that doesn't need to. The hybrid stack doesn't need the feeling of having the best thing on your desk. You don't either.