Erdős's Conjecture Fell to a Closed AI Loop. That's the Story.

On May 20, 2026, OpenAI published an eighteen-page PDF containing a proof that disproves a conjecture Paul Erdős posed in 1946. The closed-loop pipeline that produced it — AI-written prompt, AI-generated proof, AI-graded verification, human review only at the end — is the structural story the press coverage is missing.

On May 20, 2026, OpenAI published an eighteen-page PDF containing a proof that disproves a conjecture Paul Erdős posed in 1946. The conjecture concerned how fast the number of unit-distance pairs in a planar point set can grow with the number of points. For 80 years, mathematicians believed Erdős was right; the proof was the part missing. The new result establishes that Erdős was wrong by a margin Will Sawin later refined to δ ≈ 0.014: an honest, polynomial improvement over the conjectured upper bound, valid along an infinite sequence of point-set sizes.

This is, by any standard, a real mathematical result. Nine named mathematicians — Noga Alon at Princeton, Thomas Bloom at Manchester, Tim Gowers at the Collège de France (a Fields Medalist), Daniel Litt at Toronto, Will Sawin at Princeton, Arul Shankar and Jacob Tsimerman at Toronto, Victor Wang at NSTC Taiwan, and Melanie Matchett Wood at Harvard — verified it, then wrote a companion paper digesting the argument for the broader community. Gowers reviewed it. Bloom, who had publicly torn apart a previous AI math claim, endorsed this one. The argument connects discrete geometry to algebraic number theory via Golod–Shafarevich class field towers, and that connection, in retrospect, is the kind of move that wins prizes.

The press coverage has focused on the math. That coverage is missing the structural story.

The piece of the announcement worth dwelling on is not the proof. It is the loop that produced it: an AI wrote the problem statement; a general-purpose reasoning model produced the proof in one shot; an AI grading pipeline flagged the proof as likely correct; only then did human mathematicians enter to verify and rewrite for publication. The full generative-evaluative cycle of mathematical discovery ran without humans in the inner loop. This has not happened before in a public, verified, peer-reviewed math result. The loop, not the result, is what scales to other fields.

What 80 years of mathematicians did not actually miss

The standard framing of this result casts it as a problem too hard for humans, finally cracked by a sufficiently capable AI. That framing is not quite right. The unit-distance problem was not too hard. It was approached from a habit.

Erdős's own 1946 lower bound used the Gaussian integers ℤ[i]: take a √n × √n grid of points whose coordinates are Gaussian integers of bounded norm. A product of many rational primes p ≡ 1 (mod 4) has many representations as zz̄, which gives many lattice vectors of the same length. This produces n^(1 + Ω(1/log log n)) unit-distance pairs. The conjectured upper bound, that this is essentially optimal, fit the empirical pattern. Spencer–Szemerédi–Trotter proved the O(n^(4/3)) upper bound in 1984. Recent work by Alon, Bucić, and Sauermann (2025), and by Greilhuber, Schildkraut, and Tidor (2025), on generic norms in higher dimensions, hardened the intuition. The conjecture appeared safe.

The evidence base had been steadily accumulating. Matoušek (2011) proved O(n log n log log n) bounds for most planar norms. The Alon–Bucić–Sauermann result on a comeagre set of norms in ℝ^d gave d² · n · log²(n) unit distances on every n-point set. Greilhuber–Schildkraut–Tidor matched the lower bound for all norms in ℝ^d. Each of these results suggested the Euclidean case should behave similarly. The conjecture was widely believed to be true; the path to its disproof required entering a different mathematical universe entirely — algebraic number theory, where the relevant intuitions were so different from real-variable arguments that the cross-domain connection was easy to miss. The companion paper from the nine verifying mathematicians notes this directly: most experts believed the n^(1+o(1)) conjecture to be true.

In Diophantine language, the problem of constructing planar point sets with many unit distances is equivalent to finding number fields F in which the number of representations r_{2,F}(α) of an integer α as a sum of two squares grows rapidly. Erdős's grid corresponds to taking F = ℚ and using rational primes p ≡ 1 (mod 4). The natural generalization keeps F fixed and increases the size of α; in that regime, r_{2,F}(4D²) is O_{F,ε}(D^ε) for any fixed F, which is too slow to beat the conjecture. Several number theorists had tried something in this direction over the decades. None of these attempts produced a better bound; the gain from increasing the field's degree was cancelled by losses elsewhere.

The AI's approach turns this on its head: keep α bounded and grow F. Along an infinite tower of fields with bounded root discriminant, r_{2,F}(4D²) grows exponentially in [F:ℚ]. The same phenomenon shows up in the analogy with function fields 𝔽_q(t), where log q plays the role of [F:ℚ] — large-q analytic number theory has been a productive analogy in nearby areas for decades. None of this was applied to the unit-distance problem before.

This is the inversion. The model kept the parameter fixed and varied the field, letting the degree [K:ℚ] grow to infinity. The companion paper from the nine verifying mathematicians notes the novelty plainly: prior constructions of magnitude-one algebraic numbers in CM fields — including Akshay Venkatesh's celebrated sphere-packing work, Hendrik Lenstra's coding constructions, and Mark Kopp et al.'s equiangular lines work — all used number fields. The new ingredient was the limit. A novel ingredient of the AI argument is to take [K:ℚ] → ∞.

The model itself anticipated this in its chain of thought. The companion paper preserves the reasoning trace: "Number fields deserve a closer look." The six-word fragment captures the moment of strategic redirection. The model recognized that the natural geometric intuition — "more points means bigger region" — was a dead end, and pivoted to a higher-dimensional ambient arithmetic projected down. Mathematicians intuit through priors. The AI did not have the prior to suppress.

This is the kind of move that is easy to miss because it goes against the geometric grain. A working number theorist has eighty years of cumulative habit pointing at "bigger parameter," partly because that is what worked in the original Erdős construction, partly because most experts believed the n^(1+o(1)) conjecture was correct and so were tuning their attention toward proving it rather than disproving it. The model had no such tuning. It was given the problem cold, with no prior on what the answer should be.

The proof, deconstructed

The model's actual output is reproduced verbatim in the OpenAI PDF as the "Final Response from Internal Model." It is eight pages of dense mathematical prose. There are no false starts visible. The opening sentence selects the tool: a standard form of the Golod–Shafarevich argument with prescribed splitting.

The chain of reasoning that follows pulls together six advanced tools that are not normally combined for this purpose.

Cyclotomic class field theory constructs a totally real cyclic cubic field F over ℚ with controlled root discriminant, built from cubic subfields of cyclotomic fields ℚ(ζ_{r_i}) for primes r_i ≡ 1 (mod 3).

Shafarevich's relation-rank estimate bounds r(G) for the maximal everywhere-unramified pro-3 Galois group above F, where G is generated by at least ℓ−1 elements coming from the elementary abelian quotient (ℤ/3ℤ)^(ℓ−1).

Chebotarev's density theorem chooses rational primes q_1, …, q_t with three simultaneous properties: they split completely in F, they satisfy q_b ≡ 1 (mod 4), and their Frobenius classes lie in the Frattini subgroup Φ(G).

Frattini subgroup theory — specifically Burnside's basis theorem and the Hajir–Maire–Ramakrishna tower-cutting construction — quotients G by the closed normal closure of these Frobenius elements without lowering the generator count, adding only 3t new relations.

The Golod–Shafarevich inequality then concludes that the quotient G̅ is infinite. A finitely generated pro-p group with r ≤ d²/4 cannot be finite. The choice t ≈ ℓ²/100 makes the arithmetic work: 3t relations plus a linear correction stays well below d²/4 for ℓ large.

Minkowski's class-number bound, combined with the bounded root discriminant from the unramified tower, produces CM fields K_j = F_j(i) with class numbers at most exponential in [K_j:ℚ], where F_j is the j-th layer of the infinite unramified pro-3 tower in which every q_b splits completely.

That is the arithmetic. The geometric step is shorter. Embed K_j into ℂ^(f_j) via the Minkowski embedding. The magnitude-one elements u with uc(u) = 1 — where c is the nontrivial automorphism of K_j over F_j — become lattice translations whose every coordinate has modulus one. Choose a window of polydisc radius R. Average over cosets of the lattice via a standard unfolding argument. Pick the coset with the most pairs. Project to one complex coordinate. The projection is injective on the lattice — the kernel would be an algebraic integer with one conjugate equal to zero, hence zero. The resulting planar point set has n_j points and at least ½ · n_j · e^(γf_j/2) unit-distance pairs, where γ = t log 2 − log H_ℓ. Setting δ = γ/(4B) gives the polynomial improvement.

Several details are worth dwelling on. The model used p = 3 deliberately, noting in its own output that p = 2 would create technical complications at the dyadic prime — the kind of micro-optimization a research number theorist would make to anticipate edge cases. The Frobenius-cutting step in the model's original output was a finite-depth weaker version of the Hajir–Maire–Ramakrishna method; the companion paper notes that the full construction with infinitely many split primes was already in the literature and gives a stronger result. The model knew enough number theory to deploy the technique correctly, but used a slightly conservative version of it.

The honest characterization, from people who actually read the proof: the techniques in the model's output are not new. Golod–Shafarevich is from 1964. Hajir–Maire is from 2001. The Ellenberg–Venkatesh small-split-prime pigeonhole used in Lemma 2.1 of the companion paper is from the early 2010s. What is new is the application — using these tools to attack a discrete-geometry problem nobody had attacked this way. A number theorist looking at the proof would not say "this is a new method." They would say, in the companion paper's careful phrasing, that the argument relies crucially on ideas that may, in retrospect, be attributed to Ellenberg–Venkatesh, Golod–Shafarevich, and Hajir–Maire–Ramakrishna. It is a clever recombination.

The autonomy stack

What is qualitatively new is not the math. It is the production pipeline.

OpenAI's announcement made three structural claims about how the proof was produced, and each one matters. First, the problem statement was AI-written. The prompt reproduced verbatim in the PDF is unusually formal and rigid — it explicitly forbids partial progress, demanding either a complete proof of the Erdős upper bound or a complete disproof. The prompt reads like it was generated by an AI auto-formalizing the problem from natural-language descriptions in the literature. A human posing this problem would more likely tolerate intermediate progress, partial bounds, or special cases. The AI-written prompt did not.

The prompt's rigidity is worth examining. It demands a complete resolution either way, explicitly excluding finite verification, special cases, structural reductions, or heuristic evidence as acceptable partial progress. A human researcher posing the same question to a colleague would phrase it more openly, leaving room for incremental results — partial bounds, asymptotic improvements, related conjectures. An AI prompt-writer optimizing for clean evaluability would phrase it exactly as this one does — binary, complete, unambiguous. The prompt is a tell that the inner cycle was designed for AI-to-AI communication, not human-to-AI. The proof's success against this rigid framing matters precisely because such framings cut off the partial-credit paths human researchers usually take when they cannot solve the full problem.

Second, the proof was generated in one shot by a general-purpose reasoning model. Not AlphaProof. Not a Lean-tuned theorem prover. Not a system fine-tuned on the unit-distance problem. The model was being evaluated on a batch of Erdős problems, and this proof was one of the outputs. OpenAI noted explicitly that the model was not scaffolded with proof-search tools or targeted at this problem in particular. The fact that it produced a verifiable proof on this particular question is a downstream consequence of evaluating a sufficiently capable general model on a sufficiently large batch of open questions.

Third, and most easily overlooked: the proof was first checked by an AI grading pipeline, which indicated high confidence the solution was correct before any human looked at it. The PDF makes this sequence explicit: model generates, AI grades, then humans verify. The AI grading step is the part of the loop that has historically been the bottleneck for AI math. A model that can generate plausible-looking proofs is useless if humans must read every one to check correctness — the verification cost scales with the number of attempts, and proofs are slow to read. An AI that can pre-filter for likely-correct proofs collapses the verification cost from O(humans × proofs) to O(humans × promising proofs). The first cost is unbearable. The second is tractable.

The completed loop is the structural innovation. Prompt-generation → solution-generation → solution-verification, all running inside the AI stack without humans in the inner cycle. Humans entered the loop only after the AI grading pipeline flagged this particular proof as likely correct. From the moment the AI wrote the prompt to the moment the AI confirmed the proof was likely valid, the entire generative-evaluative cycle of mathematical discovery happened inside the model and its scaffolding.

This is the part that scales. The δ ≈ 0.014 improvement on the Erdős bound is a one-off — there is only one Erdős unit-distance problem, and now it is solved. The closed loop, on the other hand, can be pointed at the next problem, and the one after. If the loop's success rate on open problems is even one percent, the math community now has a way to grind through Erdős's roughly 1,500 published open problems faster than any human could ever do. Most attempts will produce nothing. Some will produce results like this one. The math itself becomes a population-statistical phenomenon, more like a high-throughput screening pipeline than the slow craft it has historically been.

A subtler implication: the AI grading step is itself an interesting object. The proof is correct, which means the grading pipeline's confidence assessment was correct. We do not know how the grading pipeline works internally, but its existence implies OpenAI has built a system that can evaluate the validity of mathematical arguments well enough to be useful as a pre-filter. That is its own capability claim, distinct from the proof generation. A grading pipeline that works on this problem might work on others.

Why AlphaProof and AlphaGeometry do not compare

There has been a steady accumulation of AI math results over the last two years, and the natural question is how this fits the trajectory. The honest answer is that it does not fit the trajectory. It is structurally different from what came before, and the differences are not subtle.

AlphaProof, the DeepMind system that scored a silver medal at the 2024 International Mathematical Olympiad with 28 out of 42 points, is a reinforcement-learning system that operates entirely inside Lean, a formal proof assistant. AlphaProof was trained on 80 million auto-formalized mathematical statements. Its proofs are mechanically verifiable because Lean's type-checker checks them — there is no need for a separate AI grading pipeline because Lean is the grader. At IMO 2024, the problems had to be manually translated into Lean by human experts, and the proofs had to be translated back out for human readers. AlphaProof took two to three days of computation per problem. The system is specialized for formal math, formally scaffolded, and human-mediated at the input and output boundaries.

The computation was not pure inference. AlphaProof used Test-time RL: at competition time, the system continued reinforcement-learning on self-generated variations of the problem, gradually building up enough proof traces in similar territory that one would eventually succeed. This is a fundamentally different shape of computation from one-shot proof generation. The two-to-three-day cost was the cost of training, in production, for each specific problem. The OpenAI Erdős model, by contrast, produced its eight-page proof in one shot from the prompt — no problem-specific training, no test-time learning, no scaffolded search.

AlphaGeometry, and its successor AlphaGeometry 2, is a hybrid neural-symbolic system specialized for Euclidean geometry. It pairs a neural language model that proposes auxiliary constructions with a symbolic deduction engine that follows standard geometric inference rules. The 2024 IMO geometry problem was solved by AlphaGeometry 2 quickly. The system has a constrained domain — points, lines, circles, angles — and a constrained inference engine. It works because the domain is small enough to be formally tractable. The natural-language problem statement still had to be translated into the system's internal representation by humans.

Gemini Deep Think, which scored gold-medal level at IMO 2025, was a step further along the trajectory: operating end-to-end in natural language without formal translation, finishing within the competition's 4.5-hour time window. But the IMO is still a constrained competition. Problems are vetted by humans, calibrated to be solvable in finite time by smart undergraduates, and have known-existence solutions. The model knows a solution exists; the only question is finding it.

The OpenAI Erdős result is none of those things. The model was a general-purpose reasoning system, not a specialized math system. It operated in natural language, not Lean. It received an AI-generated problem statement, not a human-curated one. The problem was a genuine open research question with no known answer — Erdős had asked it, and 80 years of mathematics had not given a clean answer either way. There was no time limit, no proof-search scaffolding, no formal verification engine, no Lean library to lean on. And the verification was AI-graded before human review.

This is the first AI math result that looks like real research rather than competition math or specialized theorem proving. It is therefore the first one that gives a credible signal about what closed-loop AI can do on the actual problem distribution working mathematicians face, where there is no oracle that says "a solution exists," no time limit imposed by competition rules, and no scaffolded formalization step. The trajectory from AlphaProof to AlphaGeometry to Gemini Deep Think to this is one of progressively removing scaffolding while expanding scope. The OpenAI result is the first point on the curve where almost all the scaffolding is gone.

The Co-Scientist parallel

The closed-loop frame becomes most legible when one looks sideways at Google DeepMind's Co-Scientist, which was published in Nature one day before OpenAI's announcement, on May 19, 2026. Co-Scientist is the biology version of the same idea: a multi-agent system that generates research hypotheses, runs an Elo-based tournament between them to surface the best, verifies hypotheses against literature and external databases like ChEMBL, UniProt, and AlphaFold, and produces final research proposals.

The case studies are concrete. A fibrosis drug candidate that blocked 91% of a scarring-linked response in cell culture at Stanford's Gary Peltz lab, published in Advanced Science. An integrated stress response hypothesis confirmed at Calico Life Sciences. An ALS-targeted RNA approach generated at MIT's Ritu Raman lab in collaboration with Ryan Flynn. A narrowing of zoonotic flu and COVID transmission to specific amino acids at Clare Bryant's Cambridge lab. These share a structural pattern with the Erdős result: AI generates the hypothesis, AI runs the verification loop against grounding data, humans enter only for final lab confirmation or theoretical review.

The two systems do different things. Co-Scientist generates and ranks hypotheses; the OpenAI model writes proofs from scratch. But they share the same underlying bet: the bottleneck in scientific discovery is not generation, it is the generation-verification cycle. Humans are too slow to read every hypothesis or every proof draft. An AI that can pre-filter, rank, and verify against grounding data shrinks the human input down to the things that survive automated quality checks. That collapse changes the economics of research.

Both systems also share the same caveat: they are heavily selected. Co-Scientist's six case studies are drawn from collaborations with more than 100 institutions; the failures are not in the press release. OpenAI's Erdős proof was one win from a batch of Erdős problems; the other batch members were not announced. The selection effects are real. But the existence of the loop is not undermined by selection — selection is precisely what the loop enables at scale. A pipeline that generates 1,000 hypotheses, scores them, surfaces 10, and finds 1 that holds up in the lab is exactly what closed-loop research looks like in practice.

The parallel suggests that the autonomy-stack framing extends beyond mathematics. The same shape — AI-generated problem, AI-generated candidate solution, AI-graded verification, human review only of likely-correct candidates — works in biology. It should work in any field where the candidate solutions can be checked against grounding data faster than they can be generated by humans. The Erdős result and the Co-Scientist result published within 24 hours of each other are not a coincidence; they are two instances of the same underlying capability becoming legible.

What still requires humans

The closed-loop frame does not imply that mathematicians are about to become obsolete. The Erdős proof actually clarifies what humans are still required for, and the answer is not what one might expect.

The first thing humans are still required for is the choice of which problem to attack. The Erdős unit-distance problem was selected by humans at OpenAI for evaluation — it is a well-known, well-defined, well-studied conjecture, with a clean yes/no answer, and a strong prior on what the answer should be. The model did not pick this problem; OpenAI did. Picking the right problem is the part of mathematical research that requires taste, judgment, and an opinion about what is important to know. The model does not yet have that opinion. The closest analog is the AI-written prompt, which formalizes a specific question — but the choice of which question to formalize in the first place was upstream of any AI step.

The second thing humans are still required for is the upper-bound side of the problem. The model disproved the lower-bound conjecture by constructing point sets that exceeded it. Proving the upper bound — showing that no point set can exceed some bound — is harder, because it requires reasoning about all possible configurations rather than producing one specific example. The Spencer–Szemerédi–Trotter O(n^(4/3)) upper bound has stood since 1984. The model has made no dent in it. Construction beats characterization, and the AI's relative strength has so far been in the construction direction. This is true across other AI math results as well: AlphaProof's IMO solutions were heavily weighted toward problems with constructive answers.

The third thing humans are still required for is judgment about what the result means. The model produced a correct proof. The companion paper from the nine mathematicians did three things the model did not: it generalized the argument, it placed the result in the broader landscape of class-field-tower constructions used in coding theory (Lenstra, Litsyn–Tsfasman), sphere packing (Venkatesh), and equiangular lines (Kopp et al.), and it offered nine separate reflections on what the proof says about the relationship between AI and mathematical practice. That contextualizing work is the part that turns a result into a contribution to the field. It is not yet automated.

These three roles — problem selection, asymmetric reasoning, contextual judgment — are likely to be the last parts of mathematical research to compress. They are also the parts working mathematicians spend most of their time on. The mechanical-grinding part of math is shrinking. The taste-and-judgment part is not.

The honest caveats

There are several things this result is not, and editorial honesty requires naming them.

It is not a proof that AI can do original mathematics in the deepest sense. The techniques the model used — Golod–Shafarevich towers, the Hajir–Maire method, Minkowski embedding, the Ellenberg–Venkatesh pigeonhole — were all available to working number theorists. The model combined them in a way humans had not combined them, which is genuinely interesting, but combination is not invention. A serious skeptic could say: the model did sophisticated mathematical bricolage, not original mathematics. The skeptic would not be wrong about the bricolage; what they would be missing is that almost all original mathematics, by working mathematicians, is also bricolage. The model did what humans do, on a problem humans had not solved, faster.

It is also not evidence that mathematicians had been stupid. The conjecture appeared safe for 80 years partly because the natural number-theoretic generalizations of Erdős's grid did not give better bounds, and partly because most experts believed the n^(1+o(1)) conjecture was true. The companion paper from the nine verifying mathematicians is quite direct about this: experts believed the conjecture, the construction is subtle, and the subtlety is real. Mathematicians missed this for the same reason scientists miss things in other fields — they were looking in the directions their training and intuition pointed. The AI's advantage was not raw intelligence. It was the absence of priors that locked human attention into certain paths.

The improvement is small. δ ≈ 0.014 is a polynomial improvement, which is qualitatively important — it disproves the conjectured form of the bound, not just the constants — but the actual gap between best lower bound and best upper bound, O(n^(4/3)) unchanged since 1984, is still gigantic. The Erdős upper-bound question is still wide open. Whoever closes that gap will be doing more mathematically substantial work than what was just done on the lower bound. The model has not made mathematicians obsolete; it has demonstrated capability on one direction of one specific problem.

The proof has been human-rewritten. The published 18-page PDF is expositionally refined through human interactions with Codex before reaching its final form. The original model output is preserved verbatim in Section 1, but the surrounding exposition is human-edited. The companion paper from the nine mathematicians is a human-digested, simplified, and somewhat generalized version of the AI proof. The output is therefore a hybrid artifact. The mathematical content is the AI's; the presentation is humans' work. This is not deception — OpenAI is forthright about the editing sequence — but it means the "AI wrote this proof" claim is more precisely "AI produced the mathematical content of this proof in one shot, and humans then refined the presentation."

There is a meta-caveat that deserves naming. We do not know what was in the model's training data. If the model had been trained on a corpus that included Hajir–Maire–Ramakrishna's 2021 work, Ellenberg–Venkatesh's small-split-prime pigeonhole, and the broader literature on Golod–Shafarevich tower constructions in coding theory and sphere packing, then "recognizing" that these tools applied to the unit-distance problem is a less impressive cognitive act than it might otherwise appear. OpenAI has not disclosed the training data composition for the model used. The closest the model came to a strategic insight was the chain-of-thought fragment about taking [K:ℚ] to infinity — but even that insight is downstream of seeing the technique in some other context. The model's contribution is real either way, but the magnitude depends on how much of the synthesis was actually novel inside the model's representational space versus interpolated from training data the model had seen but not previously connected to this problem. We do not have the means to answer that question from outside.

None of these caveats undermine the structural argument. They are, in some ways, the reason the structural argument matters. The model did not produce a finished paper. It produced a correct argument that humans could then verify and refine. The verification step and the refinement step are themselves bottlenecks that will be automated in the coming years; the OpenAI announcement showed the first piece of that automation, the AI grading pipeline. As more of the loop closes, the human role in mathematical discovery will compress further. The Erdős proof is the moment we can see the compression beginning.

Where this goes

Erdős famously said about hard problems: wait 100 years. This one took 80, and the last meter was crossed by a language model running overnight, then graded by another language model, then handed to humans for verification. The closed loop is the part that scales. The δ ≈ 0.014 improvement is the part that does not.

The math community now has a working demonstration that a general-purpose LLM, operating inside an AI-generated and AI-graded pipeline, can produce a correct proof of an 80-year-old open problem. There are many more 80-year-old open problems. There are many more fields where the bottleneck is the generation-verification cycle rather than the raw cognitive horsepower. The Erdős result is not the end of human mathematics; the upper bound is still open, and most of the interesting questions in discrete geometry remain. But it is the first credible signal that closed-loop AI research is becoming a real method, alongside the methods mathematicians have used for centuries.

The next few years will determine whether the loop holds across other fields. Co-Scientist's biology results are early but real; AxiomProver's Lean-verified Putnam solutions are early but real; the OpenAI Erdős result is early but real. Each one alone is a press release; together they are a pattern. The pattern is that the human role in scientific discovery is moving from inside the inner loop to the boundary, where verification, judgment, and choice of problem still require human attention but where the generative work is increasingly automated.

What working mathematicians and scientists should be watching for is not the next big result. It is the next big improvement to the grading pipeline. The bottleneck on closed-loop research is not how clever the model is; it is whether the verification step can keep up. OpenAI's announcement implicitly bet that AI grading will get better, and that the resulting compression of the human review burden will make population-scale generation tractable. If that bet is right, the next decade in mathematics looks different from the last one — not because humans stop doing math, but because the unit of mathematical labor stops being "one mathematician thinking about one problem" and becomes "one mathematician verifying ten AI-generated candidate proofs of ten different problems."

The next twelve months will be informative. If the OpenAI model — or its successors — produces verified results on a second open problem of comparable significance, the closed-loop frame becomes the operating consensus. If the next twelve months produce only press releases on smaller results, or one-off victories that do not reproduce, the closed-loop frame remains conjectural. Either outcome will be visible. The Co-Scientist Nature paper and the OpenAI Erdős announcement, within 48 hours of each other, are not yet a trend — they are two data points. Watch for the third.

Erdős was wrong by 0.014. The loop that proved it is the larger number.