Google DeepMind's AlphaProof Nexus Solves Decades-Old Math Problems for a Few Hundred Dollars

Google DeepMind just solved math problems that have stumped human mathematicians for over half a century, and the inference cost ran a few hundred dollars per problem. The system, called AlphaProof Nexus, autonomously solved 9 of the 353 open Erdős problems it attempted, including two questions that had gone unanswered for 56 years.

This isn't another benchmark-topping headline. It's a shift in how AI approaches formal reasoning — and a surprising result about whether complexity actually helps.

What AlphaProof Nexus Actually Did

The DeepMind team set their system loose on four classes of problems:

Erdős problems: 9 solved out of 353 attempted. Two of those had been open since 1970. The problems span combinatorics, number theory, and graph theory — areas where Paul Erdős, one of the most prolific mathematicians in history, left questions that generations of researchers couldn't close.

OEIS conjectures: 44 proved out of 492 open conjectures from the On-Line Encyclopedia of Integer Sequences. The OEIS catalogs integer sequences and their properties; proving conjectures about them is foundational work in discrete mathematics.

Hilbert functions: A 15-year-old question in algebraic geometry about the behavior of Hilbert functions under certain constraints — settled.

Convex optimization: An improved bound in a subfield where even marginal gains require months of human effort.

Inference costs, according to the research paper, ran "a few hundred dollars per problem." To put that in perspective: a single graduate student working full-time on one of these problems might spend months or years. The system spent hours and a fraction of a research budget.

How It Works: LLM + Compiler Feedback

AlphaProof Nexus isn't a monolithic model. It's a system architecture built around four agent variants of increasing complexity, all running on Gemini 3.1 Pro.

Agent (A) — the simplest — deploys independent sub-agents in loops. The LLM generates proof steps in Lean's formal language. The Lean compiler checks each step. Error messages feed directly back into the next attempt. The LLM gets grounded by symbolic feedback, not just its own reasoning.

Agent (B) adds queries to AlphaProof, DeepMind's reinforcement-learning system for olympiad math, to fill in missing proof segments.

Agent (C) introduces an evolutionary component inspired by AlphaEvolve. Sub-agents share a common population of proof sketches. Rating agents score these for plausibility and novelty, then rank them using an Elo system.

Agent (D) combines all capabilities — LLM loops, AlphaProof queries, and evolutionary search.
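The Elo ranking step in Agent (C) can be illustrated with the standard Elo update rule. This is only a sketch: the article does not describe the paper's exact rating scheme, so the K-factor of 32, the starting rating of 1000, and the sketch names and pairwise judgments below are all hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A in a comparison against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    The K-factor k is an assumed value, not one taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical population of proof sketches, all starting from the same rating.
ratings = {"sketch_1": 1000.0, "sketch_2": 1000.0, "sketch_3": 1000.0}

# Hypothetical pairwise judgments from a rating agent: (preferred, other).
judgments = [
    ("sketch_1", "sketch_2"),
    ("sketch_1", "sketch_3"),
    ("sketch_3", "sketch_2"),
]

for winner, loser in judgments:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser], 1.0)

# Highest-rated sketches come first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

A pairwise scheme like this only needs the rating agent to say which of two sketches looks more promising, which is an easier judgment than assigning an absolute score.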

Agent (D) was used for the Erdős problems. But the post-hoc analysis revealed something unexpected.

The Surprise: Simple Beat Complex

Agent (A), the simplest variant with just an LLM and compiler feedback, could also prove all nine solved Erdős problems. It cost more on the hardest ones, but it got there.

The researchers attribute this to two factors: rapid improvement in the underlying language models, and what they call "the power of compiler feedback in grounding LLM reasoning." The formal verification layer acts as a safety net that offsets the well-known logical weaknesses of language models.

"This points to a broader trend," the paper notes: "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable."

In other words: the fancy multi-agent, evolutionary, reinforcement-learning stack may not be necessary. A tight loop between an LLM and a compiler might be enough for a surprising range of problems.
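The shape of that loop can be sketched in a few lines. Everything below is a toy stand-in and not DeepMind's implementation: `toy_check` plays the role of the Lean compiler, `make_toy_proposer` plays the role of the LLM, and no real model or Lean toolchain is called. The function names and the `max_attempts` parameter are hypothetical.

```python
from typing import Callable, Optional, Tuple

# Toy stand-ins only: no real LLM or Lean toolchain is involved.
Checker = Callable[[str], Tuple[bool, str]]   # candidate -> (accepted, error message)
Proposer = Callable[[str, str], str]          # (goal, last error) -> candidate


def prove_with_feedback(goal: str, propose: Proposer, check: Checker,
                        max_attempts: int = 5) -> Optional[str]:
    """Generate a candidate, check it, and feed any error into the next attempt."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(goal, feedback)
        ok, message = check(candidate)
        if ok:
            return candidate
        feedback = message  # the checker's error grounds the next proposal
    return None


def toy_check(candidate: str) -> Tuple[bool, str]:
    """Stand-in for a proof checker: accepts exactly one 'proof' string."""
    if candidate == "exact Nat.add_comm a b":
        return True, ""
    return False, f"error: could not close goal with {candidate!r}"


def make_toy_proposer() -> Proposer:
    """Stand-in for an LLM: moves to its next guess whenever it receives an error."""
    guesses = ["wrong attempt 1", "wrong attempt 2", "exact Nat.add_comm a b"]
    state = {"i": 0}

    def propose(goal: str, feedback: str) -> str:
        if feedback:
            state["i"] = min(state["i"] + 1, len(guesses) - 1)
        return guesses[state["i"]]

    return propose


proof = prove_with_feedback("a + b = b + a", make_toy_proposer(), toy_check)
print(proof)
```

The key design point is that success is decided by the checker, never by the proposer, so a candidate is only returned once it has been verified.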

Why This Matters Beyond Math

Formal verification is about to get cheap. If a few hundred dollars of compute can settle open questions in pure mathematics, the same architecture could plausibly carry over to software verification, protocol auditing, and safety-critical system checks. Bugs that currently require manual code review might be caught automatically.

Lean as an interface language has arrived. Lean is a formal proof assistant used by mathematicians. The fact that an LLM can write valid Lean code well enough to prove original theorems means the barrier between natural language reasoning and formal verification is shrinking. This has implications for AI safety research, where formal specification of alignment properties has been a bottleneck.

The cost curve is the story. A few hundred dollars per proof is not free, but it's within the budget of any research lab, startup, or even a motivated independent researcher. Open mathematical questions are about to get a lot more crowded with attempts.
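For readers who have not seen Lean, here is roughly what a machine-checked statement and proof look like. This is a deliberately tiny Lean 4 example chosen for illustration, not one of the results from the paper; the theorem name `add_comm_example` is made up, while `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- A formal statement (commutativity of addition on natural numbers)
-- followed by a proof that the Lean checker either accepts or rejects.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

If the proof after `by` were wrong, Lean would reject the file with an error message, which is the kind of signal the agents feed back into their next attempt.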

What Google Isn't Saying

DeepMind's framing is careful. The paper emphasizes that these are pilot-scale results. The system failed on 344 of 353 Erdős problems — a 2.5% success rate. That's not a general-purpose mathematician. It's a specialized tool that happens to be very good at a specific type of problem.
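The success rates follow directly from the counts reported above. A quick check, using only the solved and attempted figures quoted in this article (the dictionary keys are just labels):

```python
# (solved, attempted) counts as quoted in the article.
results = {
    "Erdős problems": (9, 353),
    "OEIS conjectures": (44, 492),
}

rates = {name: solved / attempted for name, (solved, attempted) in results.items()}
unsolved_erdos = results["Erdős problems"][1] - results["Erdős problems"][0]

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"Unsolved Erdős problems: {unsolved_erdos}")
```

The OEIS rate works out to roughly 8.9%, higher than the Erdős figure but still a small minority of the conjectures attempted.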

The "few hundred dollars" figure also doesn't include the research and engineering cost to build the system, the pre-training of Gemini 3.1 Pro, or the infrastructure to run it at scale. It's the marginal cost per proof, not the total cost of the program.

And there's a selection effect: the researchers chose problems that Lean could express. Many open questions in mathematics don't have formalized statements yet. The system can't solve what it can't encode.

What Mathematicians Are Saying

The reception has been cautious but interested. MathOverflow and the Lean community Zulip channels have active threads parsing the proofs. Early reactions cluster around two points:

Positive: The proofs are valid. Lean-checked proofs are machine-verified; they're not subject to the "did the AI hallucinate this?" problem that plagues natural-language mathematical claims. The results are real.

Skeptical: Nine problems is not a revolution. Erdős posed thousands of problems. The system solved nine. The real test is whether the architecture scales to harder classes of problems — the ones that have resisted formalization for decades.

One mathematician on Zulip put it bluntly: "It's a very good theorem-proving tool. It's not a mathematician."

The Bottom Line

AlphaProof Nexus is the most credible demonstration yet that LLMs, when tightly coupled with formal verification, can produce original mathematical proofs. The 2.5% success rate on Erdős problems is not the headline — the fact that it's non-zero on problems this hard is.

The deeper implication is architectural. The simplest agent variant could prove all nine Erdős problems that the most complex one solved, suggesting that the real innovation is the feedback loop, not the multi-agent orchestration. If that holds, expect rapid improvement as language models get better, because a loop grounded in compiler feedback benefits directly from any gain in model capability.

For researchers and engineers: formal verification is about to become a standard tool, not a specialist niche. The question is whether the Lean ecosystem can grow fast enough to keep up with the demand this will create.

The Catch

It doesn't work everywhere. Agentic AI shines in structured workflows but struggles with ambiguous tasks requiring human judgment.

The setup is real work. Connecting agents to existing systems takes engineering time most teams underestimate.

Monitoring is harder. When something breaks, tracing the failure path across multiple agent steps isn't straightforward yet.