Spiky Superhuman AI is here - what’s next?
Google DeepMind released AlphaEvolve and the results are “spectacular”: “I think AlphaEvolve is the first successful demonstration of new discoveries based on general-purpose LLMs.” AlphaEvolve has discovered a more efficient 4x4 matrix multiplication algorithm and a more efficient hexagonal packing, and delivered a 23% speedup across Gemini training kernels.
These are new discoveries! Almost by definition, these results are superhuman. The effects of these deployments are substantial: the 23% speedup on Gemini training kernels saved 1% of Gemini’s total training time. Runs of this scale reportedly cost tens to hundreds of millions of dollars in compute, so that 1% works out to >$1M in savings!
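The savings arithmetic checks out in a few lines; note the run cost below is my own low-end assumption taken from the “tens to hundreds of millions” range, not a reported figure:

```python
# Sanity check of the savings claim, using the article's rough figures.
training_time_saved = 0.01   # the 23% kernel speedup saved ~1% of total training time
run_cost = 100_000_000       # assumed LOW end of "tens to hundreds of millions" of dollars
savings = run_cost * training_time_saved
print(f"${savings:,.0f}")    # low-end estimate of compute saved on one run
```

At the high end of the cost range, the same 1% would be worth roughly $1–2M per run.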
You might quibble about the specific details. How much human effort has actually been deployed toward these problems? Do the results generalize to domains outside of math and computer science? You’ve probably used an AI tool that was an absolute waste of time. How does this all fit together?
These are valid questions, but despite them, I believe it’s clear that the era of general-purpose spiky superhuman AI (SSAI) is here.
What is SSAI?
First, let’s define spiky superhuman AI.
I’ll say that an AI system is superhuman on a specific set of tasks if it can outperform 99.99% of humans on that set of tasks. If you want to be conservative, you can require that it outperform every human alive today, but that doesn’t substantively change anything.
We already have superhuman AI systems:
Chess engines have been superhuman for decades.
AlphaGo has beaten world champions at Go since 2016.
o3 appears to beat humans at localizing images (e.g., GeoGuessr).
AlphaEvolve has found new advances in matrix multiplication and hexagon packing.
A general-purpose SSAI system is an AI system that is superhuman on a wide range of tasks using general-purpose AI techniques. Every general definition has fuzzy boundaries, but AlphaEvolve clearly fits: Google uses Gemini across a wide range of tasks spanning its entire business.
Finally, what’s spiky? Spiky means that progress is highly uneven between domains. Even though Gemini is superhuman in certain coding and math tasks, it can’t win the International Math Olympiad or cure cancer. It also can’t write, from start to finish, a literary masterpiece.
Today, these SSAI systems are trained with reinforcement learning - I’ll use the imprecise term reinforcement fine-tuning (RFT) to distinguish this from other forms of reinforcement learning (such as RLHF). RFT has already been shown to scale to any domain with a large supply of verifiable tasks - this is called reinforcement learning with verifiable rewards (RLVR).
What tasks can be easily verified? Games with simple win/loss conditions (Go, chess, etc.), coding challenges, and math problems with numeric answers or computationally gradable solutions (e.g., math competition questions) are all easy to verify. In fact, GeoGuessr is also easily verifiable, and it’s easy to generate hundreds of thousands of problems! These tasks have already fallen to the relentless progress of AI.
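To make “easily verifiable” concrete, here is a minimal sketch (my own illustration, not any lab’s actual reward code) of what verifiers for two of these task families could look like - a binary grader for numeric math answers and a distance-based grader for image localization:

```python
import math

def math_reward(submitted: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the numeric answer matches, else 0.0."""
    try:
        return 1.0 if math.isclose(float(submitted), float(ground_truth)) else 0.0
    except ValueError:
        return 0.0  # unparseable answers score zero

def geo_reward(guess: tuple, target: tuple, scale_km: float = 2000.0) -> float:
    """GeoGuessr-style reward: decays smoothly with great-circle distance."""
    lat1, lon1 = map(math.radians, guess)
    lat2, lon2 = map(math.radians, target)
    # Haversine formula for distance on a sphere with Earth's radius (~6371 km).
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371.0 * math.asin(math.sqrt(a))
    return math.exp(-dist_km / scale_km)
```

The point is that both verifiers are a few lines of deterministic code, so millions of model attempts can be graded cheaply and automatically - exactly the property RLVR exploits.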
Progress has been uneven, though - spiky, as I call it. AlphaEvolve and o3 still struggle with many economically productive tasks, including financial analysis.
What can we expect next?
We already have general-purpose SSAI and AlphaEvolve is proof of that. What happens next?
Frontier AI labs spend enormous amounts of money generating and labeling these tasks. Scale AI, a vendor for AI tasks and labels, had over $800 million in revenue last year and is on pace for even more this year (>$2 billion). If we ballpark a training data point at $100 (already ~1400x more expensive than a binary label for an image!) and assume each frontier lab spends ~$400M on tasks, that’s 4 million tasks - plenty to produce tens to hundreds of thousands of tasks in each of many domains (medical, legal, etc.).
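As a sanity check on those figures (the 40-domain split is my own hypothetical, not a number from any lab):

```python
# Back-of-envelope task count, using the article's ballpark figures.
cost_per_task = 100              # dollars per RFT training task (ballpark estimate)
lab_task_budget = 400_000_000    # assumed per-lab annual spend on tasks, dollars
total_tasks = lab_task_budget // cost_per_task
domains = 40                     # hypothetical number of target domains
tasks_per_domain = total_tasks // domains
print(total_tasks, tasks_per_domain)  # 4000000 100000
```

Even if the per-task cost is off by an order of magnitude, the budgets still support six-figure task counts per domain.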
If we extrapolate from AlphaEvolve and the progress from OpenAI’s o1 to o3, it’s safe to assume that enormous amounts of data have already been generated to train the next generation of models (Gemini 3+, OpenAI’s o4+). Expect to see these models become superhuman on a wide range of easily verifiable tasks, beyond what we’ve seen already. These tasks can be quite complex to solve, such as improving LLM training kernels.
Here’s my prediction: in the next 24-48 months, AI will be superhuman at nearly any task that can be easily verified and where lots of problems can be generated. This will likely include tasks from domains spanning medicine, legal, accounting, and many others.
What’s unknown is whether this progress will continue straight on to general superhuman AI systems.
Beware of RFT generalization
So far, I’ve made the case for progress in spiky SSAI systems. What about general-purpose SSAI?
The problem with these systems today is that RFT doesn’t seem to generalize the way pretraining does. RFT on verifiable math problems doesn’t generalize to proofs (o3 crushes AIME but flops the IMO), and more importantly, RFT on math doesn’t appear to generalize to other domains, like legal tasks. Although we don’t know exactly what data o1/o3/AlphaEvolve were trained on, this lack of generalization has been anecdotally confirmed by Sam Altman.
However, algorithmic progress has been making incredible strides. Once we see this kind of generalization - within a domain across different tasks, and then across domains - we’re likely to see bootstrapping straight to general SSAI. Watch for signs of this.