A Chatbot Just Solved A Math Problem That Has Stumped Humans For Decades

Mathematica ex machina.


Dr. Katie Spalding


Freelance Writer

Katie has a PhD in maths, specializing in the intersection of dynamical systems and number theory.

Edited by Laura Simmons
Laura Simmons - Editor and Staff Writer


Laura is an editor and staff writer at IFLScience. She obtained her Master's in Experimental Neuroscience from Imperial College London.

little robot with lots of exposed gears and springs looks proud as he points a wooden pointer towards a chalkboard with a solved math problem written on it

Look, he's so proud!

Image credit: Besjunior/

Lazy students the world over may have just got the loophole they were waiting for when it comes to using ChatGPT for their math homework: it turns out, even the researchers at Google do it. And the really impressive part? It looks like the artificial intelligence (AI) might have outperformed its creators.

What is the breakthrough?

“When we started the project there was no indication that it would produce something that’s genuinely new,” Pushmeet Kohli, the head of AI for science at Google’s DeepMind, told the Guardian. “As far as we know, this is the first time that a genuine, new scientific discovery has been made by a large language model.”


That’s right: according to the engineers in Google’s AI department, a chatbot is now one of the leading minds in the notoriously annoying mathematical field of combinatorics. It was only meant to be a proof-of-concept at first – the real breakthrough was a new algorithm that the team have dubbed FunSearch – but instead, the AI went ahead and found solutions to open problems that were better than any previously found.

“FunSearch discovered new solutions for the cap set problem, a longstanding open problem in mathematics,” wrote Alhussein Fawzi and Bernardino Romera Paredes in a blog post for DeepMind.

“The problem consists of finding the largest set of points (called a cap set) in a high-dimensional grid, where no three points lie on a line,” they explained. 

Perhaps an example will help here. In the game Set (no relation), 12 cards are dealt, each marked with a unique combination of shape, color, shading, and quantity. Players then aim to find a set of three cards in which each of those features is either the same on all three or different on all three – for example, a card with one red solid diamond, another with two blue striped diamonds, and a third with three green empty diamonds, would form a set, because all have diamonds, but the colors, shading, and number of diamonds on each are all different.


If nobody can spot one of these sets from the 12 cards on the table – which is perfectly possible – then more cards are laid out until one is found. And because mathematicians are tricky bastards, somebody decided to ask how many cards can be dealt before a set has to be there – or, in math-speak, what the maximum size of a cap set in ℤ₃⁴ is.

Now, that particular problem was solved in 1971 – the answer is 20, by the way – but with more features, things get much more difficult. As is depressingly common in combinatorics, the number of potential solutions grows super-fast – you only have to get as far as eight features, where there are 3⁸ = 6,561 possible “cards”, before you’re dealing with a search space of something like 3¹⁶⁰⁰ candidates.
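To see why brute force runs out of road so quickly, here is a minimal sketch that checks every possible collection of points. It treats each card as a point whose coordinates are 0, 1, or 2, and relies on the fact that three distinct points of this grid lie on a line exactly when their coordinates add up to a multiple of 3 in every position. The function names are hypothetical, and the exhaustive search is only practical for the tiniest grids.

```python
from itertools import combinations, product

def is_cap_set(points):
    """True if no three distinct points lie on a line, i.e. no three sum to 0 mod 3."""
    return not any(
        all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c))
        for a, b, c in combinations(points, 3)
    )

def largest_cap_set_size(n):
    """Try every subset of the n-feature grid, biggest first. Hopeless beyond tiny n."""
    grid = list(product(range(3), repeat=n))
    for size in range(len(grid), 0, -1):
        if any(is_cap_set(subset) for subset in combinations(grid, size)):
            return size
    return 0

print(len(list(product(range(3), repeat=4))))  # 81 cards in the four-feature game
print(largest_cap_set_size(2))                 # 4
```

Two features give only 9 points and 512 subsets to check. Four features give 81 points and 2⁸¹ subsets, which is already far too many to check one by one.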

Unsurprisingly, humans haven’t solved that one yet – because, well, why would you even try? More than that: how would you even try? That’s not rhetorical, by the way: mathematicians don’t agree on the best way to attempt the cap set problem for n = 8 (that is, eight features), let alone what the answer actually is.

Which is why it’s so remarkable that Google’s AI appears to have made headway on it, with a hitherto unknown cap set of size 512 – larger than any found before.


“This is the first time anyone has shown that an LLM-based system can go beyond what was known by mathematicians and computer scientists,” Kohli told Nature. “It’s not just novel, it’s more effective than anything else that exists today.”

How to train your chatbot

It’s big news, assuming it holds up. Large language models, or LLMs, are the neural networks that underpin all those chatbots that have recently proven so popular and terrifying. While there’s been a lot of noise about how they’re about to make all creatives unemployed and humans no longer need to make art or music or any of the wonderful things that kind of define us as a species, the truth is that LLMs are nowhere near sophisticated enough to pull an Ex Machina or an I, Robot – they work by basically scraping vast amounts of human-generated text and data and repackaging it in an uncannily realistic style.

It's actually a major problem, and not just because of all the real artists getting ripped off by the bots. The LLMs that power these chatbots aren’t focused on what’s true or not, but on finding patterns in speech and text – in other words, they often provide answers that sound like they make sense, but are functionally garbage.

So how did the DeepMind researchers avoid this problem in their mathematical ventures? Well, in a way, they didn’t. Instead, FunSearch – which is named for its ability to search the function space, if you’ve been wondering what about extremal combinatorics is such a hoot – combines two different programs: the first is Google’s LLM-based coding model Codey, which can prompt and generate code for developers; the second is an algorithm to check and score what Codey came up with.


It went like this: the team would write a piece of code to solve the math problem, but leave out the lines that actually told the program how to do it. Codey would then come in and suggest what those lines should be. The second algorithm would then essentially mark Codey’s work, and send it back for review.

“Many will be nonsensical, some will be sensible, and a few will be truly inspired,” Kohli told MIT Technology Review. “You take those truly inspired ones and you say, ‘Okay, take these ones and repeat.’”
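The propose, score, and repeat loop described above can be sketched in miniature. This is not DeepMind's code: `propose` is a hypothetical stand-in that makes a random tweak where the real system would ask Codey for new lines, the "program" being evolved is reduced to a small table of weights, and a single best candidate is kept from round to round. The fixed skeleton, `build_cap_set`, greedily adds grid points in the order the evolved priority dictates; every name here is an assumption made for illustration.

```python
import random
from itertools import combinations, product

def is_cap_set(points):
    """Evaluator helper: True if no three distinct points sum to 0 mod 3 (lie on a line)."""
    return not any(
        all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c))
        for a, b, c in combinations(points, 3)
    )

def build_cap_set(n, priority):
    """Fixed skeleton: visit points in priority order, keep each one that completes no line."""
    points = sorted(product(range(3), repeat=n), key=priority, reverse=True)
    chosen, blocked = [], set()
    for p in points:
        if p in blocked:
            continue
        # Keeping p rules out the third point on the line through p and each kept q.
        for q in chosen:
            blocked.add(tuple((-a - b) % 3 for a, b in zip(p, q)))
        chosen.append(p)
    return chosen

def make_priority(weights):
    """The 'missing lines' of the program: here, one weight per (feature, value) pair."""
    return lambda p: sum(weights[i][v] for i, v in enumerate(p))

def propose(weights, rng):
    """Hypothetical stand-in for the LLM: randomly tweak one weight of the best candidate."""
    new = [row[:] for row in weights]
    new[rng.randrange(len(new))][rng.randrange(3)] += rng.uniform(-1.0, 1.0)
    return new

def evaluate(n, weights):
    """Score a candidate: the size of the cap set it builds, or 0 if the result is invalid."""
    cap = build_cap_set(n, make_priority(weights))
    return len(cap) if is_cap_set(cap) else 0

def search(n, rounds=200, seed=0):
    rng = random.Random(seed)
    best = [[0.0] * 3 for _ in range(n)]
    best_score = evaluate(n, best)
    for _ in range(rounds):
        candidate = propose(best, rng)
        score = evaluate(n, candidate)
        if score >= best_score:  # keep the promising ones and repeat
            best, best_score = candidate, score
    return best, best_score

weights, score = search(n=4)
print(score)  # size of the best cap set this toy search found for four features
```

The point of the design is that the evaluator never has to trust the proposer: a nonsensical suggestion simply scores badly and is thrown away, which is how an error-prone generator can still drive a reliable search.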

And, apparently unsatisfied with besting its human overlords in just one longstanding mathematical puzzle, FunSearch then got to work on another one: the so-called “bin packing problem”. 


“Encouraged by our success with the theoretical cap set problem, we decided to explore the flexibility of FunSearch by applying it to an important practical challenge in computer science,” wrote Fawzi and Paredes. “The ‘bin packing’ problem […] sits at the core of many real-world problems, from loading containers with items to allocating compute jobs in data centers to minimize costs.”

The bin packing problem is precisely what it sounds like: it’s the question of how to best pack items into bins or containers in a way that minimizes the number of bins needed. Despite this apparent simplicity, though, it’s even worse than the cap set problem in terms of computational complexity – it’s NP-hard rather than NP-complete, for those interested in the technical jargon. 

But “despite being very different from the cap set problem, setting up FunSearch for this problem was easy,” Fawzi and Paredes reported. “FunSearch delivered an automatically tailored program (adapting to the specifics of the data) that outperformed established heuristics – using fewer bins to pack the same number of items.”
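For a sense of what a packing heuristic looks like, here is a minimal sketch of two textbook rules, first fit and best fit, written as interchangeable priority functions plugged into the same packing loop. It assumes items arrive one at a time and must be placed immediately. The function names and the sample items are made up for illustration, and the program FunSearch actually discovered is not reproduced here.

```python
def pack(items, capacity, priority):
    """Place each item in the open bin with the highest priority among those it fits in."""
    bins = []  # remaining free space in each open bin
    for item in items:
        candidates = [i for i, free in enumerate(bins) if free >= item]
        if candidates:
            best = max(candidates, key=lambda i: priority(item, bins[i], i))
            bins[best] -= item
        else:
            bins.append(capacity - item)  # nothing fits, so open a new bin
    return len(bins)

def first_fit(item, free, index):
    """Prefer the earliest-opened bin that has room."""
    return -index

def best_fit(item, free, index):
    """Prefer the bin that would be left with the least free space."""
    return -(free - item)

items = [6, 7, 3, 4]
print(pack(items, 10, first_fit))  # 3
print(pack(items, 10, best_fit))   # 2
```

Swapping one small function changes the bin count, and that small function is the kind of slot a search over programs can fill.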

The limits of LLMs

While the ramifications of DeepMind’s breakthroughs are incredible, working mathematicians probably shouldn’t be worrying about their job security just yet. FunSearch, for now, is limited to problems that satisfy a certain set of criteria – they have to be able to be evaluated and scored easily and efficiently, and they need to follow the same “fill in the missing code” trick that the team used in the cap set and bin packing problems. Generating proofs, for example, would be way too hard for the AI right now, the researchers note, since you can’t grade things like that in a way that would make sense for a computer.


Nevertheless, it’s a brave new world out there – and there’s no telling what longstanding puzzle will topple next. 

“What I find really exciting, even more so than the specific results we found, is the prospects it suggests for the future of human-machine interaction in math,” Jordan Ellenberg, professor of mathematics at the University of Wisconsin-Madison and co-author on the paper, told the Guardian.

“Instead of generating a solution, FunSearch generates a program that finds the solution. A solution to a specific problem might give me no insight into how to solve other related problems. But a program that finds the solution, that’s something a human being can read and interpret and hopefully thereby generate ideas for the next problem and the next and the next.”

The study is published in the journal Nature.

