Artificial intelligence (AI) researchers have created what they are calling "Humanity's Last Exam" in an attempt to benchmark the progress of large language models (LLMs). Among the top current models, there is a clear frontrunner.
Though there are still plenty of problems with LLMs if you are a fan of accuracy or ethics, there's no denying they have progressed astonishingly quickly over the last decade. The Turing test, created by renowned mathematician and computer scientist Alan Turing in 1950, looks to see if a computer can fool a human into believing that it too is an ordinary human being. That has likely been beaten by LLMs.
"Participants in our experiment were no better than chance at identifying GPT-4 after a five minute conversation, suggesting that current AI systems are capable of deceiving people into believing that they are human," a 2025 paper explains. "The results here likely set a lower bound on the potential for deception in more naturalistic contexts where, unlike the experimental setting, people may not be alert to the possibility of deception or exclusively focused on detecting it."
While beating the Turing test doesn't mean that AI is conscious, it does mean that the test may be a little useless to AI researchers. There are other ways of assessing their performance, such as the Massive Multitask Language Understanding benchmark, which tests the "knowledge" of LLMs – or their ability to put forward the correct answer when prompted – across a broad range of subjects. But this too is no longer cutting it.
"Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities," the creators of Humanity's Last Exam (HLE) explain in their paper. "However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities."
The new benchmark contains 2,500 questions across a broad range of subjects, including science and the humanities, developed by subject experts from around the world. Each question is designed to have an unambiguous answer that can be automatically graded and, crucially, cannot easily be retrieved from the Internet. In general, the questions were designed to require graduate-level knowledge of the subject area. The models were also asked to state how confident they were in each answer, so that the researchers could compare stated confidence with actual performance.
"Among the diversity of questions in the benchmark, HLE emphasizes world-class mathematics problems aimed at testing deep reasoning skills broadly applicable across multiple academic areas," the team adds, with math making up 41 percent of the questions.
The team evaluated current LLMs against Humanity's Last Exam to test the effectiveness of the exam as well as the models taking it. So, how did they do? Not great.
"All frontier models achieve low accuracy on HLE, highlighting substantial room for improvement in narrowing the gap between current LLMs and expert-level academic capabilities on closed-ended questions," the team writes, adding, "these low scores are partially by design: the dataset collection process attempts to filter out questions that existing models can answer correctly."
As well as finding that the LLMs performed poorly, the team found that many of them were overconfident in their answers.
"The stated confidence of a well-calibrated model should match its actual accuracy," the team explains, "for example, achieving 50% accuracy on questions in which it claims 50% confidence."
GPT-4o, for example, achieved an accuracy of around 2.7 percent with a calibration error of 89 percent, marking it out as a heavily overconfident model. Later models performed better, with Gemini 2.5 Pro being accurate around 22 percent of the time, and GPT-5 around 25 percent of the time. These models were still too confident, however, with calibration errors of 72 percent and 50 percent, respectively.
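To make the idea of calibration error concrete, here is a minimal Python sketch. It assumes a binned root-mean-square definition: answers are grouped by stated confidence, and each group's average confidence is compared with its actual accuracy. The article does not say exactly which formula the team used, so the function name, the ten-bin default, and the sample data below are illustrative assumptions rather than the researchers' own method.

```python
from math import sqrt


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence and actual accuracy.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was graded as right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group answers into equal-width bins by stated confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the top bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for group in bins:
        if not group:
            continue
        mean_conf = sum(c for c, _ in group) / len(group)
        accuracy = sum(o for _, o in group) / len(group)
        # Weight each bin's squared gap by its share of all answers.
        weighted_sq_gap += (len(group) / total) * (mean_conf - accuracy) ** 2
    return sqrt(weighted_sq_gap)


# Hypothetical model A: claims 50% confidence and is right half the time.
well_calibrated = rms_calibration_error([0.5] * 4, [True, False, True, False])

# Hypothetical model B: claims 90% confidence but gets only 1 answer in 10 right.
overconfident = rms_calibration_error([0.9] * 10, [True] + [False] * 9)
```

Under these assumptions, the first hypothetical model scores a calibration error of zero, while the second scores about 0.8, or 80 percent: the same kind of gap between confidence and accuracy that the reported figures describe.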
In updated tests published to the Humanity's Last Exam website, the Gemini 3.1 Pro model achieved 45.9 percent accuracy, with a 50.3 percent calibration error, taking the spot as the top-performing model. GPT-5.2 also improved, with 36.6 percent accuracy and a slightly higher calibration error of 55.1 percent.
Those scores may not sound great, but given that the researchers were specifically looking for a tougher way to benchmark the models, this isn't a bad thing.
"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," Dr Tung Nguyen, Instructional Associate Professor of Computer Science and Engineering at Texas A&M University College of Engineering, said in a statement. "Benchmarks provide the foundation for measuring progress and identifying risks."
The exam is publicly available, though the answers are hidden from any AI thinking of simply Googling the correct response.
The study is published in Nature.





