Scientists Forced AI Language Models To Play Dungeons & Dragons To See How Well They Concentrate

The performances were hammy, and the goblins incredibly irritating.

James Felton

Senior Staff Writer

James is a published author with multiple pop-history and science books to his name. He specializes in history, space, strange science, and anything out of the ordinary.

Edited by Tom Leslie

Researchers from UC San Diego got large language models (LLMs) to play the popular fantasy tabletop roleplaying game Dungeons & Dragons, in an attempt to evaluate their performance over time.

AI research has generally focused on evaluating models on short-term tasks. But AI agents are increasingly being asked to perform tasks that require them to act independently, or semi-independently, over longer periods of time. This research attempted to address that gap by monitoring several LLMs during a game of Dungeons & Dragons.

“Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy,” Raj Ammanabrolu, the study’s senior author and a faculty member in the Department of Computer Science and Engineering at UC San Diego, explained in a statement. “Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people.”

The team got the LLMs to work with a game engine programmed with the rules of Dungeons & Dragons. This engine provided the maps needed for the game, including where resources were located, in order to minimize AI "hallucinations". During the game, the AI agents would act as players, as well as the monsters being fought in the campaign, which focused solely on combat.

The idea has also been tested less methodically on YouTube.

As well as playing against themselves and fellow AI agents, the LLMs played against 2,000 experienced human players. They were evaluated on how well they kept track of what was going on (for example, the resources and actions available to them), on the actions they took during the game, and on how well they stayed "in character".

During the experiment, the LLMs were found to devolve into hammy performances, with Warlocks being particularly dramatic even when the situation didn't call for it, and Paladins making heroic speeches at inappropriate times. LLMs playing goblins started regurgitating irritating canned phrases like "heh — shiny man’s gonna bleed" during fights, and there were significant differences between the models.

"DeepSeek-V3 consistently produces short, first-person action beats and monster taunts (e.g., 'I dart left,' 'Get them!'); however, it tends to reuse the same few voices within a scenario, so the number of distinct traits stays narrow," the team explains in their report, adding that Claude Haiku 3.5 was better at modifying its speech to fit the character class.

"GPT-4o usually sits between these behaviors: it mixes vivid stage directions with more tactical or meta phrasing, so its persona density is middling while its trait variety is comparable to DeepSeek-V3."

Overall, the team found that the chatbots performed well, though they did struggle with long-term tasks.

"Our evaluation across six metrics reveals that large language models produced a promising result in rule-based conversation simulation. Smaller, open-source language models, however, are not yet capable of giving consistent simulation, which might be because their pre-trained tuning is different compared to the D&D simulation task," the team wrote in their paper. "All LLMs exhibit progressive degradation in long-horizon scenarios."

The team next plans to simulate a full Dungeons & Dragons campaign, rather than purely the combat element. Hopefully the goblins can sort out their dialog before that happens.

The work was presented at the NeurIPS conference in December 2025 and posted to OpenReview.
