Published January 22, 2026

Scientists Forced AI Language Models To Play Dungeons & Dragons To See How Well They Concentrate

The performances were hammy, and the goblins incredibly irritating.

James Felton

Senior Staff Writer

James is a published author with multiple pop-history and science books to his name. He specializes in history, space, strange science, and anything out of the ordinary.

Edited by Tom Leslie
Editor & Staff Writer

Tom has a master’s degree in biochemistry from the University of Oxford and his interests range from immunology and microscopy to the philosophy of science.

To be fair, many humans struggle to concentrate during the game too.

Image credit: Esther H. Derksen/Shutterstock


Researchers from UC San Diego got large language models (LLMs) to play the popular fantasy tabletop roleplaying game Dungeons & Dragons, in an attempt to evaluate their performance over time.

AI research has generally focused on evaluating how models perform on short-term tasks. But AI agents are increasingly being asked to perform tasks that require them to act independently, or semi-independently, over longer periods of time. This research attempted to address that gap by monitoring several LLMs during games of Dungeons & Dragons.

“Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy,” Raj Ammanabrolu, the study’s senior author and a faculty member in the Department of Computer Science and Engineering at UC San Diego, explained in a statement. “Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people.”

The team got the LLMs to work with a game engine programmed with the rules of Dungeons & Dragons. This engine provided the maps needed for the game, including where resources were located, in order to minimize AI "hallucinations". During the game, the AI agents would act as players, as well as the monsters being fought in the campaign, which focused solely on combat.

The idea has also been tested less methodically on YouTube.

As well as playing against themselves and fellow AI agents, the LLMs played against 2,000 experienced human players. They were evaluated on how well they kept track of what was going on (for example, the resources and actions available to them), on the actions they took during the game, and on how well they stayed "in character".

During the experiment, the LLMs were found to devolve into hammy performances, with Warlocks being particularly dramatic even when the situation didn't call for it, and Paladins making heroic speeches at inappropriate times. LLMs playing goblins started regurgitating irritating canned phrases like "heh — shiny man’s gonna bleed" during fights, and there were significant differences between the models.

"DeepSeek-V3 consistently produces short, first-person action beats and monster taunts (e.g., 'I dart left,' 'Get them!'); however, it tends to reuse the same few voices within a scenario, so the number of distinct traits stays narrow," the team explains in their report, adding that Claude Haiku 3.5 was better at modifying its speech to fit the character class.

"GPT-4o usually sits between these behaviors: it mixes vivid stage directions with more tactical or meta phrasing, so its persona density is middling while its trait variety is comparable to DeepSeek-V3."

Overall, the team found that the chatbots performed well, though they did struggle with long-term tasks.

"Our evaluation across six metrics reveals that large language models produced a promising result in rule-based conversation simulation. Smaller, open-source language models, however, are not yet capable of giving consistent simulation, which might be because their pre-trained tuning is different compared to the D&D simulation task," the team wrote in their paper. "All LLMs exhibit progressive degradation in long-horizon scenarios."

The team next plans to simulate a full Dungeons & Dragons campaign, rather than purely the combat element. Hopefully the goblins can sort out their dialog before that happens.

The work was presented at the NeurIPS conference in December 2025 and posted to OpenReview.

