Researchers from UC San Diego got large language models (LLMs) to play the popular fantasy tabletop roleplaying game Dungeons & Dragons, in an attempt to evaluate their performance over time.
Generally in AI research, the focus has been on evaluating models' performance in short-term tasks. But AI agents are increasingly being asked to perform tasks that require them to act independently, or semi-independently, over longer periods of time. This piece of research attempted to address that by monitoring several LLMs during a game of Dungeons & Dragons.
“Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy,” Raj Ammanabrolu, the study’s senior author and a faculty member in the Department of Computer Science and Engineering at UC San Diego, explained in a statement. “Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people.”
The team got the LLMs to work with a game engine programmed with the rules of Dungeons & Dragons. This engine provided the maps needed for the game, including where resources were located, in order to minimize AI "hallucinations". During the game, the AI agents would act as players, as well as the monsters being fought in the campaign, which focused solely on combat.
The idea has also been tested less methodically on YouTube.
As well as playing against themselves and fellow AI agents, the LLMs played against 2,000 experienced human players. They were evaluated on how well they kept track of what was going on (for example, the resources and actions available to them), on what actions they took during the game, and on how well they stayed "in character".
During the experiment, the LLMs were found to devolve into hammy performances, with Warlocks being particularly dramatic even when the situation didn't call for it, and Paladins making heroic speeches at inappropriate times. LLMs playing goblins started regurgitating irritating canned phrases like "heh — shiny man’s gonna bleed" during fights, and there were significant differences between the models.
"DeepSeek-V3 consistently produces short, first-person action beats and monster taunts (e.g., 'I dart left,' 'Get them!'); however, it tends to reuse the same few voices within a scenario, so the number of distinct traits stays narrow," the team explains in their report, adding that Claude Haiku 3.5 was better at modifying its speech to fit the character class.
"GPT-4o usually sits between these behaviors: it mixes vivid stage directions with more tactical or meta phrasing, so its persona density is middling while its trait variety is comparable to DeepSeek-V3."
Overall, the team found that the chatbots performed well, though they did struggle with long-term tasks.
"Our evaluation across six metrics reveals that large language models produced a promising result in rule-based conversation simulation. Smaller, open-source language models, however, are not yet capable of giving consistent simulation, which might be because their pre-trained tuning is different compared to the D&D simulation task," the team wrote in their paper. "All LLMs exhibit progressive degradation in long-horizon scenarios."
The team next plans to simulate a full Dungeons & Dragons campaign, rather than purely the combat element. Hopefully the goblins can sort out their dialog before that happens.
The work was presented at the NeurIPS conference in December 2025 and posted to OpenReview.





