A team of researchers may have found a way to improve large language model (LLM) chatbots, including raising GPT-4's accuracy on coding tasks by around 21 percentage points. In a new preprint paper, which has yet to be peer-reviewed, the team explains how they achieved it: by allowing artificial intelligence (AI) agents to reflect on their own mistakes.
The team used a process called Reflexion, which "endows an agent with dynamic memory and self-reflection capabilities to enhance its existing reasoning trace and task-specific action choice abilities", according to their paper.
"Human intelligence is notable for its ability to learn from mistakes," the team explained on Substack. "We often don't solve problems on our first try, but when we make mistakes we generate new ideas to refine our approach through self-reflection, through analyzing our missteps."
They tried to replicate this to an extent by allowing the AI agents to analyze their own actions and mistakes. In the research, AI agents were challenged to solve various problems, from coding to a trial in AlfWorld, a text-based environment that is used to train and test AI agents. In AlfWorld, the agent was asked to complete a number of tasks, but the only way to do so was to explore its environment through text and receive observations in return, as in a text adventure game.
Without the reflective technique, the agent achieved 63 percent accuracy in AlfWorld. When given the ability to reflect on its actions and mistakes, it achieved 97 percent accuracy, solving 130 out of 134 tasks.
In one task, the language model was asked to answer the question "Grown-Ups starred the actor who was best known for which role on ’Allo ’Allo!?" The model first searched for Grown-Ups to view a cast list, and then for ’Allo ’Allo! to cross-reference. It did not get the cast list it needed and failed the task.
"I searched the wrong title for the show, ’Allo ’Allo!," the AI explained in its reflection, "which resulted in no results. I should have searched the show’s main character, Gorden Kaye, to find the role he was best known for in the show."
After this reflection, the agent was given the task again. This time it applied what it had learned and finished the task in fewer steps, getting the answer correct.
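The attempt, reflect, and retry cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the researchers' implementation: the function `run_with_reflection` and the toy stand-ins `toy_attempt`, `toy_evaluate`, and `toy_reflect` are hypothetical names invented here, and in a real system each stand-in would be a call to a language model or to the task environment.

```python
from typing import Callable, List, Tuple


def run_with_reflection(
    task: str,
    attempt: Callable[[str, List[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    max_trials: int = 3,
) -> Tuple[bool, int, List[str]]:
    """Retry a task, carrying written reflections from failed trials forward."""
    memory: List[str] = []  # reflections accumulated across trials
    for trial in range(1, max_trials + 1):
        answer = attempt(task, memory)  # the agent sees its earlier reflections
        if evaluate(answer):
            return True, trial, memory
        # On failure, record a short note about what went wrong.
        memory.append(reflect(task, answer))
    return False, max_trials, memory


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_attempt(task: str, memory: List[str]) -> str:
    # Fails on the first try, succeeds once at least one reflection exists.
    return "good answer" if memory else "bad answer"


def toy_evaluate(answer: str) -> bool:
    return answer == "good answer"


def toy_reflect(task: str, answer: str) -> str:
    return f"'{answer}' failed on '{task}'; try a different search next time."


solved, trials, memory = run_with_reflection(
    "toy question", toy_attempt, toy_evaluate, toy_reflect
)
print(solved, trials, len(memory))  # True 2 1
```

In this toy run the first trial fails, one reflection is stored, and the second trial succeeds, mirroring the pattern in which the agent fails once, explains its mistake, and then gets the answer on the next attempt.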
These AI agents were all powered by GPT-3 and GPT-3.5. In an update, the team used an agent based on GPT-4 and found that, with Reflexion, it scored 88 percent accuracy on coding tasks, compared to 67 percent for GPT-4 alone.
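The figures quoted in this article can be checked with a little arithmetic. The short sketch below uses only the numbers reported above. Reading the "around 21 percent" improvement as a gap in percentage points is an interpretation of those numbers, not a statement from the paper, and the variable names are chosen here for illustration.

```python
# Coding-task accuracy reported in the article (percent).
with_reflexion = 88
baseline = 67

# Absolute gap, in percentage points.
point_gain = with_reflexion - baseline

# Relative improvement over the baseline, in percent.
relative_gain = point_gain / baseline * 100

# AlfWorld result reported in the article: 130 of 134 tasks solved.
alfworld_accuracy = 130 / 134 * 100

print(point_gain)                  # 21
print(f"{relative_gain:.1f}")      # 31.3
print(f"{alfworld_accuracy:.1f}")  # 97.0
```

So the 21-point gap corresponds to a relative improvement of roughly 31 percent over the baseline, and 130 out of 134 tasks matches the reported 97 percent.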
"It’s not everyday that humans develop novel techniques to achieve state-of-the-art standards using decision-making processes once thought to be unique to human intelligence," the team added on Substack. "But, that’s exactly what we did."
The paper is published on the preprint server arXiv.