How multiagent fine-tuning overcomes the data bottleneck of LLMs

Image created with Ideogram

This article is part of our coverage of the latest in AI research.

A new technique proposed by researchers at MIT, Harvard, Stanford, and DeepMind uses multiple agents to solve one of the most pressing problems of large language models (LLMs): shortage of quality training data. AI labs have run out of data to train LLMs as frontier models have already consumed most of what’s useful on the web. 

Among the proposed solutions to this bottleneck is the creation of synthetic data. One approach is self-improvement, where LLMs generate quality examples to train themselves. For example, the LLM is prompted to solve a math, reasoning, or coding problem. The model generates reasoning chains and responses, evaluates the results, and adds valid examples to a training dataset that is used to fine-tune the model in the next training cycle.
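In rough pseudocode, a single-agent self-improvement cycle might look like the sketch below, where generate, is_correct, and fine_tune are hypothetical helpers standing in for the model call, the answer check, and the training step:

def self_improve(model, problems, num_iterations=3, samples_per_problem=4):
    # Minimal sketch of single-agent self-improvement (hypothetical helpers).
    for _ in range(num_iterations):
        dataset = []
        for problem in problems:
            for _ in range(samples_per_problem):
                # The model produces a reasoning chain and a final answer.
                reasoning, answer = generate(model, problem)
                # Keep only the examples the evaluation step accepts as valid.
                if is_correct(problem, answer):
                    dataset.append((problem, reasoning, answer))
        # Fine-tune the same model on its own verified outputs.
        model = fine_tune(model, dataset)
    return model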

This approach is effective but limited: several studies show that such methods plateau after a few training iterations, which restricts how far self-improvement can go.

Multiagent debate and fine-tuning

To improve performance, the new technique builds on the concept of multiagent debate, where several LLM agents draft and refine responses together. Instead of fine-tuning a single model, the framework uses the same debate-and-refinement process to generate diverse datasets and fine-tune multiple models. The models are derived from the same base model, and each is trained to specialize in certain parts of the target task.

The framework is composed of generation and critic agents. For each problem, the first group of LLMs, the generation agents, creates initial responses. The role of the generation models is to answer input questions accurately. Each model is prompted in a different way to create a diverse set of reasoning chains and responses.

Then the critic agents assess the outputs from all generation agents and select the most effective response or generate feedback for improvement. The role of the critic agents is to provide accurate critiques of LLM-generated responses and use those responses to produce an updated answer. The agents can engage in multiple rounds of debate and feedback to further refine the answer.
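A single debate round could be sketched roughly as follows, with generate_answer, critique, and revise as hypothetical stand-ins for prompted LLM calls:

def debate(problem, generation_agents, critic_agents, num_rounds=2):
    # Each generation agent drafts an initial response from its own prompt variant.
    responses = [generate_answer(agent, problem) for agent in generation_agents]
    critiques = []
    for _ in range(num_rounds):
        # Critic agents assess all drafts and produce feedback plus an updated answer.
        critiques = [critique(agent, problem, responses) for agent in critic_agents]
        # Generation agents refine their responses in light of the critiques.
        responses = [revise(agent, problem, responses, critiques)
                     for agent in generation_agents]
    return responses, critiques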

The updated responses and critiques are then used to create datasets to fine-tune both the generation and the critic agents. Once both sets of agents are trained, they repeat the cycle to create better responses. To ensure diversity, each generation and critic agent is fine-tuned on a different set of examples drawn from its own interactions. As the cycle repeats, the datasets keep improving and each agent becomes better at specific parts of the task.
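Putting the pieces together, the outer fine-tuning loop might look roughly like this sketch, which reuses the hypothetical debate, is_correct, and fine_tune helpers from above and adds a hypothetical clone helper for copying the base model:

def multiagent_finetune(base_model, problems, num_agents=3, num_cycles=5):
    generation_agents = [clone(base_model) for _ in range(num_agents)]
    critic_agents = [clone(base_model) for _ in range(num_agents)]
    for _ in range(num_cycles):
        gen_data = [[] for _ in generation_agents]
        critic_data = [[] for _ in critic_agents]
        for problem in problems:
            responses, critiques = debate(problem, generation_agents, critic_agents)
            # Each agent only collects examples produced by its own interactions,
            # which keeps the per-agent training sets diverse.
            for i, response in enumerate(responses):
                if is_correct(problem, response):
                    gen_data[i].append((problem, response))
            for i, crit in enumerate(critiques):
                critic_data[i].append((problem, responses, crit))
        # Fine-tune each agent on its own dataset before the next cycle.
        generation_agents = [fine_tune(a, d) for a, d in zip(generation_agents, gen_data)]
        critic_agents = [fine_tune(a, d) for a, d in zip(critic_agents, critic_data)]
    return generation_agents, critic_agents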

Multiagent fine-tuning (source: arXiv)

Unlike the classic self-improvement framework, this approach keeps improving the quality of the training data across many iterations, thanks to the diversity of behaviors it creates.

“We found that iterative application of the multiagent finetuning allows for continuous learning and adaptation, leading to progressively refined and more accurate responses over time,” the researchers write. 

The researchers further note that by “training each model on distinct sets of data and roles, our approach fosters specialization across models and promotes diversification within the society of models. Consequently, our system can autonomously improve over many more rounds of finetuning compared to single-agent self-improvement methods.”

Multiagent fine-tuning continues to improve performance on the MATH benchmark across multiple iterations, while single-agent fine-tuning plateaus quickly (source: arXiv)

During inference, the framework uses the ecosystem of generation and critic agents to draft multiple responses and refine them through multiagent debate. Each agent takes the responses from all other agents and generates a new response in each round of the debate.

“We found that summarizing the responses from the other agents helps eliminate redundant information while retaining the most important details, thereby further improving performance,” the researchers write.
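As a rough illustration, the inference-time debate with this summarization step could look something like the sketch below; generate_answer (here taking the summary as optional context), summarize, and majority_vote are hypothetical stand-ins for prompted LLM calls and answer aggregation:

def answer(problem, agents, num_rounds=2):
    # First round: every agent (generation and critic) drafts its own response.
    responses = [generate_answer(agent, problem) for agent in agents]
    for _ in range(num_rounds):
        updated = []
        for i, agent in enumerate(agents):
            # Summarize the other agents' responses to drop redundant information
            # while keeping the important details.
            others = responses[:i] + responses[i + 1:]
            summary = summarize(others)
            updated.append(generate_answer(agent, problem, context=summary))
        responses = updated
    # Aggregate the final round of answers, e.g. with a simple majority vote.
    return majority_vote(responses)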

Multiagent fine-tuning in action

The researchers tested the approach on several reasoning benchmarks covering arithmetic, grade-school math, and competition-level math problems. They used it with open-source models (Mistral 7B, Llama 3-8B, and Phi 3-4B) as well as GPT-3.5. Since multiagent debate and fine-tuning do not require access to the models' internal weights, the technique was applicable to both open and closed models.

The results showed that the multiagent approach outperformed other techniques, including majority voting (where the model produces several independent answers and the most common one is selected) and approaches where agents refine their answers but do not go through the fine-tuning process. Moreover, the fine-tuned models could generalize to unseen tasks and outperform baseline methods that directly train the models on the target task. For example, an agent ecosystem fine-tuned on the MATH dataset performed very well on the GSM benchmark.
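For reference, the majority-voting baseline can be sketched in a few lines (generate is the same hypothetical helper used in the earlier sketches, returning a reasoning chain and a final answer):

from collections import Counter

def majority_vote_answer(model, problem, num_samples=5):
    # Sample several independent answers and return the most frequent one.
    answers = [generate(model, problem)[1] for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]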

Importantly, the multiagent approach continued to improve performance across multiple iterations, while other self-improvement methods started regressing after a few cycles.

One tradeoff of the multiagent approach is cost, since it requires training and running multiple copies of the model at the same time. It will be interesting to see whether optimization techniques such as LoRA and quantization can help overcome this challenge. For the moment, however, multiagent fine-tuning seems to address one of the most pressing problems the AI community faces.
