Understanding LLM ensembles and mixture-of-agents (MoA)

LLM ensemble (image created with Ideogram)

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Large language models (LLMs) can accomplish many tasks alone. But combining multiple LLMs can produce even better results, a technique that has become known as “LLM ensembles.” Inspired by classic machine learning ensembles, LLM ensembles come in different flavors, from simple voting schemes to advanced “mixture-of-agents” combinations.

LLM ensembles vs multiple sampling

LLM ensembles are related to multi-sampling techniques. In multi-sampling, the same prompt is provided to one or several LLMs multiple times to generate N responses. Those responses are then evaluated to choose the best one. For example, if the answer can be verified objectively with something such as a code interpreter, then the verified answer is chosen. In other cases, you can use techniques such as majority voting and self-consistency, where the answer that appears most often is chosen as the final solution.
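The majority-voting step above can be sketched in a few lines. This is a minimal, self-contained example: `sample_model` is a stub standing in for real LLM API calls (a real implementation would query a model N times at temperature > 0), and only the vote-counting logic is meant to be illustrative.

```python
# Minimal sketch of best-of-N selection via majority voting (self-consistency).
from collections import Counter

def sample_model(prompt: str, n: int) -> list[str]:
    # Stub: a real implementation would call an LLM n times with temperature > 0.
    return ["42", "41", "42", "42", "40"][:n]

def majority_vote(answers: list[str]) -> str:
    """Pick the answer that appears most often across the N samples."""
    return Counter(answers).most_common(1)[0][0]

answers = sample_model("What is 6 * 7?", n=5)
print(majority_vote(answers))  # -> 42
```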

At the most basic level, LLM ensembles can be used like multi-sampling. However, in most cases, the model and answer selection processes are more complicated. For example, in many cases, LLM ensembles select different models with complementary strengths and weaknesses to create a diverse set of responses.

LLM ensembles also use more advanced methods to choose answers. For example, ensembles might use “weight averaging,” where each LLM’s response is weighted according to the model’s strengths and weaknesses on that particular task or the confidence score of its answer. Another method is routing, where one or more models within an ensemble are selected based on a predefined set of criteria. Ensembles are more like a team of LLMs that work together to provide an answer.
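To make the two selection methods concrete, here is a hedged sketch of weighted voting and a toy router. The weights and the keyword-based routing rule are illustrative assumptions, not a specific published method; in practice, weights would come from per-task evaluations or model confidence scores, and routers are often small learned classifiers.

```python
# Sketch of weighted answer selection and routing across an ensemble.
from collections import defaultdict

def weighted_vote(responses: list[tuple[str, float]]) -> str:
    """Each (answer, weight) pair reflects a model's task-specific reliability
    or confidence; the answer with the highest total weight wins."""
    scores = defaultdict(float)
    for answer, weight in responses:
        scores[answer] += weight
    return max(scores, key=scores.get)

def route(prompt: str, models: dict[str, str]) -> str:
    """Toy router: pick a model via a keyword heuristic (an assumption here;
    real routers are usually trained classifiers)."""
    if "code" in prompt.lower():
        return models["coder"]
    return models["generalist"]

responses = [("Paris", 0.9), ("Lyon", 0.4), ("Paris", 0.6)]
print(weighted_vote(responses))  # -> Paris
```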

Ensembles are designed to be more dynamic and tackle more complex problems. But they are also more complicated and difficult to implement and can be more expensive to run than more basic best-of-N paradigms.

Mixture-of-agents (MoA)

Mixture of agents (source: arXiv)

Mixture-of-agents (MoA) is one of the more effective and popular types of LLM ensembling. MoA first queries multiple LLMs (proposers) to generate responses. It then uses another LLM as the “aggregator” to create a high-quality response by synthesizing and summarizing the output from the proposers. MoA works like an executive team that takes proposals from different parties and uses them to create a final solution.
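The proposer/aggregator flow can be sketched as follows. `call_model` is a stub standing in for real LLM API calls, and the aggregation prompt wording is an assumption; the point is the shape of the pipeline: fan out to proposers, then synthesize with one aggregator.

```python
# Minimal sketch of a mixture-of-agents (MoA) pipeline: several proposer
# models answer the prompt, then an aggregator model synthesizes a final answer.
def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would call the named LLM's API here.
    return f"[{model}] answer to: {prompt}"

def mixture_of_agents(prompt: str, proposers: list[str], aggregator: str) -> str:
    # Fan out: collect one candidate response per proposer model.
    proposals = [call_model(m, prompt) for m in proposers]
    # Synthesize: the aggregator sees all candidates plus the original question.
    agg_prompt = (
        "Synthesize the best final answer from these candidate responses:\n"
        + "\n".join(f"- {p}" for p in proposals)
        + f"\n\nOriginal question: {prompt}"
    )
    return call_model(aggregator, agg_prompt)
```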

The classic approach to creating MoA hinges on diversity in the proposer models. Some studies show that using MoA with a diverse range of small but specialized models can outperform single large models.

However, according to a recent study by researchers at Princeton University, diversity in MoA proposers might have an adverse effect and hurt the overall performance of the ensemble. Their experiments show that MoA performance is sensitive to the quality of the models being mixed, and mixing different LLMs can lower the average quality.

Instead, they propose “self-MoA,” where a single strong model is used as both the proposer and the aggregator. In this mode, multiple answers are sampled from the same model, using higher temperatures and the stochastic nature of LLMs to create a diverse range of responses. Then, the aggregator uses the proposed answers to provide the final response. Their experiments show that self-MoA outperforms the classic MoA (also referred to as mixed-MoA) on a wide range of benchmarks, including the popular AlpacaEval 2.0 dataset.
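In code, self-MoA differs from mixed-MoA only in where the candidates come from: the same model is sampled N times at a high temperature, then aggregates its own outputs. This is a hedged sketch with a stub `call_model`; the temperature values are illustrative assumptions.

```python
# Sketch of self-MoA: one strong model is both proposer and aggregator.
def call_model(model: str, prompt: str, temperature: float = 0.0) -> str:
    # Stub: a real version would call the LLM API with this temperature.
    return f"[{model} @ T={temperature}] {prompt[:40]}"

def self_moa(prompt: str, model: str, n_samples: int = 4) -> str:
    # Sample N diverse responses from the *same* model at high temperature,
    # relying on sampling stochasticity rather than model diversity.
    proposals = [call_model(model, prompt, temperature=1.0) for _ in range(n_samples)]
    agg_prompt = (
        "Combine these candidate answers into one final answer:\n"
        + "\n".join(proposals)
    )
    # Aggregate with the same model, deterministically.
    return call_model(model, agg_prompt, temperature=0.0)
```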

The researchers also propose self-MoA-seq, a sequential version of self-MoA designed to work with models that have limited context windows. Self-MoA-seq uses a “sliding window approach,” where at any given time, a limited number of responses are sampled and given to the aggregator. When the aggregator generates its response, another set of responses is sampled and fed back to the aggregator along with its previous output. This process can be repeated iteratively until the MoA produces the desired response or a predefined number of cycles is exhausted.
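The sliding-window loop above can be sketched like this. The `aggregate` stub and the way the previous output is carried forward are assumptions made for illustration; the key idea is that the aggregator never sees more than `window` new candidates at a time, keeping each call within a limited context window.

```python
# Sketch of the self-MoA-seq sliding-window idea: candidate responses are fed
# to the aggregator a few at a time, and each round's output is carried into
# the next round alongside the next batch of samples.
def aggregate(model: str, prompt: str, candidates: list[str]) -> str:
    # Stub: a real aggregator would ask the LLM to synthesize the candidates.
    return "synthesis(" + "; ".join(candidates) + ")"

def self_moa_seq(prompt: str, model: str, responses: list[str], window: int = 2) -> str:
    current = None
    for i in range(0, len(responses), window):
        chunk = responses[i:i + window]
        # Carry the previous aggregate forward with the new batch of samples.
        candidates = ([current] if current else []) + chunk
        current = aggregate(model, prompt, candidates)
    return current
```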

LLM ensembles and mixture-of-agents are among several inference-time scaling techniques that use more compute cycles during inference to improve the abilities of LLMs.
