Understanding LLM ensembles and mixture-of-agents (MoA)

LLM ensemble (image created with Ideogram)

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Large language models (LLMs) can accomplish many tasks alone. But combining multiple LLMs can produce even better results, a technique that has become known as “LLM ensembles.” Inspired by classic machine learning ensembles, LLM ensembles come in different flavors, from simple voting schemes to advanced “mixture-of-agents” combinations.

LLM ensembles vs multiple sampling

LLM ensembles are related to multi-sampling techniques. In multi-sampling, the same prompt is provided to one or several LLMs multiple times to generate N responses. Those responses are then evaluated to choose the best one. For example, if the answer can be verified objectively with something such as a code interpreter, then the verified answer is chosen. In other cases, you can use techniques such as majority voting and self-consistency, where the answer that appears most often is chosen as the final solution.
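The majority-voting step above can be sketched in a few lines. This is a minimal, self-contained example: `sample_model` is a stub standing in for real LLM API calls (a real implementation would query a model N times at temperature > 0), and only the vote-counting logic is meant to be illustrative.

```python
# Minimal sketch of best-of-N selection via majority voting (self-consistency).
from collections import Counter

def sample_model(prompt: str, n: int) -> list[str]:
    # Stub: a real implementation would call an LLM n times with temperature > 0.
    return ["42", "41", "42", "42", "40"][:n]

def majority_vote(answers: list[str]) -> str:
    """Pick the answer that appears most often across the N samples."""
    return Counter(answers).most_common(1)[0][0]

answers = sample_model("What is 6 * 7?", n=5)
print(majority_vote(answers))  # -> 42
```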

At the most basic level, LLM ensembles can be used like multi-sampling. However, in most cases, the model and answer selection processes are more complicated. For example, in many cases, LLM ensembles select different models with complementary strengths and weaknesses to create a diverse set of responses.

LLM ensembles also use more advanced methods to choose answers. For example, ensembles might use “weight averaging,” where each LLM’s response is weighted according to the model’s strengths and weaknesses on that particular task or the confidence score of its answer. Another method is routing, where one or more models within an ensemble are selected based on a predefined set of criteria. Ensembles are more like a team of LLMs that work together to provide an answer.
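To make the two selection methods concrete, here is a hedged sketch of weighted voting and a toy router. The weights and the keyword-based routing rule are illustrative assumptions, not a specific published method; in practice, weights would come from per-task evaluations or model confidence scores, and routers are often small learned classifiers.

```python
# Sketch of weighted answer selection and routing across an ensemble.
from collections import defaultdict

def weighted_vote(responses: list[tuple[str, float]]) -> str:
    """Each (answer, weight) pair reflects a model's task-specific reliability
    or confidence; the answer with the highest total weight wins."""
    scores = defaultdict(float)
    for answer, weight in responses:
        scores[answer] += weight
    return max(scores, key=scores.get)

def route(prompt: str, models: dict[str, str]) -> str:
    """Toy router: pick a model via a keyword heuristic (an assumption here;
    real routers are usually trained classifiers)."""
    if "code" in prompt.lower():
        return models["coder"]
    return models["generalist"]

responses = [("Paris", 0.9), ("Lyon", 0.4), ("Paris", 0.6)]
print(weighted_vote(responses))  # -> Paris
```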

Ensembles are designed to be more dynamic and tackle more complex problems. But they are also more complicated and difficult to implement and can be more expensive to run than more basic best-of-N paradigms.

Mixture-of-agents (MoA)

Mixture of agents (source: arXiv)

Mixture-of-agents (MoA) is one of the more effective and popular types of LLM ensembling. MoA first queries multiple LLMs (proposers) to generate responses. It then uses another LLM as the “aggregator” to create a high-quality response by synthesizing and summarizing the output from the proposers. MoA works like an executive team that takes proposals from different parties and uses them to create a final solution.
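The proposer/aggregator flow can be sketched as follows. `call_model` is a stub standing in for real LLM API calls, and the aggregation prompt wording is an assumption; the point is the shape of the pipeline: fan out to proposers, then synthesize with one aggregator.

```python
# Minimal sketch of a mixture-of-agents (MoA) pipeline: several proposer
# models answer the prompt, then an aggregator model synthesizes a final answer.
def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would call the named LLM's API here.
    return f"[{model}] answer to: {prompt}"

def mixture_of_agents(prompt: str, proposers: list[str], aggregator: str) -> str:
    # Fan out: collect one candidate response per proposer model.
    proposals = [call_model(m, prompt) for m in proposers]
    # Synthesize: the aggregator sees all candidates plus the original question.
    agg_prompt = (
        "Synthesize the best final answer from these candidate responses:\n"
        + "\n".join(f"- {p}" for p in proposals)
        + f"\n\nOriginal question: {prompt}"
    )
    return call_model(aggregator, agg_prompt)
```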

The classic approach to creating MoA hinges on diversity in the proposer models. Some studies show that using MoA with a diverse range of small but specialized models can outperform single large models.

However, according to a recent study by researchers at Princeton University, diversity in MoA proposers might have an adverse effect and hurt the overall performance of the ensemble. Their experiments show that MoA performance is sensitive to the quality of the models being mixed, and mixing different LLMs can lower the average quality.

Instead, they propose “self-MoA,” where a single strong model is used as both the proposer and the aggregator. In this mode, multiple answers are sampled from the same model, using higher temperatures and the stochastic nature of LLMs to create a diverse range of responses. Then, the aggregator uses the proposed answers to provide the final response. Their experiments show that self-MoA outperforms the classic MoA (also referred to as mixed-MoA) on a wide range of benchmarks, including the popular AlpacaEval 2.0 dataset.
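In code, self-MoA differs from mixed-MoA only in where the candidates come from: the same model is sampled N times at a high temperature, then aggregates its own outputs. This is a hedged sketch with a stub `call_model`; the temperature values are illustrative assumptions.

```python
# Sketch of self-MoA: one strong model is both proposer and aggregator.
def call_model(model: str, prompt: str, temperature: float = 0.0) -> str:
    # Stub: a real version would call the LLM API with this temperature.
    return f"[{model} @ T={temperature}] {prompt[:40]}"

def self_moa(prompt: str, model: str, n_samples: int = 4) -> str:
    # Sample N diverse responses from the *same* model at high temperature,
    # relying on sampling stochasticity rather than model diversity.
    proposals = [call_model(model, prompt, temperature=1.0) for _ in range(n_samples)]
    agg_prompt = (
        "Combine these candidate answers into one final answer:\n"
        + "\n".join(proposals)
    )
    # Aggregate with the same model, deterministically.
    return call_model(model, agg_prompt, temperature=0.0)
```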

The researchers also propose self-MoA-seq, a sequential version of self-MoA designed to work with models that have limited context windows. Self-MoA-seq uses a “sliding window approach,” where at any given time, a limited number of responses are sampled and given to the aggregator. When the aggregator generates its response, another set of responses is sampled and fed back to the aggregator along with its previous output. This process can be repeated iteratively until the MoA produces the desired response or a predefined number of cycles is exhausted.
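The sliding-window loop above can be sketched like this. The `aggregate` stub and the way the previous output is carried forward are assumptions made for illustration; the key idea is that the aggregator never sees more than `window` new candidates at a time, keeping each call within a limited context window.

```python
# Sketch of the self-MoA-seq sliding-window idea: candidate responses are fed
# to the aggregator a few at a time, and each round's output is carried into
# the next round alongside the next batch of samples.
def aggregate(model: str, prompt: str, candidates: list[str]) -> str:
    # Stub: a real aggregator would ask the LLM to synthesize the candidates.
    return "synthesis(" + "; ".join(candidates) + ")"

def self_moa_seq(prompt: str, model: str, responses: list[str], window: int = 2) -> str:
    current = None
    for i in range(0, len(responses), window):
        chunk = responses[i:i + window]
        # Carry the previous aggregate forward with the new batch of samples.
        candidates = ([current] if current else []) + chunk
        current = aggregate(model, prompt, candidates)
    return current
```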

LLM ensembles and mixture-of-agents are among several inference-time scaling techniques that use more compute cycles during inference to improve the abilities of LLMs.
