
Together MoA — collective intelligence of open-source models pushing the frontier of LLM capabilities

June 11, 2024

By Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou

We introduce Mixture of Agents (MoA), an approach that harnesses the collective strengths of multiple LLMs to improve state-of-the-art quality, and we provide a reference implementation, Together MoA, which leverages several open-source LLM agents to achieve a score of 65.1% on AlpacaEval 2.0, surpassing the prior leader GPT-4o (57.5%).

Figure 1: Illustration of the Mixture-of-Agents Structure. This example showcases 4 MoA layers with 3 agents in each layer. The agents here can share the same model.

Overview

We are excited to introduce Mixture of Agents (MoA), a novel approach to harness the collective strengths of multiple LLMs. MoA adopts a layered architecture where each layer comprises several LLM agents. These agents take the outputs from the previous layer as auxiliary information to generate refined responses. This approach allows MoA to effectively integrate diverse capabilities and insights from various models, resulting in a more robust and versatile combined model.

Our reference implementation, Together MoA, significantly surpasses GPT-4o (57.5%) on AlpacaEval 2.0, achieving a score of 65.1% using only open-source models. While Together MoA achieves higher accuracy, it comes at the cost of a slower time to first token; reducing this latency is an exciting future direction for this research.

Our approach is detailed in a technical paper on arXiv, and the open-source code, including a simple interactive demo, is available at togethercomputer/moa. We look forward to seeing how MoA will be used to push the boundaries of what AI can achieve.

Mixture of Agents

Our research is based on a key observation we term the collaborativeness of LLMs — the phenomenon where an LLM tends to generate better responses when presented with outputs from other models, even if these other models are less capable on their own.

To investigate whether this phenomenon is prevalent across open-source models, we evaluated each model's AlpacaEval 2.0 score when it was given responses from other models as references while generating its answer. Figure 2 shows that every model improves significantly over its base score. This improvement occurs even when the reference responses are of lower quality than the model's own.

Figure 2: AlpacaEval 2.0 LC win rates improve when provided with responses from other models.
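To make this setup concrete, the sketch below shows one way responses from other models can be injected into a model's prompt as auxiliary context. The wording of the template is an illustrative assumption; the exact prompt used in our experiments may differ.

```python
# Illustrative sketch only: the exact prompt template used in the experiments may differ.
def build_aggregate_prompt(user_query: str, reference_responses: list[str]) -> str:
    """Wrap a user query with responses from other models as auxiliary context."""
    header = (
        "You have been provided with a set of responses from various models to the "
        "user query below. Use them as references and produce a single, high-quality response.\n\n"
    )
    refs = "\n\n".join(
        f"[Response {i + 1}]\n{resp}" for i, resp in enumerate(reference_responses)
    )
    return f"{header}{refs}\n\n[User query]\n{user_query}"
```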

To effectively leverage the collaboration of multiple LLMs, we categorize their roles based on their strengths in different aspects of collaboration:

Proposers: These models generate initial reference responses. While a proposer might produce a high-quality response on its own, its main value lies in offering nuanced and diverse perspectives that serve as valuable references for the aggregator.

Aggregators: These models synthesize the different responses from the proposers into a single, high-quality response.

Based on this categorization, we propose a layered process to improve responses, as illustrated in Figure 1. Initially, several proposers independently generate responses to a given prompt. These responses are then presented to aggregators in the next layer, who synthesize them into higher-quality responses. This iterative process continues through several layers until a more robust and comprehensive response is achieved.
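As a concrete illustration of this layered process, here is a minimal sketch of a proposer/aggregator loop. It assumes the Together Python SDK and a small placeholder set of model identifiers; the actual reference implementation in togethercomputer/moa handles prompting, concurrency, and streaming differently.

```python
from together import Together  # assumes the `together` Python SDK is installed

client = Together()  # reads TOGETHER_API_KEY from the environment

# Placeholder subset of proposers; see the full configurations later in this post.
PROPOSERS = ["Qwen/Qwen1.5-72B-Chat", "meta-llama/Llama-3-70b-chat-hf"]
AGGREGATOR = "Qwen/Qwen1.5-110B-Chat"
NUM_LAYERS = 3

def chat(model: str, prompt: str) -> str:
    """Single chat completion; error handling omitted for brevity."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def with_references(query: str, references: list[str]) -> str:
    """Attach previous-layer responses as auxiliary context (wording is illustrative)."""
    refs = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
    return (
        "Use the following candidate responses as references and produce a single, "
        f"high-quality answer to the user query.\n\n{refs}\n\n[User query]\n{query}"
    )

def mixture_of_agents(query: str) -> str:
    # Layer 1: proposers answer the raw query independently.
    responses = [chat(m, query) for m in PROPOSERS]
    # Intermediate layers: proposers refine using the previous layer's outputs.
    for _ in range(NUM_LAYERS - 2):
        responses = [chat(m, with_references(query, responses)) for m in PROPOSERS]
    # Final layer: a single aggregator synthesizes the last set of responses.
    return chat(AGGREGATOR, with_references(query, responses))
```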

Together MoA achieves state-of-the-art performance

Below we give an overview of the reference implementations shown on the leaderboards:

  • Together MoA uses six open-source models as proposers and Qwen1.5-110B-Chat as the final aggregator. The six open-source models are: WizardLM-2-8x22b, Qwen1.5-110B-Chat, Qwen1.5-72B-Chat, Llama-3-70B-Chat, Mixtral-8x22B-Instruct-v0.1, and dbrx-instruct. We designed MoA with a total of three layers, striking a good balance between quality and performance (see the configuration sketch after this list).
  • Together MoA-Lite uses the same set of proposers, but uses Qwen1.5-72B-Chat as the aggregator and only has two layers.
  • Together MoA w/ GPT-4o also uses the same set of proposers and has three layers, but the final aggregator is changed to GPT-4o.
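For reference, the three configurations above can be written down as simple configs like the sketch below. The Together API model identifiers are our best guesses and should be checked against the togethercomputer/moa repository.

```python
# Hypothetical configuration sketch; verify model identifiers against togethercomputer/moa.
PROPOSERS = [
    "microsoft/WizardLM-2-8x22B",
    "Qwen/Qwen1.5-110B-Chat",
    "Qwen/Qwen1.5-72B-Chat",
    "meta-llama/Llama-3-70b-chat-hf",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "databricks/dbrx-instruct",
]

CONFIGS = {
    "together-moa":       {"proposers": PROPOSERS, "aggregator": "Qwen/Qwen1.5-110B-Chat", "layers": 3},
    "together-moa-lite":  {"proposers": PROPOSERS, "aggregator": "Qwen/Qwen1.5-72B-Chat",  "layers": 2},
    "together-moa-gpt4o": {"proposers": PROPOSERS, "aggregator": "gpt-4o",                 "layers": 3},
}
```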

We present our evaluation results on three standard benchmarks: AlpacaEval 2.0, MT-Bench, and FLASK. These benchmarks were chosen to comprehensively assess the performance of our approach and compare it against state-of-the-art LLMs. Specifically, we achieved top positions on both the AlpacaEval 2.0 leaderboard and MT-Bench. Notably, on AlpacaEval 2.0, using solely open-source models, we achieved an absolute improvement of 7.6%, from 57.5% (GPT-4o) to 65.1% (Together MoA). The Together MoA-Lite configuration, despite having fewer layers and being more cost-effective, still achieved scores comparable to those of GPT-4o.

Model | LC win rate | Win rate
Together MoA w/ GPT-4o | 65.7±0.7% | 78.7±0.2%
Together MoA | 65.1±0.6% | 59.8±0.3%
Together MoA-Lite | 59.3±0.2% | 57.0±0.7%
GPT-4o (05/13) | 57.5% | 51.3%
GPT-4 Turbo (04/09) | 55.0% | 46.1%
WizardLM 8x22B† | 51.3% | 62.3%
GPT-4 Preview (11/06) | 50.0% | 50.0%
Qwen1.5 110B Chat | 43.9% | 33.8%
Qwen1.5 72B Chat | 36.6% | 26.5%
GPT-4 (03/14) | 35.3% | 22.1%
Llama 3 70B Instruct | 34.4% | 33.2%
Mixtral 8x22B v0.1 | 30.9% | 22.2%

Table 1: Results on AlpacaEval 2.0. We ran our experiments three times and reported the average scores along with the standard deviation. † denotes our replication of the AlpacaEval results.

Model | Average | 1st turn | 2nd turn
MoA w/ GPT-4o | 9.40±0.06 | 9.49 | 9.31
GPT-4 Turbo (04/09) | 9.31 | 9.35 | 9.28
MoA | 9.25±0.10 | 9.44 | 9.07
GPT-4 Preview (11/06) | 9.20 | 9.38 | 9.03
GPT-4 Omni (05/13) | 9.19 | 9.31 | 9.07
MoA-Lite | 9.18±0.09 | 9.38 | 8.99
Qwen1.5 110B Chat | 8.96 | 9.23 | 8.63
Llama 3 70B Instruct | 8.94 | 9.20 | 8.68
Mixtral 8x22B v0.1 | 8.78 | 9.11 | 8.44
WizardLM 8x22B | 8.78 | 8.96 | 8.61
Qwen1.5 72B Chat | 8.44 | 8.55 | 8.34
GPT-4 (06/13) | 8.84 | 9.08 | 8.61

Table 2: Results on MT-Bench. We ran our experiments three times and reported the average scores along with the standard deviation. We ran all the MT-Bench scores ourselves to get turn-based scores.

FLASK offers fine-grained evaluation of models across multiple dimensions. The Together MoA method significantly outperforms the original Qwen1.5-110B-Chat on harmlessness, robustness, correctness, efficiency, factuality, commonsense, insightfulness, and completeness. Together MoA also outperforms GPT-4o in terms of correctness, factuality, insightfulness, completeness, and metacognition.

Figure 3: Results on FLASK where we use the 6-proposer MoA setup.

Do we need multiple layers in MoA?

We also benchmarked the LC win rate after each layer of Together MoA on AlpacaEval 2.0 and observed a consistent, monotonic performance gain with each additional layer. All the curves use the same six proposer agents; they differ only in the choice of aggregator on top of them. We also added a baseline in which an LLM ranker (Qwen1.5-110B-Chat) is used to pick the best of the reference responses. The gap between the aggregators and this baseline further demonstrates that the aggregator performs sophisticated synthesis rather than simply selecting one of the responses.

Figure 4: LC win rate on AlpacaEval 2.0 with different aggregators in the 6-model MoA setup. All curves use the same six proposer agents; they differ only in the choice of the final aggregator. The LLM ranker uses the Qwen1.5-110B-Chat model.
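For comparison, here is a minimal sketch of the LLM-ranker baseline: instead of synthesizing, the model is asked only to pick one of the reference responses verbatim. The prompt wording and the answer parsing below are illustrative assumptions.

```python
import re
from typing import Callable

def pick_best_response(rank_fn: Callable[[str], str], query: str, references: list[str]) -> str:
    """LLM-ranker baseline: select one reference verbatim instead of synthesizing.

    `rank_fn(prompt) -> str` is any chat-completion call, e.g. to Qwen1.5-110B-Chat.
    """
    listing = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
    prompt = (
        "Below are candidate responses to a user query. Reply with only the number "
        f"of the single best response.\n\n[User query]\n{query}\n\n{listing}"
    )
    answer = rank_fn(prompt)
    match = re.search(r"\d+", answer)            # parse the chosen index
    idx = int(match.group()) - 1 if match else 0  # fall back to the first reference
    return references[min(max(idx, 0), len(references) - 1)]
```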

Do we need multiple LLMs as proposers?

To assess the influence of the number of proposers on performance, we conducted experiments with varying numbers of proposed answers. We denote by n the number of proposed outputs, drawn either from n different proposers or from a single proposer sampled n times. Qwen1.5-110B-Chat serves as the aggregator in all settings in Table 3.

There is a clear and consistent advantage to having more proposer outputs, even in the Single-Proposer setting. However, the Multiple-Proposer configuration consistently outperforms Single-Proposer, indicating that integrating a wider variety of inputs from different models significantly enhances the output. This highlights the value of leveraging the diverse perspectives and capabilities that different models offer.

Setting | Multiple-Proposer | Single-Proposer
n=6 | 61.3% | 56.7%
n=3 | 58.8% | 56.1%
n=2 | 58.8% | 54.5%
n=1 | 47.8% | 47.8%

Table 3: Effect of the number of proposed outputs (n) on the AlpacaEval 2.0 LC win rate.
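The difference between the two settings in Table 3 can be sketched as follows: with a single proposer, the n reference outputs come from repeated (temperature-sampled) calls to the same model, while the multiple-proposer setting draws one output from each of n different models. The `chat` callable below is a stand-in for any chat-completion call, not part of our released code.

```python
from typing import Callable

ChatFn = Callable[[str, str], str]  # (model, prompt) -> response text

def propose_single(chat: ChatFn, model: str, query: str, n: int) -> list[str]:
    """Single-Proposer: n temperature-sampled outputs from the same model."""
    return [chat(model, query) for _ in range(n)]

def propose_multiple(chat: ChatFn, models: list[str], query: str, n: int) -> list[str]:
    """Multiple-Proposer: one output from each of n different models."""
    return [chat(model, query) for model in models[:n]]
```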

The cost-effectiveness of MoA

To gain a deeper understanding of accuracy and cost-effectiveness, we present a figure that illustrates the relationship between the two. In the figure, we plot the LC win rate against the average inference cost per query. For open-source models, we calculate the price using data from the Together API; for OpenAI models, we use pricing details from the OpenAI API. Pricing data was retrieved as of May 22, 2024.
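A rough sketch of how the average cost per query can be computed is shown below. The per-token prices and the assumption of a single blended rate per model are placeholders, not the actual May 22, 2024 pricing used for the figure.

```python
# Placeholder prices in USD per 1M tokens; not the actual pricing used for Figure 5.
PRICE_PER_M_TOKENS = {
    "Qwen/Qwen1.5-110B-Chat": 1.80,
    "meta-llama/Llama-3-70b-chat-hf": 0.90,
}

def query_cost(calls: list[dict]) -> float:
    """Sum the token cost of every LLM call made while answering one query.

    Each call dict holds: model, prompt_tokens, completion_tokens. A single
    blended rate per model is assumed for simplicity; APIs that price input
    and output tokens separately would need two rates.
    """
    total = 0.0
    for call in calls:
        rate = PRICE_PER_M_TOKENS[call["model"]]
        total += (call["prompt_tokens"] + call["completion_tokens"]) * rate / 1_000_000
    return total
```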

The dashed curve identifies the Pareto front, indicating the optimal balance between cost and performance. If we prioritize performance, the Together MoA configuration is the best choice. However, if we aim to strike a good balance between quality and cost, the Together MoA-Lite configuration can match GPT-4o's cost while achieving higher quality.
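The Pareto front itself can be computed with a simple scan: sort the (cost, win rate) points by cost and keep each point that improves on the best win rate seen so far. A minimal sketch, with arbitrary example numbers:

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the (cost, lc_win_rate) points not dominated by a cheaper, better point."""
    front, best_win = [], float("-inf")
    for cost, win in sorted(points, key=lambda p: (p[0], -p[1])):
        if win > best_win:
            front.append((cost, win))
            best_win = win
    return front

# Arbitrary illustrative (cost, win-rate) points, not values from the figure.
print(pareto_front([(0.5, 40.0), (0.8, 52.0), (1.0, 48.0), (2.0, 60.0)]))
```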

Figure 5: Performance trade-off versus cost.

Acknowledgements

This work was made possible by the collaborative spirit and contributions of several active organizations in the open-source AI community. We appreciate the contributions of Meta AI, Mistral AI, Microsoft, Alibaba Cloud, and Databricks in developing the Meta Llama 3, Mixtral, WizardLM, Qwen, and DBRX models. Additionally, we extend our gratitude to Tatsu Labs, LMSYS, and KAIST AI for developing the AlpacaEval, MT-Bench, and FLASK evaluation benchmarks.

Conclusion & future directions

Together MoA leverages the strengths of multiple open-source LLMs through successive stages of collaboration, leading to superior performance compared to strong closed-source models. This study highlights the potential to enhance AI systems, making them more capable, robust, and aligned with human reasoning.

We are excited about the immediate applications of this technique, such as offline processing, synthetic data generation for training, and applications where accuracy is of paramount importance.

Looking ahead, we are interested in several potential future directions. One key area of interest is the systematic optimization of the MoA architecture, exploring various choices of models, prompts, and architectural configurations. We also plan to reduce time-to-first-token latency, and we have a number of techniques that we expect will significantly improve performance. Additionally, we aim to evaluate and optimize Together MoA for more reasoning-focused tasks, further enhancing its ability to tackle complex and nuanced challenges in AI.


