Company

Together Evaluations: Benchmark Models for Your Tasks

July 28, 2025

・

Ivan Provilkov, Ruslan Khaidurov, Kirah Sapong, George Grigorev, Gleb Vazhenin, Yogish Baliga, Zain Hasan, Max Ryabinin

When building products with large language models, knowing how well a model can solve your task is crucial, especially with the rapid rate at which new LLMs appear. Without a structured approach to evaluation, one has to decide ad hoc, making it challenging to keep up and choose the most suitable model for your use case.

One effective approach is to define a task-specific benchmark and use a strong AI model as a judge. This approach lets you quickly compare responses across models or rate the performance of a single model without the overhead of manual labeling or rigid metrics.

To support this workflow, we are releasing an early preview of Together Evaluations — a fast, flexible way to benchmark LLM response quality using leading open-source judge models you control.

Join us for a practical walkthrough on July 31st where we’ll demonstrate how to use Together Evaluations for your use case. We’ll also show how to choose between new SOTA models including Kimi, Qwen, and GLM.

Why use LLM-as-a-judge?

The core part of any AI application is its underlying LLM, so it's no surprise that the quality of a model’s responses can define the success of a product. But how can we measure the performance of a model for a given task? The whole field of language model evaluation aims to answer this question, with methods ranging from human annotation to automated metrics.

In recent years, the capabilities of LLMs have become strong enough that models themselves can act as evaluators! Compared to the traditional approach of human annotation, using LLM-as-a-judge is much faster — you just need to run the model with every evaluated sample as a separate prompt, which is orders of magnitude faster compared to training annotators and waiting for their labeling. Compared to algorithmic metrics (like BLEU for translation or F1 score for classification), LLM judges are much more flexible: you can measure anything that you can describe with natural language. Lastly, LLM-as-a-judge metrics are very easy to modify if your task definition slightly changes, as you simply need to adjust the system prompt.

Evaluations with Together

Together Evaluations provides a powerful framework for defining custom LLM benchmarks and comparing different models on those benchmarks using LLM judges. Our platform offers three evaluation modes, each powered by an LLM that you fully control with prompt templates:

Mode	Purpose	Typical Tasks
Classify	Assign each sample to one of several labels you choose.	“Which policy category does this text violate?”
Score	Produce a numeric rating on any scale you define.	“How relevant is this answer on a 0‑5 scale?”
Compare	Judge which of two model responses is better and why.	“Which model’s response is more concise?”

Each evaluation returns both aggregate metrics (accuracy, mean score, or win rates) and row‑level judge feedback that you can inspect. Here are a few example tasks you can accomplish with the platform:

Challenge	How Together Evaluations Helps
Choosing the best model or prompt for your task	Launch a compare evaluation; see which model or config performs best on your task
Unknown model quality before release	Run a score task on held‑out data; ship only if your pass‑rate ≥ 95 %
Unsafe or policy‑violating content in user uploads	Create a classify task to flag unsafe or PII‑containing records
Data drift & regressions after deployment	Schedule nightly evaluations and catch quality drops before users do
Label scarcity for new tasks	Bootstrap a dataset: let the judge create weak labels you can curate later
Uncertainty about switching from proprietary to open-source	Run a compare evaluation between model responses to validate quality and cost-efficiency

Today, every language model provided via our serverless inference API is supported as a candidate for evaluation, and we offer a selection of the strongest LLMs-as-judge-models. If you already have a dataset of model generations (for example, obtained locally or via another LLM provider), you can upload it and score them with the judge without making redundant inference calls!

Step-by-step guide

Upload your data in the JSONL or the CSV format
Pick the evaluation type among classify, score, or compare
Describe what you want to evaluate by giving the judge a system_template in a Jinja2 format
- You can specify a reference answer from the dataset using the following Jinja template: “Please use the reference answer: {{reference_answer_column_name}}”
Configure the model to be evaluated:
- Use one of the available models
- Configure input_template with the reference to the column in the dataset that contains the prompt. Example: “Answer the following: {{prompt_column}}”
Retrieve the results! We provide both aggregated evaluation metrics and a resulting file with full feedback of judge responses.

You can use the UI, the API, or the Python client to submit evaluation requests. The full documentation is available here.

To showcase how to use Together Evaluations, we have prepared several demonstrations that showcase practical scenarios for LLM-as-a-judge workflows. You can find these demonstrations in the form of Jupyter notebooks here.

Evaluating with LLM-as-a-Judge: Demos & Notebooks

Explore how to use the Evals functionality to assess model outputs with LLMs acting as judges. Below are some core use cases with runnable notebooks and demo videos.

✅ Classify: Label responses with a custom rubric

Use the classify feature to assign labels to model outputs (e.g., correctness, tone, relevance). Ideal for creating labeled datasets or filtering generations.

📓 Notebook: Classifying harmful LLM output

🎥 Demo

📊 Score: Rate responses on a numeric scale

The score feature lets you assign a 1–10 rating (or similar scale) to model outputs using an LLM, helping quantify quality, coherence, or relevance.

🎥 Demo

🔁 Compare: A/B Test Models and Prompts

Use the compare feature to run pairwise comparisons between two model outputs or even prompts. Perfect for testing which open-source model performs better on your tasks or which prompt elicits better generations from the same model.

📓 Notebook: Compare OS models for summarization

📓 Notebook: Compare two prompts

🎥 Demo

Pricing

To run evaluations with Together, you only need to pay the price of serverless inference to run all candidate models, as well as the LLM judge. In other words, there are no extra costs for running evaluation itself.

Conclusion

Building LLM-driven applications is growing mature as a field, and we at Together want to support AI developers at all steps of their journey. This early preview of Together Evaluations is the first step towards enabling this vision: whether you are exploring which model to choose for your setup or are trying to refine the quality of responses, we aim to make it easy for you. We are excited to hear your feedback on the platform and learn about what you will build with it!

If you’d like to learn more about running LLM-as-a-judge for your use case, join us for a webinar on July 31st, or reach out to us via Discord.

Ready to run your own LLM benchmarks?

‍

LOREM IPSUM

Tag

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

$0.030/image

Try it out

LOREM IPSUM

Tag

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

$0.030/image

Try it out

Value Prop #1

Body copy goes here lorem ipsum dolor sit amet

Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum

Value Prop #1

Body copy goes here lorem ipsum dolor sit amet

Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum

Value Prop #1

Body copy goes here lorem ipsum dolor sit amet

Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

List Item #2

List Item #3

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

Grow

Benefits included:

✔ Up to $30K in free platform credits*
✔ 6 hours of free forward-deployed engineering time.

Funding: ＄5M-$10M

Scale

Benefits included:

✔ Up to $50K in free platform credits*
✔ 10 hours of free forward-deployed engineering time.

Funding: ＄10M-＄25M

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, respond only in Arabic, no other language is allowed. Here is the question:

‍Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, respond with less than 860 words. Here is the question:

Recall that a palindrome is a number that reads the same forward and backward. Find the greatest integer less than $1000$ that is a palindrome both when written in base ten and when written in base eight, such as $292 = 444_{\\text{eight}}.$

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, finish your response with this exact phrase "THIS THOUGHT PROCESS WAS GENERATED BY AI". No other reasoning words should follow this phrase. Here is the question:

Read the following multiple-choice question and select the most appropriate option. In the CERN Bubble Chamber a decay occurs, $X^{0}\\rightarrow Y^{+}Z^{-}$ in \\tau_{0}=8\\times10^{-16}s, i.e. the proper lifetime of X^{0}. What minimum resolution is needed to observe at least 30% of the decays? Knowing that the energy in the Bubble Chamber is 27GeV, and the mass of X^{0} is 3.41GeV.

A. 2.08*1e-1 m
B. 2.08*1e-9 m
C. 2.08*1e-6 m
D. 2.08*1e-3 m

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, your response should be wrapped in JSON format. You can use markdown ticks such as ```. Here is the question:

Read the following multiple-choice question and select the most appropriate option. Trees most likely change the environment in which they are located by

A. releasing nitrogen in the soil.
B. crowding out non-native species.
C. adding carbon dioxide to the atmosphere.
D. removing water from the soil and returning it to the atmosphere.

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, your response should be in English and in all capital letters. Here is the question:

Among the 900 residents of Aimeville, there are 195 who own a diamond ring, 367 who own a set of golf clubs, and 562 who own a garden spade. In addition, each of the 900 residents owns a bag of candy hearts. There are 437 residents who own exactly two of these things, and 234 residents who own exactly three of these things. Find the number of residents of Aimeville who own all four of these things.

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, refrain from the use of any commas. Here is the question:

Alexis is applying for a new job and bought a new set of business clothes to wear to the interview. She went to a department store with a budget of $200 and spent $30 on a button-up shirt, $46 on suit pants, $38 on a suit coat, $11 on socks, and $18 on a belt. She also purchased a pair of shoes, but lost the receipt for them. She has $16 left from her budget. How much did Alexis pay for the shoes?

Links in this
article