Together Evaluations: Benchmark Models for Your Tasks
When building products with large language models, knowing how well a model can solve your task is crucial, especially with the rapid rate at which new LLMs appear. Without a structured approach to evaluation, you are left making ad hoc judgments, which makes it hard to keep up with new releases and choose the most suitable model for your use case.
One effective approach is to define a task-specific benchmark and use a strong AI model as a judge. This approach lets you quickly compare responses across models or rate the performance of a single model without the overhead of manual labeling or rigid metrics.
To support this workflow, we are releasing an early preview of Together Evaluations — a fast, flexible way to benchmark LLM response quality using leading open-source judge models you control.
Join us for a practical walkthrough on July 31st where we’ll demonstrate how to use Together Evaluations for your use case. We’ll also show how to choose between new SOTA models including Kimi, Qwen, and GLM.
Why use LLM-as-a-judge?
The core part of any AI application is its underlying LLM, so it's no surprise that the quality of a model’s responses can define the success of a product. But how can we measure the performance of a model for a given task? The whole field of language model evaluation aims to answer this question, with methods ranging from human annotation to automated metrics.
In recent years, LLMs have become capable enough to act as evaluators themselves. Compared to the traditional approach of human annotation, LLM-as-a-judge is much faster: you simply run the judge on every evaluated sample as a separate prompt, which is orders of magnitude quicker than training annotators and waiting for their labels. Compared to algorithmic metrics (like BLEU for translation or F1 score for classification), LLM judges are far more flexible: you can measure anything you can describe in natural language. Lastly, LLM-as-a-judge metrics are easy to update when your task definition changes slightly, since you only need to adjust the system prompt.
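To make the idea concrete, here is a minimal hand-rolled judge loop using Together's OpenAI-compatible chat API. The judge model, rubric, and sample data are illustrative, and this is exactly the boilerplate that Together Evaluations handles for you:

```python
# A minimal LLM-as-a-judge loop (model name and rubric are illustrative).
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

JUDGE_SYSTEM_PROMPT = """You are an impartial judge. Rate the response to the
user's question on a scale of 1-10 for factual accuracy and clarity.
Reply with a single integer only."""

def judge_response(question: str, response: str) -> int:
    """Score one (question, response) pair with the judge model."""
    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example judge model
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    # A sketch: a production judge would parse/validate the output more carefully.
    return int(completion.choices[0].message.content.strip())

samples = [("What is 2 + 2?", "4")]
scores = [judge_response(q, r) for q, r in samples]
print(f"mean judge score: {sum(scores) / len(scores):.2f}")
```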
Evaluations with Together
Together Evaluations provides a powerful framework for defining custom LLM benchmarks and comparing models on them using LLM judges. Our platform offers three evaluation modes, each powered by a judge LLM that you fully control with prompt templates:
- `classify`: assign a categorical label to each response based on a rubric you define
- `score`: rate each response on a numeric scale
- `compare`: run pairwise A/B comparisons between two models or two prompts
Each evaluation returns both aggregate metrics (accuracy, mean score, or win rates) and row-level judge feedback that you can inspect. The demo section below walks through example tasks you can accomplish with each mode.
Today, every language model available via our serverless inference API is supported as a candidate for evaluation, and we offer a selection of the strongest LLMs to serve as judges. If you already have a dataset of model generations (for example, produced locally or via another LLM provider), you can upload it and have the judge score the responses directly, without making redundant inference calls!
Step-by-step guide
- Upload your data in JSONL or CSV format (a sample file is shown after this list)
- Pick the evaluation type: `classify`, `score`, or `compare`
- Describe what you want to evaluate by giving the judge a `system_template` in Jinja2 format
  - You can reference a column from your dataset, such as a reference answer, using a Jinja template: "Please use the reference answer: {{reference_answer_column_name}}"
- Configure the model to be evaluated:
  - Use one of the available models
  - Set `input_template` with a reference to the dataset column that contains the prompt. Example: "Answer the following: {{prompt_column}}"
- Retrieve the results! We provide both aggregated evaluation metrics and a results file with the judge's full feedback on every row.
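Here is what a minimal input file might look like. The column names `prompt_column` and `reference_answer_column_name` match the template examples above; in practice you would use whatever column names appear in your own dataset:

```jsonl
{"prompt_column": "Summarize in one sentence: The quick brown fox ...", "reference_answer_column_name": "A quick fox jumps over a lazy dog."}
{"prompt_column": "Summarize in one sentence: Large language models ...", "reference_answer_column_name": "LLMs are neural networks trained on large text corpora."}
```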
You can use the UI, the API, or the Python client to submit evaluation requests. The full documentation is available here.
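As a sketch of what a submission could look like with the Python client, under the assumption that the evaluation-specific method and parameter names below exist as written (they are illustrative; the documentation linked above has the authoritative interface):

```python
# Hypothetical sketch of submitting a "score" evaluation with the Python client.
# client.evaluation.create() and its argument names are assumptions for
# illustration, not a confirmed API surface.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload the JSONL dataset shown above.
dataset = client.files.upload(file="eval_data.jsonl")

# Submit the evaluation job.
job = client.evaluation.create(
    type="score",
    input_data_file_path=dataset.id,
    model_to_evaluate="Qwen/Qwen2.5-72B-Instruct-Turbo",  # example candidate
    judge_model="deepseek-ai/DeepSeek-V3",                # example judge
    input_template="Answer the following: {{prompt_column}}",
    system_template=(
        "Rate the answer from 1 to 10 for factual accuracy. "
        "Please use the reference answer: {{reference_answer_column_name}}"
    ),
)
print(job)  # poll or fetch the results file once the job completes
```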
To show Together Evaluations in action, we have prepared several demonstrations covering practical LLM-as-a-judge scenarios. You can find them as Jupyter notebooks here.
Evaluating with LLM-as-a-Judge: Demos & Notebooks
Explore how to use the Evals functionality to assess model outputs with LLMs acting as judges. Below are some core use cases with runnable notebooks and demo videos.
✅ Classify: Label responses with a custom rubric
Use the classify feature to assign labels to model outputs (e.g., correctness, tone, relevance). Ideal for creating labeled datasets or filtering generations.
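A rubric for a safety-labeling task might look something like the following; the label set and the `response` column name are illustrative, not the notebook's exact prompt:

```python
# Illustrative classify rubric as a Jinja2 system template.
# "response" is an assumed dataset column name.
system_template = """You are a content safety reviewer. Read the model output
below and assign exactly one label: "safe", "borderline", or "harmful".
Reply with the label only.

Model output: {{response}}"""
```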
📓 Notebook: Classifying harmful LLM output
🎥 Demo
📊 Score: Rate responses on a numeric scale
The score feature lets you assign a 1–10 rating (or similar scale) to model outputs using an LLM, helping quantify quality, coherence, or relevance.
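Once a score run finishes, the row-level results file makes it easy to aggregate or slice the ratings yourself. A small sketch, assuming the results are JSONL with a numeric `score` field per row (the exact schema may differ):

```python
# Compute summary statistics from a downloaded score-results file.
# The "score" field name is an assumption about the results schema.
import json
from statistics import mean

with open("score_results.jsonl") as f:
    scores = [json.loads(line)["score"] for line in f]

print(f"mean: {mean(scores):.2f}, "
      f"share rated 8 or higher: {sum(s >= 8 for s in scores) / len(scores):.0%}")
```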
🎥 Demo
🔁 Compare: A/B Test Models and Prompts
Use the compare feature to run pairwise comparisons between two model outputs or even prompts. Perfect for testing which open-source model performs better on your tasks or which prompt elicits better generations from the same model.
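For compare runs, the judge sees two candidate outputs and your criteria decide the winner. An illustrative pairwise rubric (how the two candidates are injected into the prompt is platform-specific; this only sketches the judging criteria):

```python
# Illustrative compare rubric: the judge picks the better of two candidates.
system_template = """You will see two candidate summaries of the same article.
Choose the one that is more faithful to the source and more concise.
Answer "A" or "B" only."""
```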
📓 Notebook: Compare OS models for summarization
📓 Notebook: Compare two prompts
🎥 Demo
Pricing
To run evaluations with Together, you only pay the standard serverless inference price for the candidate models and the LLM judge. In other words, there is no extra charge for the evaluation itself.
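As a back-of-the-envelope sketch of what that means in practice (the token counts and per-token prices below are placeholder assumptions, not actual Together pricing):

```python
# Rough cost estimate for one evaluation run; all numbers are placeholders.
rows = 1_000
candidate_tokens_per_row = 800      # prompt + completion, assumed
judge_tokens_per_row = 1_200        # judge sees prompt, response, and rubric

CANDIDATE_PRICE = 0.90 / 1_000_000  # $/token, hypothetical
JUDGE_PRICE = 1.20 / 1_000_000      # $/token, hypothetical

cost = rows * (candidate_tokens_per_row * CANDIDATE_PRICE
               + judge_tokens_per_row * JUDGE_PRICE)
print(f"Estimated cost: ${cost:.2f}")  # ~$2.16 under these assumptions
```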
Conclusion
Building LLM-driven applications is maturing as a field, and we at Together want to support AI developers at every step of their journey. This early preview of Together Evaluations is the first step towards that vision: whether you are exploring which model to choose for your setup or refining the quality of responses, we aim to make it easy for you. We are excited to hear your feedback on the platform and to learn what you build with it!
If you’d like to learn more about running LLM-as-a-judge for your use case, join us for a webinar on July 31st, or reach out to us via Discord.
Ready to run your own LLM benchmarks?
- 📄 Read the documentation
- 🖥 Try the Evaluations UI
- 📓 Read the tutorial notebooks
- 🧑🏫 Sign up for the workshop
- 💬 Questions? Join our Discord