Together Evaluations: Benchmark Models for Your Tasks
When building products with large language models, knowing how well a model can solve your task is crucial, especially with the rapid rate at which new LLMs appear. Without a structured approach to evaluation, one has to decide ad hoc, making it challenging to keep up and choose the most suitable model for your use case.
One effective approach is to define a task-specific benchmark and use a strong AI model as a judge. This approach lets you quickly compare responses across models or rate the performance of a single model without the overhead of manual labeling or rigid metrics.
To support this workflow, we are releasing an early preview of Together Evaluations — a fast, flexible way to benchmark LLM response quality using leading open-source judge models you control.
Join us for a practical walkthrough on July 31st where we’ll demonstrate how to use Together Evaluations for your use case. We’ll also show how to choose between new SOTA models including Kimi, Qwen, and GLM.
Why use LLM-as-a-judge?
The core part of any AI application is its underlying LLM, so it's no surprise that the quality of a model’s responses can define the success of a product. But how can we measure the performance of a model for a given task? The whole field of language model evaluation aims to answer this question, with methods ranging from human annotation to automated metrics.
In recent years, the capabilities of LLMs have become strong enough that models themselves can act as evaluators! Compared to the traditional approach of human annotation, using LLM-as-a-judge is much faster — you just need to run the model with every evaluated sample as a separate prompt, which is orders of magnitude faster compared to training annotators and waiting for their labeling. Compared to algorithmic metrics (like BLEU for translation or F1 score for classification), LLM judges are much more flexible: you can measure anything that you can describe with natural language. Lastly, LLM-as-a-judge metrics are very easy to modify if your task definition slightly changes, as you simply need to adjust the system prompt.
Evaluations with Together
Together Evaluations provides a powerful framework for defining custom LLM benchmarks and comparing different models on those benchmarks using LLM judges. Our platform offers three evaluation modes, each powered by an LLM that you fully control with prompt templates:
Each evaluation returns both aggregate metrics (accuracy, mean score, or win rates) and row‑level judge feedback that you can inspect. Here are a few example tasks you can accomplish with the platform:
Today, every language model provided via our serverless inference API is supported as a candidate for evaluation, and we offer a selection of the strongest LLMs-as-judge-models. If you already have a dataset of model generations (for example, obtained locally or via another LLM provider), you can upload it and score them with the judge without making redundant inference calls!
Step-by-step guide
- Upload your data in the JSONL or the CSV format
- Pick the evaluation type among
classify
,score
, orcompare
- Describe what you want to evaluate by giving the judge a
system_template
in a Jinja2 format- You can specify a reference answer from the dataset using the following Jinja template: “Please use the reference answer: {{reference_answer_column_name}}”
- Configure the model to be evaluated:
- Use one of the available models
- Configure
input_template
with the reference to the column in the dataset that contains the prompt. Example: “Answer the following: {{prompt_column}}”
- Retrieve the results! We provide both aggregated evaluation metrics and a resulting file with full feedback of judge responses.
You can use the UI, the API, or the Python client to submit evaluation requests. The full documentation is available here.
To showcase how to use Together Evaluations, we have prepared several demonstrations that showcase practical scenarios for LLM-as-a-judge workflows. You can find these demonstrations in the form of Jupyter notebooks here.
Evaluating with LLM-as-a-Judge: Demos & Notebooks
Explore how to use the Evals functionality to assess model outputs with LLMs acting as judges. Below are some core use cases with runnable notebooks and demo videos.
✅ Classify: Label responses with a custom rubric
Use the classify
feature to assign labels to model outputs (e.g., correctness, tone, relevance). Ideal for creating labeled datasets or filtering generations.
📓 Notebook: Classifying harmful LLM output
🎥 Demo
📊 Score: Rate responses on a numeric scale
The score
feature lets you assign a 1–10 rating (or similar scale) to model outputs using an LLM, helping quantify quality, coherence, or relevance.
🎥 Demo
🔁 Compare: A/B Test Models and Prompts
Use the compare feature to run pairwise comparisons between two model outputs or even prompts. Perfect for testing which open-source model performs better on your tasks or which prompt elicits better generations from the same model.
📓 Notebook: Compare OS models for summarization
📓 Notebook: Compare two prompts
🎥 Demo
Pricing
To run evaluations with Together, you only need to pay the price of serverless inference to run all candidate models, as well as the LLM judge. In other words, there are no extra costs for running evaluation itself.
Conclusion
Building LLM-driven applications is growing mature as a field, and we at Together want to support AI developers at all steps of their journey. This early preview of Together Evaluations is the first step towards enabling this vision: whether you are exploring which model to choose for your setup or are trying to refine the quality of responses, we aim to make it easy for you. We are excited to hear your feedback on the platform and learn about what you will build with it!
If you’d like to learn more about running LLM-as-a-judge for your use case, join us for a webinar on July 31st, or reach out to us via Discord.
Ready to run your own LLM benchmarks?
- 📄 Read the documentation
- 🖥 Try the Evaluations UI
- 📓 Read the tutorial notebooks
- 🧑🏫 Sign up for the workshop
- 💬 Questions? Join our Discord
- Lower
Cost20% - faster
training4x - network
compression117x
Q: Should I use the RedPajama-V2 Dataset out of the box?
RedPajama-V2 is conceptualized as a pool of data that serves as a foundation for creating high quality datasets. The dataset is thus not intended to be used out of the box and, depending on the application, data should be filtered out using the quality signals that accompany the data. With this dataset, we take the view that the optimal filtering of data is dependent on the intended use. Our goal is to provide all the signals and tooling that enables this.
article