Model Library

How to choose the right open model for production

January 8, 2026

By Nicholas Broad, Dan Waters

How do you choose the right open model for your workload?

With over 2 million open models on Hugging Face, it’s hard to know which one is best for any specific task. There are countless leaderboards and benchmarks comparing dozens of models, and noteworthy new releases arrive practically every week. How can anyone possibly know which model they should use, or even where to begin the selection process?

Why choose an open model for your workload?

There are many reasons why choosing an open vs. closed model makes sense for enterprise tasks, including transparency, adaptability, and control.

Open models are transparent because their weights and architecture, and often details of their training data, are published, making them fitting candidates for introspection and analysis of their decision making. Transparency helps identify the sources of issues like overfitting and bias, which can help organizations gradually build confidence in AI decision making over time if they choose to adapt the model.

Open models are adaptable because fine-tuning techniques from the research community can be applied. Proprietary adaptation methods for closed models may not provide the same level of customization or alignment as a collection of open post-training approaches, which might include supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL).

Finally, open models provide an additional level of control to organizations. Open models can be run anywhere and are not tied to any proprietary architecture or stack. This enables your team to innovate upon open community research, maintaining full ownership and auditability over the model artifacts created along the way. When you own your AI, you’re more invested in building for the future needs of your organization.

Legal considerations for open models

Some open models have strict licensing requirements that disqualify them from commercial or production use. For some companies, this means only using models with Apache-2.0 or MIT licenses. For others, it could mean only using models made in the U.S. or France. Be sure to consult with your legal team before starting your model selection process, because this will limit the field of candidate models to consider. There is a table in the next section with relevant information about licenses and region of origin. The Llama license is the most restrictive, but all are generally permissive for commercial purposes.

Closed vs. open models: Task comparison

It’s common to begin the model selection process with experience using proprietary models like GPT-5, Claude Opus, or Gemini 3. These providers typically offer three tiers, trading off speed and cost versus capability (e.g. GPT-5 pro, GPT-5 mini, GPT-5 nano). We will refer to these three tiers as low, medium, and high, suggesting similarly capable open models to try. The low tier is the cheapest, fastest, and least capable, whereas the high tier is the most expensive, slowest, and most capable.

Model size & capabilities

Here are some rough guidelines for narrowing down your model selection by parameter size, based on current performance of closed-model systems:

  • If the task requires a high-tier closed model, the open model equivalent should be at least 300B total parameters.
  • For the medium tier, the open model should be between 70B and 250B total parameters.
  • For the low tier, the model should likely be under 32B total parameters. After fine-tuning, an even smaller model can sometimes be used, but this is not always the case.

These are simple guidelines to use as a starting point, and proper evaluation is still required.
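The guidelines above can be captured in a small helper. This is a hypothetical sketch: the tier names and thresholds simply encode the rules of thumb from this post, not hard requirements.

```python
# Hypothetical helper encoding the rough tier-to-size guidelines above.
# Ranges are in billions of total parameters; None means no upper bound.

def suggested_param_range(tier):
    """Map a proprietary-model tier to a rough open-model size range."""
    ranges = {
        "low": (0, 32),       # under 32B total parameters
        "medium": (70, 250),  # 70B-250B total parameters
        "high": (300, None),  # at least 300B total parameters
    }
    return ranges[tier]

print(suggested_param_range("medium"))  # (70, 250)
```

Treat the result as a starting point for shortlisting candidates, not a substitute for evaluation.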

The following model families are recommended for general-purpose evaluation versus proprietary models:

Name         | Country of Origin | Available Sizes | License
-------------|-------------------|-----------------|-------------
Deepseek     | China             | 650B            | MIT
Kimi         | China             | 1T              | Modified MIT
Qwen         | China             | 1B – 480B       | Apache-2.0
GLM          | China             | 100B – 350B     | MIT
Meta Llama   | U.S.              | 7B – 400B       | Llama
Google Gemma | U.S.              | 1B – 27B        | Gemma
Mistral      | France            | 3B – 675B       | Apache-2.0

Plotting model quality versus parameter size gives us a clear view into the performance of each model family at its various available parameter sizes.

Tradeoffs to consider during model selection

When choosing a model, there are three main dimensions to weigh: cost, speed, and quality. Cost and quality are directly related, and both are determined primarily by the size of the model: the larger the model, the higher the expected quality of its output, and the more expensive it is to run. Speed trades off against quality: the higher the quality, the lower the speed. For instance, here are three potential configurations for the same task:

  • Deepseek-v3.2
    • Running on 8x GB200s (most costly, $$$)
    • Max load: 2 RPS
    • 99% accuracy
  • Qwen3-Next-80B-A3B-Instruct
    • Running on 4x H200s (good price-performance mix, $$)
    • Max load: 10 RPS
    • 96% accuracy
  • Ministral-3-3B-Instruct-2512
    • Running on 2x H100s (cheapest and fastest, but lowest accuracy, $)
    • Max load: 20 RPS
    • 94% accuracy
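The selection logic implied by these tradeoffs can be sketched as a filter-then-minimize step: keep only configurations that meet your quality and throughput floors, then take the cheapest survivor. The cost values below are placeholder relative units mirroring the $ tiers above, not real prices.

```python
# Illustrative sketch: pick the cheapest configuration that meets the
# quality and throughput requirements. Numbers mirror the example above;
# "cost" is a placeholder relative unit, not a real price.

configs = [
    {"model": "Deepseek-v3.2",               "cost": 3, "rps": 2,  "accuracy": 0.99},
    {"model": "Qwen3-Next-80B-A3B-Instruct", "cost": 2, "rps": 10, "accuracy": 0.96},
    {"model": "Ministral-3-3B-Instruct-2512","cost": 1, "rps": 20, "accuracy": 0.94},
]

def pick(configs, min_accuracy, min_rps):
    """Return the cheapest config meeting both floors, or None."""
    viable = [c for c in configs
              if c["accuracy"] >= min_accuracy and c["rps"] >= min_rps]
    return min(viable, key=lambda c: c["cost"]) if viable else None

best = pick(configs, min_accuracy=0.95, min_rps=5)
print(best["model"])  # Qwen3-Next-80B-A3B-Instruct
```

If no configuration survives the filter, that is itself a useful signal: your requirements may force a larger model, more hardware, or fine-tuning.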

The best option for you depends on your requirements for quality, cost margins, and throughput. It’s worth taking the time to thoroughly define these requirements; they will be instrumental in guiding your selection.

Evaluating the model

While a given model might have published academic benchmarks, both your data and task are likely to be unique. Success in AI initiatives requires disciplined experimentation and a clear understanding of the metrics you’ve chosen to represent performance. Techniques like LLM-as-a-judge evaluations enable a better approximation of performance on famously difficult-to-evaluate LLM tasks, like numerical scoring, classification, and open-ended comparison. Other tasks, like reranking, can be evaluated with traditional ML metrics like accuracy, precision, and nDCG. To set yourself up for success, ensure you’ve done the work to understand what these metrics mean, select the appropriate ones for your use case, create a golden dataset, and gain confidence in how to run those evaluations before directly evaluating models. This delivers clarity now and saves time later.
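To make the traditional-ML metrics concrete, here is a minimal sketch computing accuracy and precision by hand against a golden dataset, so the definitions are explicit. The labels are toy data invented for illustration.

```python
# Minimal sketch of evaluation against a golden dataset: accuracy and
# precision computed by hand. Labels are toy data, not real results.

golden      = ["relevant", "relevant",   "irrelevant", "relevant", "irrelevant"]
predictions = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant"]

# Accuracy: fraction of predictions that match the golden label.
accuracy = sum(g == p for g, p in zip(golden, predictions)) / len(golden)

# Precision: of everything predicted "relevant", how much truly was.
tp = sum(g == p == "relevant" for g, p in zip(golden, predictions))
fp = sum(p == "relevant" and g != "relevant" for g, p in zip(golden, predictions))
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.2f} precision={precision:.2f}")  # accuracy=0.60 precision=0.67
```

In practice you would pull `golden` from your curated dataset and `predictions` from model traces; the arithmetic stays the same.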

For most use cases, the minimum viable LLM should be used to ensure high cost margins, low latency, fast generation speeds, and acceptable quality. Analyzing the failure modes of the system is a virtuous cycle: discover issues, make changes, re-run evaluations, and repeat.

Here’s the painful truth: when evaluating the output, some level of manual review is still necessary. It can be tedious, but the benefits are worth the time commitment. Manual review gets you closer to understanding the real ways the model fails, and will ultimately be the fastest way to determine the model’s quality. It also helps you uncover blind spots in your evaluation process.

The simplest system for doing evaluations is:

  1. Review traces
  2. Cluster common failures into groups
  3. Keep going until 100 samples are labeled as pass/fail
  4. Modify prompt to fix most prominent errors (providing positive and negative examples)
  5. Repeat, ideally using automated methods
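Steps 2–3 of the loop above can be sketched as a simple tally over labeled traces. The trace record shape and failure-mode names here are hypothetical placeholders for whatever your logging produces.

```python
# Hedged sketch of clustering labeled traces into failure groups.
# The {"passed": ..., "failure_mode": ...} record shape is an assumption.

from collections import Counter

def summarize(traces, n=100):
    """Tally pass rate and failure clusters over the first n labeled traces."""
    reviewed = traces[:n]
    failures = Counter(t["failure_mode"] for t in reviewed if not t["passed"])
    pass_rate = sum(t["passed"] for t in reviewed) / len(reviewed)
    return pass_rate, failures.most_common()

traces = (
    [{"passed": True,  "failure_mode": None}] * 80
    + [{"passed": False, "failure_mode": "ignored instructions"}] * 12
    + [{"passed": False, "failure_mode": "hallucinated citation"}] * 8
)
rate, clusters = summarize(traces)
print(rate, clusters[0])  # 0.8 ('ignored instructions', 12)
```

The most common cluster is the one to target first with prompt changes in step 4.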

Here are some helpful tips:

  • Be sure to use a representative set of samples so the failures are also representative of the likely errors to occur.
  • Be sure to do the first pass manually to gain real experience about how the system is failing.
  • After manual review, consider developing a rubric indicating what conditions would make the response pass/fail. The failure modes identified earlier should be mentioned in the rubric. For example, if the query mentions X and the model doesn’t do Y, then this is a failure. A few examples can be included in the prompt to further guide the model about what is considered passing and failing. Guiding the model to output JSON makes it easy to aggregate results at the end.
  • Automated LLM-as-a-judge evals can be run through the Together platform.
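Asking the judge for JSON makes aggregation trivial, as the tips above note. Here is a sketch of that aggregation step; the judge call itself is stubbed out as canned strings, and the `verdict`/`reason` schema is an assumption you would define in your rubric prompt, not a fixed API.

```python
# Sketch of aggregating JSON verdicts from an LLM judge. The judge outputs
# below are canned stand-ins; the "verdict"/"reason" schema is an assumed
# convention set by your rubric prompt.

import json

raw_judge_outputs = [
    '{"verdict": "pass", "reason": "followed rubric"}',
    '{"verdict": "fail", "reason": "query mentioned X but model did not do Y"}',
    '{"verdict": "pass", "reason": "correct output format"}',
]

verdicts = [json.loads(o)["verdict"] for o in raw_judge_outputs]
pass_rate = verdicts.count("pass") / len(verdicts)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 67%
```

In a real run you would also handle malformed JSON from the judge (a `try`/`except` around `json.loads`) and log the `reason` fields for the manual-review loop.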

Fine-tuning for last-mile task adaptation

One major benefit of open models is that they can be improved by fine-tuning them for a specific use case. Sometimes, you’ll discover a fast model that is very close in quality for most tasks, but too deficient in others to adopt in production. Fine-tuning or preference optimization may be your solution, as you’re creating a servable artifact attuned to your data and task. This is especially compelling if you already have logged conversations or other production data. The investment in most LoRA SFT fine-tuning experiments is relatively small, and running many experiments with both LoRA and DPO is supported on the Together AI platform today.

FAQ

We’ve discussed many important topics to consider when evaluating open models. Open models are incredibly powerful due to their cost, broad community support, and adaptability through fine-tuning and other model shaping techniques.

Why should I select an open model for my workload?

Open models offer transparency, adaptability, and control.

What legal considerations are relevant when selecting an open model?

The license of the model and its country of origin are examples of legal considerations you may encounter during model selection. Ensure you’re aware of any internal licensing or provenance requirements.

How do I select a model for a specific task?

The table earlier in this post compares the major open model families, and the guidelines above map low, medium, and high closed-model tiers to parameter-count ranges. Remember the tradeoffs involved: larger models are generally smarter but slower and more expensive to run, while smaller models are usually less capable but significantly faster and cheaper.

How should I evaluate open-source models?

While academic benchmarks are helpful to generally describe a model’s capabilities in some domain or task, understanding what metrics apply to your successful outcomes is the first and most important step. Understand the use case and the metrics that commonly apply to it, and become comfortable regularly running those evaluations and parsing their results.

How can I adapt a model to get better at a task?

Fine-tuning techniques such as LoRA SFT, full fine-tuning, or direct preference optimization can guide a well-performing model to cover gaps in performance. Existing evaluation processes make fine-tuning very powerful, because you can quickly test model quality with a held-out dataset.
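A prerequisite for honestly measuring fine-tuning gains is a held-out evaluation slice that the training run never sees. Here is a minimal, illustrative split helper; the 80/20 ratio and fixed seed are arbitrary choices, and the sample names are placeholders for your logged conversations.

```python
# Illustrative held-out split for comparing base vs. fine-tuned models.
# The 80/20 ratio and seed are arbitrary; samples are placeholders.

import random

def train_eval_split(samples, eval_fraction=0.2, seed=42):
    """Shuffle deterministically, then carve off a held-out eval slice."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cut], shuffled[cut:]

samples = [f"conversation_{i}" for i in range(100)]
train, held_out = train_eval_split(samples)
print(len(train), len(held_out))  # 80 20
```

Run your existing evaluation pipeline on `held_out` before and after fine-tuning; the delta is your measured gain.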
