Announcing Together Inference Engine – the fastest inference available
The Together Inference Engine is multiple times faster than any other inference service, with 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat
Today we are announcing the Together Inference Engine, the world's fastest inference stack. It is up to 3x faster than TGI or vLLM when running on the same hardware, and up to 2x faster than other serverless APIs (e.g., Perplexity, Anyscale, Fireworks AI, or MosaicML). This means the most demanding generative AI applications in the world can now deliver a much faster user experience with greater efficiency and lower cost.
The Together Inference Engine is built on CUDA and runs on NVIDIA Tensor Core GPUs. Over the past several months, our team and collaborators have released a number of techniques that optimize inference performance, including FlashAttention-2, Flash-Decoding, and Medusa, all available as open source and incorporated into many libraries. We have combined these techniques with our own optimizations, and today we are excited to announce the Together Inference Engine.
At Together AI, we're focused on providing the fastest cloud for generative AI. Since launching, over 10,000 users have signed up, and many production applications are now built on Together Inference. Now, with the fastest inference engine, we are excited to bring blazing fast performance to AI developers and enterprises around the world.
Performance
To measure the performance of the Together Inference Engine transparently, we leveraged the new open-source LLMPerf benchmarking harness released by Anyscale. First, we use the default LLMPerf settings, which send inputs averaging 500 tokens and generate on average 150 output tokens.
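For readers who want to sanity-check numbers like these against their own endpoint, the sketch below illustrates, in simplified form, the kind of measurement such a harness performs: time one completion and divide output tokens by elapsed seconds. The endpoint URL, model identifier, and response schema are assumptions, not a documented contract, and whitespace splitting is only a crude token count; for real measurements, use Anyscale's LLMPerf harness directly.

```python
import time
import requests

# Placeholders -- substitute your own endpoint, model identifier, and API key.
API_URL = "https://api.together.xyz/inference"   # assumed endpoint path
API_KEY = "YOUR_API_KEY"
MODEL = "togethercomputer/llama-2-70b-chat"      # assumed model identifier

def rough_tokens_per_second(prompt: str, max_tokens: int = 150) -> float:
    """Time one completion request and return approximate output tokens/sec.

    A simplified proxy for what LLMPerf measures: the real harness streams
    responses to separate time-to-first-token from inter-token latency.
    """
    start = time.time()
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=120,
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    # Whitespace split is a crude token count; a real benchmark uses the
    # model's tokenizer or the token counts reported by the API.
    text = resp.json()["output"]["choices"][0]["text"]
    return len(text.split()) / elapsed

print(f"~{rough_tokens_per_second('Summarize attention in one paragraph.'):.0f} tokens/sec")
```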
Together Inference Engine compared to TGI and vLLM with default LLMPerf settings

Together Inference API compared to Perplexity and Anyscale APIs with default LLMPerf settings

Second, we use modified LLMPerf settings, which send inputs averaging 700 tokens and generate on average 550 output tokens.

Together Inference API compared to Perplexity and Anyscale APIs with modified LLMPerf settings
Quality
These performance improvements come without any compromise to quality: they do not rely on techniques like quantization, which can change model behavior, even if only modestly.
The following table shows the results of several accuracy benchmarks. Together Inference achieves results in line with the reference Hugging Face implementation.

Features
We've received incredible feedback from our customers, including requests for more flexibility in how they use the Together Inference API. So today we are also introducing a number of new features.
Serverless Endpoints
Over 50 leading open-source models are hosted for you through Serverless Endpoints, including Llama-2, RedPajama, Falcon, and Stable Diffusion XL. We continue to curate the best models to feature, adding new leading models as they become available.
With Serverless Endpoints, capacity is added and scaled automatically based on your traffic, eliminating the need to choose instance types. You pay only for what you use, billed on the sum of input and output tokens.
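As an illustration, calling a Serverless Endpoint is a single HTTP request with no provisioning step. This is a minimal sketch: the endpoint path, payload fields, and model name are assumptions based on a typical REST inference API, so consult the Together API documentation for the authoritative schema.

```python
import os
import requests

# Minimal sketch of calling a Serverless Endpoint over HTTP.
# URL, JSON fields, and model name are assumptions; see the Together docs.
resp = requests.post(
    "https://api.together.xyz/inference",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "togethercomputer/llama-2-70b-chat",
        "prompt": "Write a haiku about fast inference.",
        "max_tokens": 64,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```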
Dedicated Instances
Deploy Dedicated Instances for over 100 popular open-source models, your own fine-tuned model, or any proprietary model. Dedicated Instances are designed to match your traffic needs: you choose the hardware configuration, control the number of instances deployed, and set how many instances auto-scaling may add.
The Together Inference Engine also dynamically optimizes between faster tokens per second and higher overall throughput. You can configure the max batch size to trade off lower latency for your end users against higher overall capacity from the instance, as the toy model below illustrates.
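To build intuition for that trade-off, the following toy model shows how larger batches raise aggregate throughput while stretching per-request latency. Every constant is an illustrative assumption, not a measured Together number.

```python
# Toy model of the latency/throughput trade-off when batching requests.
# All constants are illustrative assumptions, not measured Together numbers.

PER_TOKEN_MS = 9.0    # decode step time at batch size 1 (assumed)
SLOWDOWN = 0.35       # fractional step slowdown per doubling of batch size (assumed)
OUTPUT_TOKENS = 150   # tokens generated per request

for batch_size in (1, 2, 4, 8, 16):
    doublings = batch_size.bit_length() - 1     # log2 for powers of two
    step_ms = PER_TOKEN_MS * (1 + SLOWDOWN) ** doublings
    latency_s = step_ms * OUTPUT_TOKENS / 1000  # per-request completion time
    throughput = batch_size * 1000 / step_ms    # aggregate tokens/sec across the batch
    print(f"batch={batch_size:2d}  latency={latency_s:5.2f}s  throughput={throughput:6.0f} tok/s")
```

Under these assumed numbers, batch size 16 roughly quintuples aggregate throughput while tripling per-request latency; the max batch size setting lets you pick where on that curve your workload should sit.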

Auto-scaling
Both Serverless Endpoints and Dedicated Instances can now be configured with auto-scaling, so additional hardware is provisioned automatically when API volumes exceed the capacity of the currently deployed hardware. When API volumes drop, auto-scaling scales back down to your minimum configured instances, saving you costs when running Dedicated Instances.
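Concretely, a Dedicated Instance deployment with auto-scaling comes down to a handful of knobs. The sketch below is purely hypothetical; the field names and hardware string are ours, not Together's actual schema.

```python
# Hypothetical configuration for a Dedicated Instance with auto-scaling.
# Field names and hardware string are illustrative only, not Together's schema.
deployment = {
    "model": "togethercomputer/llama-2-70b-chat",
    "hardware": "2x A100 80GB",  # assumed hardware option
    "min_instances": 1,          # floor auto-scaling returns to when traffic drops
    "max_instances": 4,          # ceiling auto-scaling may provision up to
    "max_batch_size": 8,         # trades per-request latency for total throughput
}
```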
Additional models
We continue to expand the number of models available out-of-the-box, now with over 100 models available. Since launch, we've added over 35 new models, including recent additions like Mistral, Nous Hermes, and Llemma.
Pricing
Faster performance means we use less compute, and we're excited to pass these efficiencies on to you in the form of lower pricing. Today we are lowering the price of Serverless Endpoints for 70B models, including Llama-2-70B, to $0.0009 per 1K tokens.
Use models like Llama-2-13B-Chat at 6x lower cost than GPT-3.5 Turbo, with performance that is 1.85x faster.
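To make the new 70B rate concrete, here is the arithmetic for a single request with the default LLMPerf shape used above (500 input tokens, 150 output tokens), since Serverless Endpoints bill on the sum of input and output tokens:

```python
# Cost of one request at $0.0009 per 1K tokens (input + output billed together).
PRICE_PER_1K = 0.0009
input_tokens, output_tokens = 500, 150  # default LLMPerf request shape

cost = (input_tokens + output_tokens) / 1000 * PRICE_PER_1K
print(f"${cost:.6f} per request")                        # $0.000585
print(f"${cost * 1_000_000:.0f} per million requests")  # $585
```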
Updated pricing for Llama-2 models is as follows:
For full pricing details, including other models, visit our pricing page.