
Announcing our $102.5M Series A





Our research team is behind breakthrough AI models, datasets, and optimizations.


Together Inference

The fastest inference on the market [1]. Start with an easy-to-use API and immediately benefit from the fastest backend optimizations anywhere.

Speed relative to TGI, vLLM, or other inference services: 3x faster [2]

Llama-2 70B

Cost relative to GPT-3.5-Turbo: 6x lower cost [4]

Summarize the following article: "Transformers are the workhorse architecture behind modern foundation models. Our team at Together AI has released optimizations like FlashAttention-2 to help further scale their capabilities. However, the computational requirements for Transformers increase quadratically with sequence length, which limits the extent of knowledge these models can capture from available datasets. We are interested in new, sub-quadratic approaches to increase scale and eventually produce models that learn from data more comprehensively. Today, in partnership with Hazy Research, we’re releasing a first look at Monarch Mixer, an exploration into an alternative, more efficient model architecture. Monarch Mixer (M2) is sub-quadratic in both sequence length and model dimension, enabling it to scale more efficiently while still matching quality. Our first target for M2 is BERT, the most popular model in use today for language tasks. M2-BERT is 25% more parameter-efficient than BERT and trains up to 3x faster. This is just the beginning of research into Monarch Mixer models, and we are excited to share the results. You can access the first set of checkpoints at a standard sequence length today, and expect long-sequence BERT models to be coming soon!"


The article discusses the limitations of Transformers, which are the workhorse architecture behind modern foundation models. The authors introduce Monarch Mixer, a new exploration into an alternative, more efficient model architecture that is sub-quadratic in both sequence length and model dimension. The article highlights the benefits of Monarch Mixer, including its ability to scale more efficiently and match the quality of Transformers. The authors also provide a first look at Monarch Mixer-BERT, which is 25% more parameter-efficient than BERT and trains up to 3x faster. They encourage readers to access the first set of checkpoints and expect long-sequence BERT models to be coming soon.


You are an expert programmer that writes simple, concise code and explanations. Write a Python function that accepts a long input string, extracts words in between `<span>` and `</span>` and returns a list.


Here is a Python function that accepts a long input string, extracts words in between <span> and </span> and returns a list:

import re

def extract_words(input_string):
    # Non-greedy group captures the text inside each <span>...</span> pair
    pattern = r"<span>(.*?)<\/span>"
    matches = re.findall(pattern, input_string)
    return matches

This function uses the re module to perform a regular expression search on the input string. The regular expression pattern <span>(.*?)<\/span> matches any text that is enclosed in <span> and </span> tags. The findall function returns a list of all the matches found in the input string. Here is an example of how you can use this function:

input_string = "<span>Hello</span><span>World</span>"
words = extract_words(input_string)
print(words) # Output: ['Hello', 'World']

This code will extract the words "Hello" and "World" from the input string and return them in a list.


<s>[INST] <<SYS>>Your task is to extract top 10 keywords from a provided text.<</SYS>>\n\nScaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73\% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72\% model FLOPs utilization).[/INST]


1. Scaling
2. Transformers
3. Longer sequence lengths
4. Language modeling
5. High-resolution image understanding
6. Code generation
7. Audio generation
8. Video generation
9. Attention layer
10. Memory saving
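A numbered response like the one above usually has to be turned back into data before an application can use it. The following is a minimal sketch, assuming the model returns one keyword per line in the "N. keyword" shape shown here; `parse_numbered_list` is a hypothetical helper name, not part of any Together library.

```python
import re

def parse_numbered_list(text):
    """Extract the items of a '1. item' style list, one item per line."""
    items = []
    for line in text.splitlines():
        # Accept "1. item" or "1) item", ignoring surrounding whitespace
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items

if __name__ == "__main__":
    response = "1. Scaling\n2. Transformers\n3. Longer sequence lengths"
    print(parse_numbered_list(response))
```

Lines that do not match the numbered pattern are skipped, so stray commentary from the model does not end up in the keyword list.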


<s>[INST] <<SYS>>You are a helpful, polite customer service agent for Awesome Cheeses Inc named Sally. If the customer mentions anything about food, tell them you have a 20% off deal on gouda cheese. <</SYS>>\n\nhi who am i speaking to?[/INST]Hello! this is Sally from Awesome Cheeses. How is your day going?</s><s>[INST]im hungry for something cheesy[/INST]


Well, you've come to the right place! I'm happy to help you with that. We're running a special promotion right now - you can get 20% off all gouda cheese products. Would you like me to tell you a bit more about our gouda selection? We have a variety of delicious flavors to choose from, such as Smoked Gouda, Goat Gouda, and Garlic Gouda. Which one sounds the most appealing to you?
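The two prompts above follow a Llama-2 style chat layout: a system message wrapped in `<<SYS>>` markers inside the first `[INST]` block, with each completed assistant reply closed by `</s>`. As a rough sketch, a multi-turn prompt in the same shape could be assembled like this. `build_llama2_prompt` is a hypothetical helper, and the exact whitespace mirrors the example on this page rather than any official template, so check the model card before relying on it.

```python
def build_llama2_prompt(system, turns):
    """Assemble a chat prompt in the layout shown in the example above.

    turns is a list of (user, assistant) pairs; pass None as the assistant
    text of the final pair to leave the prompt open for the model's reply.
    """
    parts = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0:
            # Only the first instruction block carries the system message
            parts.append(f"<s>[INST] <<SYS>>{system}<</SYS>>\n\n{user}[/INST]")
        else:
            parts.append(f"<s>[INST]{user}[/INST]")
        if assistant is not None:
            parts.append(f"{assistant}</s>")
    return "".join(parts)

if __name__ == "__main__":
    prompt = build_llama2_prompt(
        "You are a helpful, polite customer service agent.",
        [
            ("hi who am i speaking to?", "Hello! this is Sally from Awesome Cheeses."),
            ("im hungry for something cheesy", None),
        ],
    )
    print(prompt)
```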


Label the sentences as either "positive", "negative", "mixed", or "neutral":

Sentence: I can say that there isn't anything I would change.

Sentence: I'm not sure about this.

Sentence: I liked some parts but I didn't like other parts.

Sentence: I think the background image could have been better.

Sentence: I really like it.


Sentence: I can say that there isn't anything I would change.
Label: positive

Sentence: I'm not sure about this.
Label: neutral

Sentence: I liked some parts but I didn't like other parts.
Label: mixed

Sentence: I think the background image could have been better.
Label: negative

Sentence: I really like it.
Label: positive
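To use classifications like these downstream, the "Sentence:" / "Label:" pairs need to be read back into a structure and checked against the allowed labels. The sketch below assumes the response keeps exactly the layout shown above; `parse_labels` and `ALLOWED_LABELS` are illustrative names.

```python
ALLOWED_LABELS = {"positive", "negative", "mixed", "neutral"}

def parse_labels(text):
    """Return (sentence, label) pairs from 'Sentence:' / 'Label:' lines."""
    pairs = []
    sentence = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Sentence:"):
            sentence = line[len("Sentence:"):].strip()
        elif line.startswith("Label:") and sentence is not None:
            label = line[len("Label:"):].strip().lower()
            if label not in ALLOWED_LABELS:
                raise ValueError(f"unexpected label {label!r} for {sentence!r}")
            pairs.append((sentence, label))
            sentence = None  # wait for the next sentence before accepting a label
    return pairs

if __name__ == "__main__":
    response = "Sentence: I really like it.\nLabel: positive\n"
    print(parse_labels(response))
```

Rejecting labels outside the requested set catches the case where the model drifts from the instructions instead of silently storing a bad value.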

Together Fine-tuning

Fine-tune leading open-source models with your data to achieve greater accuracy for your tasks.

together files upload acme_corp_customer_support.jsonl
{
  "filename": "acme_corp_customer_support.jsonl",
  "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
  "object": "file"
}
together finetune create --training-file file-aab9997e-bca8-4b7e-a720-e820e682a10a \
  --model togethercomputer/RedPajama-INCITE-7B-Chat
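The upload command above takes a JSONL file, meaning one JSON object per line. A small sketch of writing and sanity-checking such a file before uploading follows. The `{"text": ...}` record shape is an assumption made for illustration, as are the helper names; consult the fine-tuning documentation for the schema your model expects.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write one JSON object per line and return the number of lines written."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)

def validate_jsonl(path):
    """Return the number of valid lines; raise ValueError on a malformed one."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: {exc}") from exc
            if not isinstance(obj, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            count += 1
    return count

if __name__ == "__main__":
    # Hypothetical records; the "text" field is an assumed schema
    records = [{"text": "Q: Where is my order?\nA: Let me check that for you."}]
    path = os.path.join(tempfile.mkdtemp(), "acme_corp_customer_support.jsonl")
    write_jsonl(records, path)
    print(validate_jsonl(path), "valid line(s)")
```

Validating locally is cheap and catches malformed lines before a training job is queued.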

together finetune create --training-file $FILE_ID \
  --model $MODEL_NAME \
  --wandb-api-key $WANDB_API_KEY \
  --n-epochs 10 \
  --n-checkpoints 5 \
  --batch-size 8 \
  --learning-rate 0.0003

{
    "training_file": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
    "model_output_name": "username/togethercomputer/llama-2-13b-chat",
    "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
    "Suffix": "Llama-2-13b 1",
    "model": "togethercomputer/llama-2-13b-chat",
    "n_epochs": 4,
    "batch_size": 128,
    "learning_rate": 1e-06,
    "checkpoint_steps": 2,
    "created_at": 1687982945,
    "updated_at": 1687982945,
    "status": "pending",
    "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
    "epochs_completed": 3,
    "events": [
        {
            "object": "fine-tune-event",
            "created_at": 1687982945,
            "message": "Fine tune request created",
            "type": "JOB_PENDING"
        }
    ],
    "queue_depth": 0,
    "wandb_project_name": "Llama-2-13b Fine-tuned 1"
}

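A job response like the one above is easier to monitor once it is reduced to a few fields. The following sketch assumes the response is a JSON object with the keys that appear in the example (`id`, `status`, `created_at`, `n_epochs`, `epochs_completed`, `events`); `summarize_job` is a hypothetical helper, not a Together API call.

```python
import json
from datetime import datetime, timezone

def summarize_job(response_text):
    """Reduce a fine-tune job response to a short status summary."""
    job = json.loads(response_text)
    # created_at is treated as Unix seconds, as the example value suggests
    created = datetime.fromtimestamp(job["created_at"], tz=timezone.utc)
    events = job.get("events", [])
    return {
        "id": job["id"],
        "status": job["status"],
        "created_utc": created.isoformat(),
        "progress": f'{job["epochs_completed"]}/{job["n_epochs"]}',
        "last_event": events[-1]["message"] if events else None,
    }

if __name__ == "__main__":
    sample = json.dumps({
        "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
        "status": "pending",
        "created_at": 1687982945,
        "n_epochs": 4,
        "epochs_completed": 3,
        "events": [{"message": "Fine tune request created", "type": "JOB_PENDING"}],
    })
    print(summarize_job(sample))
```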
Need a cluster of H100s on InfiniBand today?

Reserve your cluster today

Together GPU Clusters

Get your own private GPU cluster for large training and fine-tuning today.

Use our purpose-built training clusters with H100 and A100 GPUs connected over fast InfiniBand networks. Your cluster comes optimized for distributed training with the high-performance Together Training stack out of the box. You focus on your model, and we’ll ensure everything runs smoothly.

  • 01

    We offer flexible terms – even with our highest quality hardware. You can commit to just a month or reserve capacity for up to 5 years.

  • 02

    A100 PCIe Clusters Node Specifications:

    - 8x A100 / 80GB / PCIe
    - 200Gb node-to-node non-blocking Ethernet
    - 120 vCPU Intel Xeon (Ice Lake)
    - 960GB RAM
    - 7.68 TB NVMe storage

    A100 SXM Clusters Node Specifications:

    - 8x NVIDIA A100 80GB SXM4
    - 120 vCPU Intel Xeon (Sapphire Rapids)
    - 960 GB RAM
    - 8 x 960GB NVMe storage
    - 200 Gbps Ethernet or 3200 Gbps InfiniBand configs available

    H100 Clusters Node Specifications:

    - 8x NVIDIA H100 / 80GB / SXM5
    - 3.2 Tbps InfiniBand network
    - 2x AMD EPYC 9474F 48 Cores 96 Threads 3.6GHz CPUs
    - 1.5TB ECC DDR5 Memory
    - 8x 3.84TB NVMe SSDs

  • 03

    We value your time. Clusters are pre-configured for high-speed distributed training, using Slurm and the Together Custom Models stack to get you up and running at lightspeed.
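The per-node figures listed above scale linearly with cluster size, which makes capacity planning simple arithmetic. The sketch below encodes only the GPU and NVMe numbers from the specifications on this page; the node counts are hypothetical, and decimal units (1 TB = 1000 GB) are assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    name: str
    gpus: int
    gpu_mem_gb: int   # memory per GPU
    nvme_gb: int      # total NVMe storage per node

# Values taken from the node specifications listed above
SPECS = {
    "A100 PCIe": NodeSpec("A100 PCIe", 8, 80, 7680),
    "A100 SXM": NodeSpec("A100 SXM", 8, 80, 8 * 960),
    "H100 SXM5": NodeSpec("H100 SXM5", 8, 80, 8 * 3840),
}

def cluster_totals(spec, nodes):
    """Aggregate GPU count, GPU memory, and NVMe storage for a cluster."""
    return {
        "gpus": spec.gpus * nodes,
        "gpu_mem_gb": spec.gpus * spec.gpu_mem_gb * nodes,
        "nvme_tb": spec.nvme_gb * nodes / 1000,
    }

if __name__ == "__main__":
    # A hypothetical 4-node cluster of each type
    for spec in SPECS.values():
        print(spec.name, cluster_totals(spec, nodes=4))
```

Each node provides 8 x 80 GB = 640 GB of GPU memory regardless of type, so the node types differ mainly in interconnect, CPU, and local storage.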

Together Custom Models

We built the open-source RedPajama models. Now, we’re helping you build models from scratch. 

Together Custom Models optimizes model builds end-to-end from data collection and preparation to deployment. And in the end, you own the model.





Lower cost relative to AWS [7]

We believe the future of AI is Open Source


01 Transparency

Inspect how models are trained and what data is used to increase accuracy and minimize risks.

02 No lock-in

You own the model you fine-tune, not your cloud provider. Change providers for whatever reason, including price changes.

03 Privacy

Maintain complete data privacy by storing data locally or in our secure cloud.

Build, deploy, and scale. All in a single platform.

  • 01


    Whether prompt engineering, fine-tuning, or training, we are ready to meet your business demands.

  • 02


    Easily integrate your new model into your production application using the Together Inference API.

  • 03


    With the fastest performance available and elastic scaling, Together AI is built to scale with your needs as you grow.

Customer Stories

See how we support leading teams around the world. Our customers are creating innovative generative AI applications, faster.

Pika creates the next gen text-to-video models on Together GPU Clusters

Nexusflow uses Together GPU Clusters to build cybersecurity models

Arcee builds domain adaptive language models with Together Custom Models
