200+ generative AI models
Build with open-source and specialized multimodal models for chat, images, code, and more. Migrate from closed models with OpenAI-compatible APIs.
End-to-end platform for the full generative AI lifecycle
Leverage pre-trained models, fine-tune them for your needs, or build custom models from scratch. Whatever your generative AI needs, Together AI offers a seamless continuum of AI compute solutions to support your entire journey.
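Because the serverless endpoints speak the OpenAI wire format, migrating from a closed model is often just a base-URL and model-name swap. A minimal standard-library sketch (the model name and the `/v1/chat/completions` path are assumptions; verify both against the current API docs before use):

```python
import json
import os
import urllib.request

BASE_URL = "https://api.together.xyz/v1"  # OpenAI-compatible endpoint (verify in docs)

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

# Illustrative model ID; pick one from the live model catalog.
payload = build_chat_request("meta-llama/Llama-3-8b-chat-hf", "Hello!")

api_key = os.environ.get("TOGETHER_API_KEY")
if api_key:  # only send the request when a key is configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Existing OpenAI SDK code can usually be pointed at the same endpoint by overriding the client's base URL, leaving the rest of the call sites unchanged.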
- Inference: The fastest way to build with pretrained AI models
  - ✔ Serverless or dedicated endpoints
  - ✔ Deploy in enterprise VPC
  - ✔ SOC 2 and HIPAA compliant
- Fine-Tuning: Tailored customization for your tasks
  - ✔ Complete model ownership
  - ✔ Fully tune or adapt models
  - ✔ Easy-to-use APIs
  - Options: Full Fine-Tuning or LoRA Fine-Tuning
- GPU Clusters: Full control for massive AI workloads
  - ✔ Accelerate large model training
  - ✔ GB200, B200, and H100 GPUs
  - ✔ Pricing from $1.75/hour
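The two fine-tuning options above differ in what gets updated: full fine-tuning rewrites every weight, while LoRA freezes the base weights and trains only a small low-rank correction. A toy NumPy sketch of the LoRA idea (dimensions and rank are illustrative, not what any hosted model uses):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                            # model dim 8, LoRA rank 2 (illustrative)
W = rng.normal(size=(d, d))            # frozen base weight matrix
A = rng.normal(size=(r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, initialized to 0

def lora_forward(x):
    # Base path plus low-rank correction: W x + B (A x)
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
# Only A and B train: 2*d*r = 32 values, vs d*d = 64 in full fine-tuning.
```

This is why LoRA runs are cheaper and the resulting adapters are small enough to swap per task, while full fine-tuning can move every parameter when the task demands it.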
 
Speed, cost, and accuracy. Pick all three.
[Benchmark charts: inference speed relative to vLLM, Llama-3 8B at full precision, and cost relative to GPT-4o.]
Why Together Inference
Powered by the Together Inference Engine, combining research-driven innovation with deployment flexibility.
- Accelerated by cutting-edge research
  - Transformer-optimized kernels: our researchers' custom FP8 inference kernels, 75%+ faster than base PyTorch
  - Quality-preserving quantization: accelerating inference while maintaining accuracy with advances such as QTIP
  - Speculative decoding: faster throughput, powered by novel algorithms and draft models trained on the RedPajama dataset
- Flexibility to choose a model that fits your needs
  - Turbo: best performance without losing accuracy
  - Reference: full precision, available for 100% accuracy
  - Lite: optimized for fast performance at the lowest cost
- Available via dedicated instances and serverless API
  - Dedicated instances: fast, consistent performance, without rate limits, on your own single-tenant NVIDIA GPUs
  - Serverless API: quickly switch from closed LLMs to models like Llama, using our OpenAI-compatible APIs
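The speculative decoding mentioned above follows a draft-then-verify loop: a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted in one pass. A toy sketch with integer "tokens" (the production engine uses trained draft models and probabilistic acceptance; this only illustrates the control flow):

```python
def draft(ctx):
    # Cheap draft model: always guesses "count up by one".
    return (ctx[-1] + 1) % 10

def target(ctx):
    # "True" model: also counts up, except it resets after 6,
    # so drafts occasionally disagree and get rejected.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 6 else 0

def speculative_step(context, draft, target, k=4):
    # 1) Draft proposes k tokens autoregressively (cheap).
    ctx = list(context)
    proposed = []
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) Target verifies; accept the longest agreeing prefix.
    out = list(context)
    for t in proposed:
        correct = target(out)
        if t != correct:
            out.append(correct)   # first mismatch: take the target's token, stop
            return out
        out.append(t)             # accepted draft token
    out.append(target(out))       # all drafts accepted: one bonus target token
    return out

print(speculative_step([0], draft, target))  # [0, 1, 2, 3, 4, 5]
```

When the draft agrees, one verification pass emits up to k+1 tokens instead of one, which is where the throughput gain comes from; a mismatch still yields exactly the token the target would have produced alone, so output quality is unchanged.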
Forge the AI frontier. Train on expert-built GPU clusters.
Built by AI researchers for AI innovators, Together GPU Clusters are powered by NVIDIA GB200, H200, and H100 GPUs, along with the Together Kernel Collection — delivering up to 24% faster training operations.
- Top-Tier NVIDIA GPUs: NVIDIA's latest GPUs, like GB200, H200, and H100, for peak AI performance, supporting both training and inference.
- Accelerated Software Stack: The Together Kernel Collection includes custom CUDA kernels, reducing training times and costs with superior throughput.
- High-Speed Interconnects: InfiniBand and NVLink ensure fast communication between GPUs, eliminating bottlenecks and enabling rapid processing of large datasets.
- Highly Scalable & Reliable: Deploy 16 to 1000+ GPUs across global locations, with 99.9% uptime SLA.
- Expert AI Advisory Services: Together AI's expert team offers consulting for custom model development and scalable training best practices.
- Robust Management Tools: Slurm and Kubernetes orchestrate dynamic AI workloads, optimizing training and inference seamlessly.
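On a Slurm-managed cluster like the one described above, a multi-node training job is typically submitted as a batch script. A hedged sketch (node counts, time limit, port, and `train.py` are placeholders, not Together-specific settings):

```shell
#!/bin/bash
# Hypothetical multi-node training submission on a Slurm-managed GPU cluster.
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

# One launcher per node; torchrun spawns the per-GPU worker processes
# and rendezvous on the first node in the allocation.
srun torchrun \
  --nnodes="$SLURM_JOB_NUM_NODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
  train.py
```

Kubernetes-based setups express the same job as a pod spec with GPU resource requests instead of an sbatch script; which orchestrator fits depends on whether the workload is batch training or long-running inference.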
Training-ready clusters – Blackwell and Hopper
THE AI ACCELERATION CLOUD
BUILT ON LEADING AI RESEARCH.
Innovations
Our research team is behind breakthrough AI models, datasets, and optimizations.