Build what’s next
on the AI Native Cloud

Full-stack AI platform, powered by cutting-edge research.

The Together AI Platform

Accelerate inference, model shaping and pre-training on a research-optimized platform.

  • 2x faster inference, powered by cutting-edge research.

  • 60% lower cost, with workload-specific optimization.

  • 90% faster pre-training, with Together Kernel Collection.

Full-stack cloud

Powering every step of the AI development journey
—from experimentation to massive scale.

  • Inference

    • Serverless Inference

      The fastest way to run open-source models on demand. Powered by cutting-edge inference research. No infrastructure to manage, no long-term commitments.

    • Batch Inference

      Cost-effectively process massive workloads asynchronously. Scale to 30 billion tokens per model with any serverless model or private deployment.

    • Dedicated Model Inference

      Deploy models on dedicated infrastructure. Purpose-built for teams that need speed, control, and the best economics in the market.

    • Dedicated Container Inference

      GPU infrastructure purpose-built for generative media workloads. Deploy video, audio, and image models with performance acceleration powered by Together Research.

  • Compute

    • Accelerated Compute

      Scale from self-serve instant clusters to thousands of GPUs, all optimized for better performance with Together Kernel Collection.

    • Sandbox

      Run fast, secure code sandboxes at scale to set up full-scale development environments for AI apps and agents.

    • Managed Storage

      High-performance managed storage for AI-native workloads. Object storage and parallel filesystems optimized for AI, with zero egress fees.

  • Model shaping

    • Fine-Tuning

      Fine-tune open-source models for production workloads, using the latest research techniques. Improve accuracy, reduce hallucinations, and control behavior, all without managing training infrastructure.
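The Serverless Inference product described above is reachable through Together's OpenAI-compatible HTTP API. A minimal sketch of a chat completion call, assuming an API key in the `TOGETHER_API_KEY` environment variable; the model name here is illustrative, so swap in any model from the Together catalog:

```python
import json
import os
from urllib.request import Request, urlopen

# Illustrative model name; replace with any model from the Together catalog.
MODEL = "meta-llama/Llama-3-8b-chat-hf"


def build_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }


def complete(prompt: str, api_key: str) -> str:
    """POST the payload to the serverless endpoint and return the reply text."""
    req = Request(
        "https://api.together.xyz/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__" and os.environ.get("TOGETHER_API_KEY"):
    print(complete("Say hello.", os.environ["TOGETHER_API_KEY"]))
```

Because the endpoint is OpenAI-compatible, the same payload shape also works with the official `together` Python SDK or any OpenAI client pointed at the Together base URL.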

Grounded in cutting-edge research

Foundational systems research for production AI.

Agents

CoderForge-Preview: SOTA open dataset for training efficient coding agents

By Alpay Ariyak*, Junda Zhang, Junxiong Wang, Shang Zhu, Federico Bianchi, Sanjana Srivastava, Ashwinee Panda, Siddhant Bharti, Chenfeng Xu, John Heo, Xiaoxia Shirley Wu, James Zou, Percy Liang, Leon Song, Ce Zhang, Ben Athiwaratkun, Zhongzhu Zhou*, Qingyang Wu* (*Project Core Leads)

Agents

How speech models fail where it matters the most and what to do about it

Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

Inference

Consistency diffusion language models: Up to 14x faster inference without sacrificing quality

Minseo Kim, Chenfeng Xu, Coleman Richard Charles Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami | Seoul National University, University of California, Berkeley, Together AI

Inference

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Jiejing Zhang, Yubo Wang, Yinghui Liu, Mourya Vangala Srinivasa, Chenxi Li, Jue Wang, Yineng Zhang, Shuaiwen Leon Song, Ce Zhang

Agents

What do LLMs think when you don't tell them what to think about?

Yongchan Kwon and James Zou

Agents

DSGym: A holistic framework for evaluating and training data science agents

Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou

Kernels

Research POV: Yes, AGI Can Happen – A Computational Perspective

Together AI

Model Shaping

How to run TorchForge reinforcement learning pipelines in the Together AI Native Cloud

Together AI Training and Research, The PyTorch team at Meta

Model Shaping

Introducing AutoJudge: Streamlined inference acceleration via automated dataset curation

Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin

Agents

Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

Yongchan Kwon, Shang Zhu, Federico Bianchi, Kaitlyn Zhou, James Zou

Inference

AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators

Junxiong Wang, Shirley Wu, Zelei Shao, Vikranth Srivatsa, Jue Wang, Roy Yuan, Qingyang Wu, Alpay Ariyak, Rupert Wu, Wai Tong Chung, Chenfeng Xu, Yonatan Oren, Pragaash Ponnusamy, Yineng Zhang, Avner May, Leon Song, Tri Dao, Percy Liang, Ce Zhang, Ben Athiwaratkun

Agents

How Together AI Uses AI Agents to Automate Complex Engineering Tasks: Lessons from Developing Efficient LLM Inference Systems

Shang Zhu, Federico Bianchi, Wai Tong Chung, Zain Hasan, Rupert Wu, Ce Zhang, James Zou, Ben Athiwaratkun

Agents

Back to The Future: Evaluating AI Agents on Predicting Future Events

Federico Bianchi, Junlin Wang, Zain Hasan, Shang Zhu, Roy Yuan, Clémentine Fourrier, James Zou

Inference

DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL

Michael Luo*, Naman Jain*, Jaskirat Singh*, Sijun Tan*, Ameen Patel*, Qingyang Wu*, Alpay Ariyak*, Colin Cai*, Tarun Venkat, Shang Zhu, Ben Athiwaratkun, Manan Roongta, Ce Zhang, Li Erran Li, Raluca Ada Popa, Koushik Sen, Ion Stoica

Agents

From Zero to One: Building An Autonomous and Open Data Scientist Agent from Scratch

Federico Bianchi, Shang Zhu, Zain Hasan, Ben Athiwaratkun and James Zou

Inference

Model-Preserving Adaptive Rounding with YAQA

Albert Tseng, Zhaofeng Sun, and Chris De Sa

Agents

Mixture-of-Agents Alignment: Harnessing the Collective Intelligence of Open-Source LLMs to Improve Post-Training

Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou

Inference

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

Wai Tong Chung, Dan Waters, Avner May, Ben Athiwaratkun

Kernels

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

Austin Silveria, Soham Govande, Dan Fu

Finetuning

Direct Preference Optimization: A Technical Deep Dive

Ivan Provilkov, Zain Hasan, Max Ryabinin

Finetuning

Continued Fine-tuning of LLMs: A Technical Deep Dive

Artem Chumachenko, Zain Hasan, Max Ryabinin

Agents

Open Deep Research

Together AI

Inference

DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level

Michael Luo*, Sijun Tan*, Roy Huang*, Ameen Patel*, Alpay Ariyak*, Qingyang Wu*, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica

Kernels

ThunderKittens Now Optimized for NVIDIA Blackwell GPUs

Benjamin Spector, Aaryan Singhal, Dan Fu, Chris Ré

Inference

Minions: embracing small LMs, shifting compute on-device, and cutting cloud costs in the process

Avanika Narayan*, Dan Biderman*, Sabri Eyuboglu*, Avner May, Scott Linderman, James Zou, Christopher Ré

Model Shaping

Long Context Fine-Tuning: A Technical Deep Dive

George Grigorev, Zain Hasan, Max Ryabinin

Model Shaping

Fine-Tuning LLMs for Multi-Turn Conversations: A Technical Deep Dive

Artem Chumachenko, Zain Hasan, Max Ryabinin

Inference

Even Better, Even Faster Quantized LLMs with QTIP

Albert Tseng, Qingyao Sun, David Hou, Chris De Sa

Inference

Linearizing LLMs with LoLCATs

Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher Ré

Applications

Multimodal Document RAG with Llama 3.2 Vision and ColQwen2

Zain Hasan

Inference

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao

Inference

Speculative decoding for high-throughput long-context inference

Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Yunho Jin, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Beidi Chen

Inference

TEAL: Training-Free Activation Sparsity in Large Language Models

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

Kernels

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah (Colfax Research), Ganesh Bikshandi (Colfax Research), Ying Zhang (Meta), Vijay Thakkar (NVIDIA), Pradeep Ramani (NVIDIA), Tri Dao (Princeton University, Together AI)

Applications

Building a personalized code assistant with open-source LLMs using RAG Fine-tuning

Kezhen Chen, Linda He, Ben Athiwaratkun, Jue Wang, Maurice Weber, Heejin Jeong, Yonatan Oren, Michael Poli

Inference

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

Agents

Together MoA — collective intelligence of open-source models pushing the frontier of LLM capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou

Inference

Dragonfly: A large vision-language model with multi-resolution zoom

Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

Kernels

ThunderKittens: A Simple Embedded DSL for AI kernels

Benjamin Spector, Aaryan Singhal, Simran Arora, Chris Ré

Inference

FAQ: Building LLMs with RedPajama-v2, a 30 trillion token web dataset

Together AI

Inference

Hydragen: High-Throughput LLM Inference with Shared Prefixes

Inference

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

Inference

BASED: Simple linear attention language models balance the recall-throughput tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

Inference

Evo: Long-context modeling from molecular to genome scale

Eric Nguyen, Michael Poli, Matthew Durrant, Patrick Hsu, Brian Hie

Inference

BitDelta: Your Fine-Tune May Only Be Worth One Bit

James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

Inference

Long context retrieval models with Monarch Mixer

Jon Saad-Falcon, Dan Fu, Simran Arora

Inference

Mamba-3B-SlimPJ: State-space models rivaling the best Transformer architecture

Tri Dao, Albert Gu

Inference

Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers

Together

Kernels

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Dan Fu, Hermann Kumbong, Eric Nguyen, Chris Ré

Inference

RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models

Together

Kernels

Flash-Decoding for long-context inference

Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov

Inference

Medusa: Simple framework for accelerating LLM generation with multiple decoding heads

Tianle Cai*, Yuhong Li*, Zhengyang Geng, Hongwu Peng, Tri Dao (* Equal contribution)

Finetuning

Llama-2-7B-32K-Instruct — and fine-tuning for Llama-2 models with Together API

Together

Inference

Faster inference enables up to 5x price reduction on Together API

Together

Inference

Preparing for the era of 32K context: Early learnings and explorations

Together

Kernels

Monarch Mixer: A new model architecture for increased efficiency

Dan Fu, Simran Arora, Chris Ré

Finetuning

Fine-tuning language models over slow networks using activation compression with guarantees

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Ré, Ce Zhang

Finetuning

Decentralized training of foundation models in heterogeneous environments

Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Ré, Ce Zhang

Kernels

FlashAttention: Fast and memory-efficient exact attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Finetuning

CocktailSGD: Fine-tuning foundation models over 500Mbps networks

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Ré, Ce Zhang

Inference

FlexGen: High-throughput generative inference of large language models with a single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

Inference

Hyena Hierarchy: Towards larger convolutional language models

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Kernels

FlashConv: Speeding up state space models

Dan Fu and Tri Dao

Inference

Hungry Hungry Hippos: Towards language modeling with state space models

Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

Finetuning

NeurIPS 2022: Overcoming communication bottlenecks for decentralized training (2/2)

Together

Finetuning

NeurIPS 2022: Overcoming communication bottlenecks for decentralized training (1/2)

Together

Inference

HELM: benchmarking large language models on the Together Research Computer

Together


AI natives build on Together AI

See how Together AI powers customers building the next generation of AI products.