Consistency diffusion language models: Up to 14x faster inference without sacrificing quality
Minseo Kim, Chenfeng Xu, Coleman Richard Charles Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami | Seoul National University, University of California, Berkeley, Together AI
DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL
Michael Luo*, Naman Jain*, Jaskirat Singh*, Sijun Tan*, Ameen Patel*, Qingyang Wu*, Alpay Ariyak*, Colin Cai*, Tarun Venkat, Shang Zhu, Ben Athiwaratkun, Manan Roongta, Ce Zhang, Li Erran Li, Raluca Ada Popa, Koushik Sen, Ion Stoica
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
Michael Luo*, Sijun Tan*, Roy Huang*, Ameen Patel*, Alpay Ariyak*, Qingyang Wu*, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah (Colfax Research), Ganesh Bikshandi (Colfax Research), Ying Zhang (Meta), Vijay Thakkar (NVIDIA), Pradeep Ramani (NVIDIA), Tri Dao (Princeton University, Together AI)
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang