Research

Published 4/24/2023

CocktailSGD: Fine-tuning foundation models over 500Mbps networks

Authors
Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang

Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and has therefore relied heavily on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques (random sparsification, top-K sparsification, and quantization) to achieve much greater compression than any of the techniques alone. We justify the benefit of this hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117x compression when fine-tuning LLMs of up to 20 billion parameters without hurting convergence. On a 500Mbps network, CocktailSGD incurs only a ∼1.2x slowdown compared with data center networks.
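
To make the idea of stacking the three compressors concrete, here is a minimal pure-Python sketch that applies random sparsification, then top-K sparsification, then uniform quantization to a toy update vector. It is an illustration under stated assumptions, not the paper's implementation: the function names, the order of the stages, the keep fractions, the 4-bit width, and the single shared scale are all hypothetical choices made for this example.

```python
import random


def random_sparsify(values, keep_fraction, rng):
    """Keep a uniformly random subset of coordinates as {index: value}."""
    n = len(values)
    k = max(1, round(n * keep_fraction))
    kept = rng.sample(range(n), k)
    return {i: values[i] for i in kept}


def top_k_sparsify(sparse, k):
    """Keep the k entries with the largest magnitude."""
    ranked = sorted(sparse.items(), key=lambda item: abs(item[1]), reverse=True)
    return dict(ranked[:k])


def quantize(sparse, bits):
    """Map values to signed integer codes that share one scale factor."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in sparse.values())
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = {i: round(v / scale) for i, v in sparse.items()}
    return codes, scale


def dequantize(codes, scale, n):
    """Rebuild a dense vector; coordinates that were dropped become 0."""
    dense = [0.0] * n
    for i, code in codes.items():
        dense[i] = code * scale
    return dense


def cocktail_compress(values, keep_fraction=0.1, top_k_fraction=0.2, bits=4, seed=0):
    """Chain the three stages (assumed order and parameters, for illustration)."""
    rng = random.Random(seed)
    sampled = random_sparsify(values, keep_fraction, rng)
    k = max(1, round(len(sampled) * top_k_fraction))
    selected = top_k_sparsify(sampled, k)
    return quantize(selected, bits)


# Toy "update" vector standing in for the values a worker would communicate.
data_rng = random.Random(42)
delta = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]

codes, scale = cocktail_compress(delta)
restored = dequantize(codes, scale, len(delta))
print(f"kept {len(codes)} of {len(delta)} coordinates at 4 bits each")
```

Each stage shrinks the message in a different way: the first two reduce how many coordinates are sent, and the last reduces how many bits each surviving coordinate costs, so their savings multiply. The sketch ignores the cost of transmitting indices and omits any error-compensation step, so it does not reproduce the compression ratios reported above.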