How Deep Cogito trained and deployed frontier reasoning models on Together AI

Summary

Deep Cogito builds open-weight reasoning models that need custom training infrastructure and production-grade serving. Their work spans large-scale model training, hybrid reasoning models, and the post-training release of quantized variants that make the models usable by a broader developer and research community.

Together AI supports that path from training to production. Deep Cogito uses Together for customizable H100/H200 GPU clusters, reliable long-run training infrastructure, dedicated inference endpoints for the Cogito model family, and rapid quantization into FP8, FP4, and INT4 variants after training. Together handles the infrastructure layer so Deep Cogito can focus on model training, evaluation, distribution, and reasoning research.

About Deep Cogito

Deep Cogito is a San Francisco-based AI research lab that builds specialized AI models for enterprises through post-training. Founded by veterans of Google Search and DeepMind, the company focuses on post-training methods that help models reason more effectively across language, symbolic, and knowledge-work tasks. Its Cogito model family spans 8B to 671B parameters, is fully open-weight, and ranks competitively on standard benchmarks and chat arenas including LM Arena. Their 405B dense model is the highest-performing open-weight model trained entirely in the United States. Deep Cogito’s approach combines hybrid reasoning, in which models operate in thinking or non-thinking mode depending on task complexity, with Iterated Distillation and Amplification (IDA).

The challenge

Deep Cogito’s research ambitions created infrastructure requirements that standard cloud platforms were not equipped to handle:

Training large open models required custom cluster configurations: The 671B-parameter Cogito model required multi-node GPU clusters with non-standard CPU and RAM configurations, including up to 8 TiB of CPU RAM to support gradient offloading during training, well beyond default node specifications. Deep Cogito needed a partner willing to customize cluster configurations on short notice, not just provision off-the-shelf instances.
Long training runs required stability and visibility: As Deep Cogito’s training runs grew from days to weeks, hardware failures became an increasing problem. Silent node failures mid-run forced restarts, destroyed momentum, and threatened timelines. The team needed infrastructure with both high hardware reliability and a responsive support team that could identify and resolve problems as they arose.
Hybrid reasoning modes required a custom serving path: Deep Cogito’s core innovation — the ability for models to switch between direct responses and extended reasoning chains depending on task complexity — had no existing serving template. They needed an inference partner capable of building a custom pipeline around that behavior quickly, before anyone else had done it.
Reasoning-mode inference raised the serving bar: Longer reasoning outputs put more pressure on latency and throughput. Deep Cogito needed fast time-to-first-token under production load, not just strong benchmark results in controlled tests.
Post-training deployment required fast quantization: Launching a large model means shipping quantized variants (FP8, FP4, INT4) that allow a much broader community of users to run the model on their own hardware. That requires deep, specialized expertise. Deep Cogito needed a partner that had it.
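
To make the CPU RAM requirement concrete, here is a back-of-envelope sketch of how much host memory per-parameter training state can occupy for a 671B-parameter model. The byte counts are assumptions, not details from Deep Cogito's training setup: the sketch assumes mixed-precision Adam with FP32 master weights, two FP32 moment buffers, and BF16 gradients, all held in CPU RAM.

```python
# Back-of-envelope estimate of offloaded training state for a 671B-parameter model.
# Assumption (not from the case study): mixed-precision Adam with FP32 master
# weights, two FP32 moment buffers, and BF16 gradients, all kept in CPU RAM.

TIB = 1024 ** 4  # bytes in one tebibyte
NUM_PARAMS = 671_000_000_000


def state_bytes(num_params: int, bytes_per_param: int) -> int:
    """Total bytes needed to hold per-parameter state of the given width."""
    return num_params * bytes_per_param


# FP32 master weights (4) + Adam first moment (4) + Adam second moment (4)
optimizer_tib = state_bytes(NUM_PARAMS, 12) / TIB
# BF16 gradients (2 bytes per parameter)
gradients_tib = state_bytes(NUM_PARAMS, 2) / TIB

print(f"optimizer state: {optimizer_tib:.2f} TiB")
print(f"gradients:       {gradients_tib:.2f} TiB")
```

Under these assumptions the optimizer state alone comes to roughly 7.3 TiB, the same order of magnitude as the 8TiB figure cited above. How the state was actually partitioned across nodes is not stated in the source.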

The solution

Deep Cogito uses Together for both large-scale model training and production inference across the full Cogito model family:

Customizable, reliable multi-node training clusters: Together provisioned multi-node H100 and H200 clusters with the non-standard CPU and RAM configurations Deep Cogito’s training jobs required, scaling cluster size as runs grew. The team’s focus on intra- and inter-node connectivity, ensuring reliable high-bandwidth communication across nodes, proved critical for training stability on very large models. Together’s Customer Experience team remained available through long training runs, and the monitoring stack gave Deep Cogito visibility into run health so issues could be caught and resolved before they became costly.

“Working with Together AI has allowed Deep Cogito to execute efficiently in a fast-moving and ambiguous compute environment, and deliver highly performant endpoints to its users.” — Dhruv Malrana, Co-founder, Deep Cogito

Custom inference pipeline for hybrid reasoning: When Deep Cogito was ready to ship hybrid reasoning (the ability for a model to decide dynamically whether to respond directly or engage in extended step-by-step reasoning), Together built the inference pipeline to support it. Notably, Deep Cogito deployed this on the DeepSeek pre-trained base model before DeepSeek launched the capability themselves, a first that required close collaboration between both teams to execute quickly.
Production serving at scale: Together hosts the full Cogito model lineup, from 3B to 671B parameters, on dedicated inference infrastructure. This includes the larger reasoning models, where inference speed matters most. Together delivers sub-500ms time to first token at a sustained throughput of over 1,000 requests per minute, with 99.9% uptime.
Fast quantization turnaround post-launch: After each training run, Together’s team shipped FP8, FP4, and INT4 quantized variants of the Cogito models in under two weeks, without meaningful quality degradation. These variants dramatically extended the reach of each model release, enabling the broader hobbyist and research community to run Cogito models on consumer hardware.
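
As a rough illustration of why quantized variants widen the audience, the sketch below estimates weight-only memory at each precision. It assumes a 16-bit (BF16) baseline and ignores the per-group scale and zero-point overhead that real quantization formats carry; neither detail comes from the case study, and the function names are illustrative.

```python
# Weight-only memory estimate at different precisions.
# Assumptions (not from the case study): a BF16 baseline, and no overhead for
# quantization scales or zero points, so real checkpoints will be somewhat larger.

GIB = 1024 ** 3  # bytes in one gibibyte

BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "FP4": 4, "INT4": 4}


def weight_gib(num_params: int, bits: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits / 8 / GIB


def footprint_table(num_params: int) -> dict:
    """Map each format name to its approximate weight size in GiB."""
    return {fmt: weight_gib(num_params, bits) for fmt, bits in BITS_PER_WEIGHT.items()}


for label, n in [("8B", 8_000_000_000), ("671B", 671_000_000_000)]:
    sizes = footprint_table(n)
    row = ", ".join(f"{fmt}: {gib:,.1f} GiB" for fmt, gib in sizes.items())
    print(f"{label:>5} -> {row}")
```

Under these assumptions, an 8B model's weights drop from about 15 GiB at 16-bit to under 4 GiB at 4-bit, which is the difference between needing a data-center GPU and fitting on a consumer card.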

Results

Together’s infrastructure supported Deep Cogito’s ability to train and ship frontier open-weight reasoning models on a startup timeline, with production reliability typically associated with much larger organizations:

Frontier model training on a startup timeline: Multi-node H100/H200 clusters with custom CPU RAM configurations enabled Deep Cogito to train 671B-parameter models with the hardware flexibility their workloads required. Hardware reliability and monitoring meant long runs completed without disruption, keeping the team on schedule.
First hybrid reasoning deployment in the open ecosystem: Deep Cogito and Together AI collaborated to ship a hybrid reasoning inference pipeline, with both thinking and non-thinking modes, on the DeepSeek base model before DeepSeek launched that capability themselves. Together’s inference team built the serving layer that made the feature work reliably in production.
Sub-500ms time to first token at production scale: Across the Cogito model lineup, Together delivers under 500ms time to first token at over 1,000 requests per minute, with 99.9% uptime. For reasoning models with extended chain-of-thought outputs, inference speed is a core product requirement, not a secondary metric.
Quantized variants shipped in under two weeks: Together’s team delivered FP8, FP4, and INT4 variants of each Cogito model release within two weeks post-training, without meaningful quality degradation. These releases drove significant adoption, contributing to over 1 million downloads across Hugging Face and Ollama.
Research team focused on research: By offloading infrastructure operations to Together, Deep Cogito’s team concentrated entirely on model training, evaluation, and algorithmic research, the work that differentiates them.
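
For readers who want to check latency figures like these against their own workload, here is a minimal sketch of measuring time to first token on any streaming iterator, plus a Little's law estimate of how many requests are in flight at a given request rate. The `fake_stream` generator is a hypothetical stand-in for a real streaming endpoint, and the 30-second mean request duration is an illustrative assumption, not a number from the case study.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first item arrives, that item)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def in_flight_requests(requests_per_minute: float, mean_seconds_per_request: float) -> float:
    """Little's law: average concurrency = arrival rate x mean time in system."""
    return requests_per_minute / 60.0 * mean_seconds_per_request


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint: waits, then yields tokens."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"


ttft, token = time_to_first_token(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.0f} ms, first token: {token!r}")

# Assumed 30 s mean duration for a long reasoning response (illustrative only).
print(f"average requests in flight: {in_flight_requests(1000, 30):.0f}")
```

The second function shows why long reasoning outputs raise the serving bar: at a fixed request rate, average concurrency grows in proportion to how long each response takes to finish.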

“Before Together, we were struggling with adapting to the compute environment for training large open models. Once we found Together, we were able to offload most of it to them and focus on model training and benchmarking, which is our core secret sauce.” — Dhruv Malrana, Co-founder, Deep Cogito

Available on the Together Model Library

Developers can build with Deep Cogito’s models directly on Together AI.