
How Hedra Scales Viral AI Video Generation with 60% Cost Savings

  • 60% cost savings
  • 2x faster inference on Blackwell
  • 300x GPU usage growth

Executive Summary

Hedra needed infrastructure that could scale instantly to handle viral AI-video demand without ballooning costs or tying up engineers.

Partnering with Together AI, the team shifted video inference to optimized H100/H200 clusters with kernel optimizations and flexible autoscaling.

The migration cut cloud costs by 60%, doubled inference speed, supported 300x growth in compute demand, and removed the need for extra platform engineers, preserving healthy margins as usage soared.

AI startup behind viral baby podcasts achieves 300x growth using Together AI's optimized infrastructure.

About Hedra

Hedra builds AI that creates video content from simple prompts. Founded by PhD researcher Michael Lingelbach, the company created Character-1, a video generation model that powers viral content across TikTok and Instagram while serving enterprise customers incorporating AI-generated media into content strategies.

Operating as both a product company and an applied AI research lab, Hedra serves millions of videos monthly through its web platform and API. Since its seed stage, the company's compute usage has grown 300x while it has maintained consistent performance and cost efficiency at massive scale.

The Infrastructure Challenge

Viral video generation creates unique infrastructure demands that hyperscalers cannot meet:

Elastic Enterprise Performance: Usage can triple overnight without warning when content goes viral. Standard cloud provisioning requires weeks of negotiation and capacity planning, making it impossible to respond to viral moments that drive user growth. Moving upmarket also requires SLA guarantees and consistent performance, but enterprise contracts typically lock startups into rigid, long-term commitments that conflict with rapid iteration cycles.

Custom Model Architectures: Unlike text models that can leverage existing inference frameworks, video generation requires entirely new architectures. "It's not like we can take Llama and just fine-tune it," explains CEO Michael Lingelbach. "We have to go out and build net new things." Hedra's models require specialized deployment infrastructure unavailable from traditional cloud providers.

Cost Efficiency at Massive Scale: Serving millions of videos creates unsustainable unit economics unless infrastructure costs are optimized. "When you're serving millions of videos, you actually start caring a lot about costs because you want to make sure that you have a mix of conversion into paid users," explains Michael. Managing auto-scaling, observability, and GPU procurement typically requires 1-2 dedicated platform engineers, resources that startups need focused on core product development.

The Together AI Solution

Hedra evaluated all major cloud providers and AI infrastructure platforms but chose Together AI for three capabilities that hyperscalers couldn't match:

Joint Engineering Collaboration: Together AI's kernel optimization team works directly with Hedra to maximize GPU performance. Recent collaboration ported Hedra's models to Blackwell architecture, achieving 2x faster inference compared to previous hardware generations. "We wanted to work with the people that were writing kernels that we're using to accelerate our model," says Michael.

Technical Architecture: Hedra's FastAPI backend connects to Celery queues that dispatch jobs to Together AI's inference engine. The platform provides comprehensive monitoring and multi-cluster redundancy, with thousands of H100/H200 GPUs for training and auto-scaling inference clusters distributed across half a dozen Together AI data centers. Together AI's managed Slurm clusters with Weka storage enable high-performance data access for datasets that grew from terabytes to "multiple dozens of petabytes of data in a very small amount of time."
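
As a rough illustration of that request path, here is a minimal sketch of a FastAPI endpoint handing work to a Celery queue. Everything named here (the generate_video task, the broker URLs, the submit_to_inference_cluster helper) is a hypothetical stand-in, not Hedra's actual code or Together AI's API:

```python
# Minimal sketch of the FastAPI -> Celery dispatch pattern described above.
# Task names, broker URLs, and the inference helper are assumptions.
import uuid

from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
celery_app = Celery(
    "video_jobs",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

class VideoRequest(BaseModel):
    prompt: str
    duration_seconds: int = 5

def submit_to_inference_cluster(prompt: str, duration_seconds: int) -> str:
    """Hypothetical stand-in for the call to the GPU inference clusters."""
    return f"job-{uuid.uuid4().hex[:8]}"

@celery_app.task(name="generate_video")
def generate_video(prompt: str, duration_seconds: int) -> str:
    # Workers pull from the queue, so the API stays responsive while
    # long-running GPU jobs execute on the inference clusters.
    return submit_to_inference_cluster(prompt, duration_seconds)

@app.post("/videos")
async def create_video(req: VideoRequest):
    # Enqueue and return immediately; clients poll for the result.
    task = generate_video.delay(req.prompt, req.duration_seconds)
    return {"task_id": task.id, "status": "queued"}
```

The key design point is the queue in the middle: the API can absorb a viral traffic spike by letting the backlog grow while GPU capacity scales up behind it.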

Flexible Auto-Scaling: Together AI provides dynamic scaling that handles viral traffic spikes without advance planning. "Having that fluidity of compute to be able to have a strong base commit that covers our daily usage but that ability to incrementally flex up with auto-scaling is something that very few providers can offer," explains Michael.
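
The "base commit plus flex" policy Michael describes can be pictured with a simple capacity rule. Every number and function below is invented for illustration and does not reflect Together AI's actual scaling API or Hedra's real capacity figures:

```python
# Illustrative "base commit plus flex" scaling policy; all figures assumed.
BASE_COMMIT_GPUS = 64      # reserved capacity covering typical daily usage
MAX_FLEX_GPUS = 256        # ceiling for burst capacity during viral spikes
JOBS_PER_GPU = 4           # target in-flight generations per GPU

def desired_gpu_count(queue_depth: int) -> int:
    """Stay at the base commitment; flex up only when queued work demands it."""
    needed = -(-queue_depth // JOBS_PER_GPU)  # ceiling division
    return min(max(needed, BASE_COMMIT_GPUS), MAX_FLEX_GPUS)

# A viral spike queues 600 jobs -> flex to 150 GPUs; once the queue
# drains back to 40 jobs, capacity returns to the 64-GPU base commit.
assert desired_gpu_count(600) == 150
assert desired_gpu_count(40) == 64
```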

The deployment workflow exemplifies this engineering partnership: Hedra pushes code to a joint repository with Together AI, triggering automated deployment through staging to production in 5-10 minutes. Together AI's engineers often help validate deployments by testing with their own video generations, while 24/7 engineering support through shared Slack channels enables immediate issue resolution during viral traffic spikes.
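
A validation gate like the one described, where a fresh build is checked by actually generating a video, could be sketched as a polling smoke test. The staging URL and response fields below are assumptions for illustration only:

```python
# Hypothetical post-deploy smoke test: submit one generation on staging and
# poll for completion. The endpoint and response shape are assumed.
import json
import time
import urllib.request

STAGING_URL = "https://staging.example.internal/videos"  # assumed endpoint

def smoke_test(prompt: str = "a short test clip", timeout_s: float = 600.0) -> bool:
    """Return True if one end-to-end generation succeeds before the deadline."""
    body = json.dumps({"prompt": prompt}).encode()
    req = urllib.request.Request(
        STAGING_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        task_id = json.load(resp)["task_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{STAGING_URL}/{task_id}") as resp:
            if json.load(resp).get("status") == "succeeded":
                return True
        time.sleep(5)  # back off between polls
    return False
```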

Business Impact

Cost Optimization: Hedra achieved a 60% total cost reduction through Together AI's competitive pricing and performance improvements. This reduction, combined with halving GPU time per video over 12 weeks of optimization, directly enables competitive pricing while maintaining healthy unit economics across millions of video generations.
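
Those two figures imply a rough decomposition: halving GPU time per video accounts for a 50% reduction on its own, so pricing and kernel-level gains must contribute roughly a further 20% cut to reach 60% in total. A back-of-envelope check, using a normalized baseline rather than any real Hedra cost figure:

```python
# Back-of-envelope decomposition of the savings quoted above; the baseline
# is normalized to 1.0 and is not a real cost figure.
baseline = 1.0
after_halved_gpu_time = baseline * 0.5        # "halving GPU time per video"
after_total_savings = baseline * (1 - 0.60)   # "60% total cost reduction"

# Residual factor attributable to pricing and other optimizations:
residual = after_total_savings / after_halved_gpu_time
print(f"{residual:.2f}")  # 0.80 -> roughly a further 20% reduction
```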

Operational Efficiency: The partnership eliminates the need for 1-2 platform engineers who would otherwise manage auto-scaling, Kubernetes orchestration, observability dashboards, and GPU procurement negotiations. "We probably save one to two platform engineers because otherwise we'd be managing our own auto-scaling," explains Michael.

Enterprise Enablement: Consistent performance during viral spikes enabled Hedra to offer SLA guarantees to enterprise customers, supporting their transition from prosumer to enterprise business model.

Partnership Evolution: The relationship evolved seamlessly with Hedra's business growth. Michael can text Together AI's team for emergency GPU additions, with 20% cluster expansions completed within hours. This direct access to engineering resources provides the operational agility needed for rapid business growth without infrastructure constraints.

Technical Innovation

Hedra's video models require fundamentally different infrastructure than text generation models. Video content creation involves massive datasets and complex multi-modal processing combining image, audio, and video inputs.

Together AI's infrastructure provides managed storage through Weka, object storage integration with Tigris, and high-speed data processing capabilities. The platform supports novel model architectures that require custom development and optimization.
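
Tigris exposes an S3-compatible API, so pulling training data can look like ordinary S3 access. In the sketch below, the endpoint URL, bucket, and key are illustrative assumptions rather than Hedra's actual configuration:

```python
# Hedged sketch of streaming a training shard from S3-compatible object
# storage; the endpoint URL, bucket, and key are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://fly.storage.tigris.dev",  # assumed Tigris endpoint
)

# Stream the shard in chunks rather than loading it whole, which matters
# once datasets grow from terabytes toward dozens of petabytes.
obj = s3.get_object(Bucket="example-training-data", Key="shards/shard-00000.tar")
for chunk in obj["Body"].iter_chunks(chunk_size=8 * 1024 * 1024):
    _ = chunk  # placeholder: hand each chunk to the preprocessing pipeline
```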

"We really wanted to work with AI infrastructure experts who could help us figure out how to get the most out of our GPUs," explains Michael. "Together AI’s team has done a really amazing job helping us optimize our model performance to give us one of the fastest and best cost-performing models in the industry."

"Training our omnimodal Character-3 model required infrastructure designed for large-scale AI. The Together Frontier AI Factory delivered the performance we needed to push the boundaries of multimodal video generation. Together AI understands what builders need — and that made all the difference." — Michael Lingelbach, CEO and Founder, Hedra

Unlock Peak GPU Performance with Together Kernels Lab

Pioneering the next frontier of computational efficiency, Together Kernels Lab is a world-class research team dedicated to pushing hardware to its absolute limits. Led by Dan Fu, VP of Kernels and co-creator of FlashAttention and ThunderKittens, our experts engineer bleeding-edge GPU kernels that redefine speed and scalability for AI workloads. We democratize innovation through open-source contributions, delivering state-of-the-art acceleration tools that empower developers and researchers worldwide. The Together Kernel Collection accelerates some of the most common operations in AI training (10% faster) and inference (75% faster). Experience the future of high-performance computing—built on transparency, rigor, and hardware-optimized brilliance.


Use case details

Highlights

  • 60% total cost savings
  • 2x faster inference on Blackwell
  • 300x GPU usage growth
  • 5-second video generation time
  • Viral content served at millions of videos per month

Use case

AI video generation for viral social content and enterprise media automation.

Company segment

AI-native startup
