Scaling AI Companions: How Dippy AI Reached Over 4 Million Tokens/Minute with Together Dedicated Endpoints

0.4 seconds
Time to First Token (median)
4.1M tokens/min
Peak volume (P99)
3.44 seconds
Latency (average)
Executive Summary
In early 2025, Dippy AI’s more than 4 million users had created and published over 200,000 unique AI characters and exchanged more than 300 million messages with them through Dippy AI’s web and mobile apps.
As their user base grew, Dippy AI’s team faced a new challenge: securing reliable inference infrastructure at scale so they could stay fully focused on building user-facing features.
By partnering with Together AI engineers, Dippy AI deployed their custom models on Together Dedicated Endpoints. Leveraging this highly optimized GPU infrastructure, the company now seamlessly handles volumes of 4M+ tokens/minute with optimal “throughput per dollar”, without spending engineering time managing the infrastructure itself.
Meet Dippy AI
Imagine an AI companion that is always available to chat with you—whether you're seeking someone to talk to late at night, crafting fan-fiction adventures, or practicing important conversations. That's the vision behind Dippy AI, founded by Akshat Jagga and Angad Arneja in April 2024.

As is often the case with startups, Dippy AI’s founders quickly found their users turning to these AI companions for far more than originally anticipated: some find comfort chatting with an AI character late at night to combat loneliness, while others spend hours role-playing adventures inspired by their favorite movies or TV series.
“We've seen cases of people who use Dippy at 3 AM. They might not have anyone to text, so they chat with an AI therapist for comfort. Others create spin-offs of characters from their favorite series and will chat with them for hours.” – Akshat Jagga, CEO
AI Inference: The Core of Dippy AI
AI inference—the technology powering conversations between users and their AI companions—is central to Dippy’s experience. Without inference, the interactive conversations users love simply wouldn't exist.
“Inference is the core piece—without LLM inference, we essentially wouldn’t have a product.” – Manav Shah, Founding Engineer
Initially, Dippy used large, general-purpose AI models with 100B+ parameters. These models were effective but expensive and challenging to scale, especially once Dippy realized that their product usage was quite cyclical throughout the day, fluctuating according to their core users’ schedules.
As the team learned more about user behaviors, they began transitioning toward smaller, specialized AI models that were equally engaging but easier and more cost-effective to manage.
Preparing for Scale
Dippy AI originally managed its AI inference infrastructure independently. However, the rapid growth of their user base quickly led to challenges: engineering resources got tied up managing infrastructure issues, pulling focus away from improving the app and enhancing the user experience.
“We were hosting ourselves initially, and when we hit scale, we needed to offload inference optimization to focus on features and improving the AI.” – Manav Shah, Founding Engineer
Latency and cost quickly became key concerns for the business, so Dippy AI looked for a partner who would provide the right infrastructure and guidance to get the best price-performance ratio at scale.
Finding the Right Partner
After learning about Together AI’s end-to-end AI platform, including its own GPU clusters and an inference engine optimized by Together AI researchers, Dippy AI saw an opportunity to work with a partner that deeply understood these challenges and was ready to run and continuously optimize this infrastructure on their behalf.
Bringing their own custom model, Dippy worked closely with Together AI engineers to determine the ideal Together Dedicated Endpoints deployment for their needs. After experimenting with different configurations, we found that NVIDIA HGX H100 GPUs provided the optimal “throughput per dollar” for Dippy’s specific use case, volume, and usage patterns.
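As a rough illustration of what serving through a dedicated endpoint looks like from the application side, here is a minimal sketch using the Together Python SDK. The model identifier is hypothetical, standing in for the name of the deployed endpoint shown in the Together dashboard:

```python
# Minimal sketch: streaming a reply from a model served on a Together Dedicated Endpoint.
# The model identifier below is hypothetical; in practice it is the name of the deployed
# endpoint shown in the Together dashboard. Requires the TOGETHER_API_KEY env variable.
from together import Together

client = Together()

stream = client.chat.completions.create(
    model="dippy/companion-model",  # hypothetical dedicated-endpoint model id
    messages=[
        {"role": "system", "content": "You are a warm, attentive AI companion."},
        {"role": "user", "content": "Hey, how was your day?"},
    ],
    max_tokens=256,
    stream=True,  # stream tokens as they are generated for a snappy first response
)

for chunk in stream:
    delta = chunk.choices[0].delta.content  # incremental piece of the reply
    if delta:
        print(delta, end="", flush=True)
```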
Unlike the other providers they tested, Together AI was uniquely prepared to provide LLM optimizations that allowed Dippy to quickly test and refine their smaller, specialized models.
“We chose Together Dedicated Endpoints due to a combination of cost, latency, and having dedicated optimization without needing internal resources.” – Akshat Jagga, CEO
Handling Peak Volumes
Dippy AI’s highly optimized LLMs reach global peak volumes of over 4 million tokens per minute. Given their cyclical traffic patterns and potentially spiky usage, they also needed to ensure that the infrastructure could seamlessly handle peak volumes.
With the out-of-the-box auto-scaling of Together Dedicated Endpoints, Dippy experienced predictable, steady availability, with no capacity issues. As a result, their users experience consistent, uninterrupted interactions, even during busy periods.
Reliability, Throughput, and Focus
Once Dippy AI started serving their optimized models through Together Dedicated Endpoints in production, they consistently met and exceeded their KPI targets:
- Time to First Token (TTFT): Reduced to 0.4 seconds (median).
- Throughput: Managed peak volumes up to 4.1 million tokens/minute (99th percentile).
- Latency: Reduced to an average of 3.44 seconds.
Beyond their immediate positive impact on product KPIs, these gains allowed Dippy AI’s team to focus fully on improving the product and enhancing the user experience, without having to worry about their infrastructure.
“Latency wasn't initially our main concern; it was more about throughput, cost, uptime, and reliability. Together AI delivered on these, allowing us to focus on building user-facing features.” – Manav Shah, Founding Engineer
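For readers curious how metrics like TTFT and end-to-end latency are typically tracked, they can be approximated client-side by timing a streaming request. A minimal sketch, again assuming the Together Python SDK and a hypothetical endpoint name:

```python
# Rough client-side measurement of time-to-first-token (TTFT) and total request latency
# against a dedicated endpoint. The model identifier is hypothetical.
import time
from together import Together

client = Together()

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="dippy/companion-model",  # hypothetical dedicated-endpoint model id
    messages=[{"role": "user", "content": "Tell me a short story about a lighthouse."}],
    max_tokens=200,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible token arrives
        chunks += 1  # streamed chunks, a rough proxy for generated tokens

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT:          {first_token_at - start:.2f}s")
print(f"Total latency: {end - start:.2f}s ({chunks} chunks streamed)")
```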
Dippy also valued Together’s quick and responsive support. From joining discussions on Discord to creating tailored analytics dashboards, Together’s hands-on approach simplified Dippy's scaling process.
Future Developments
We're excited to see Dippy AI continue building on Together Dedicated Endpoints to support upcoming innovations like voice calls with their AI companions, powered by state-of-the-art AI audio models.
Together AI remains committed to enabling Dippy to deliver richer, more meaningful experiences to their growing community of millions.
Use case details
Products used
- Together Dedicated Endpoints
Highlights
- Reduced TTFT to 0.4 seconds
- Peak volumes up to 4.1M tokens/min
- Reduced latency to 3.44 seconds
- Team able to focus 100% on product
Use case
Deploy custom models at scale
Company segment
AI Native Startup