How The Washington Post Achieved AI Independence with Reliable Inference

  • 2-second response time
  • 1.79B monthly input tokens
  • Thousands of queries per day

Executive Summary

The Washington Post needed reliable, independent AI inference for a public-facing journalism product.

They deployed open models on Together AI with dedicated endpoints and hybrid serverless capacity, maintaining full model control and predictable pricing.

As a result, they achieved consistent ~2-second responses, 1.79B tokens processed monthly, and enterprise-grade reliability for thousands of daily queries.

About The Washington Post

The Washington Post serves millions of users with award-winning journalism for all of America while pioneering AI innovation. Under CTO Vineet Khosla's leadership, The Post launched "Ask The Post AI" in November 2024, a feature that answers readers' questions using the publication's deeply sourced, fact-based reporting, powered by Together AI's model inference platform.


The Challenge

The Washington Post needed reliable, independent model inference to power its AI journalism platform without vendor lock-in.

Critical Requirements:

  • Enterprise-grade model inference handling thousands of daily queries
  • Sub-10-second response times for real-time user interactions
  • Complete model ownership vs. dependence on proprietary APIs
  • Mission-critical reliability for public-facing journalism features

Why Inference Alternatives Failed:

  • Proprietary Model Providers: Zero model control, unpredictable pricing, vendor dependency
  • Hyperscalers: Limited large GPU availability, higher inference costs
  • Self-Hosting Models: Massive MLOps overhead for model serving at scale

The Solution

Together AI's model inference platform delivered enterprise-grade performance with complete strategic independence.

Model Inference Infrastructure:

  • Dedicated Endpoints: Custom-configured GPU infrastructure for reliable Llama and Mistral hosting
  • Open-Source Model Access: Full control over model selection and deployment
  • Fine-Tuning Capabilities: Available for future experimentation and optimization
  • Hybrid Deployment: Combination of dedicated and serverless inference for optimal performance (see the sketch after this list)
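
For illustration, here is a minimal sketch of what such a hybrid call pattern can look like through Together AI's Python SDK. The dedicated endpoint name and fallback model ID are hypothetical placeholders, and the routing logic is an assumption for illustration, not The Post's actual implementation.

```python
# Minimal sketch: try a dedicated endpoint first, then fall back to
# serverless capacity. Endpoint and model names are hypothetical.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment


def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    try:
        # Primary path: a dedicated endpoint with reserved GPU capacity.
        resp = client.chat.completions.create(
            model="your-org/dedicated-llama-endpoint",  # hypothetical name
            messages=messages,
        )
    except Exception:
        # Overflow path: shared serverless inference for burst traffic.
        resp = client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
            messages=messages,
        )
    return resp.choices[0].message.content
```

The appeal of this pattern is that steady-state traffic gets the predictable latency and fixed cost of reserved capacity, while unexpected spikes spill over to serverless rather than being dropped.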

Partnership Support: Together AI's engineering team provided direct collaboration on infrastructure optimization and custom configurations.

"Together AI's flexibility in creating customized configurations for our journalism use case has been invaluable to our success."
— Sam Han, Chief AI Officer, The Washington Post.

Results

  • 1.79B monthly input tokens
  • 2-second model response time
  • Thousands of queries per day

Enterprise Impact Delivered

✓ Massive Scale: 1.79 billion input tokens processed monthly with consistent 2-second response times

✓ Cost Predictability: Fixed monthly pricing vs. volatile per-token costs that scale unpredictably

✓ User Engagement: Searches drive answer-expansion clicks and source-article clicks, giving readers more visibility into the sourced journalism

✓ Model Independence: The Post owns and controls its Llama/Mistral models rather than depending on proprietary APIs


Performance Comparison

Approach | Control | Performance | Cost Model
Together AI | Full model ownership | Consistent 2-second responses | Fixed monthly pricing
Proprietary APIs | Zero model control | Variable API performance | Unpredictable per-token costs
Self-Hosting | Model control | Complex infrastructure | High operational overhead

Use case details

Highlights

  • 2-second responses
  • 1.79B tokens per month
  • Fixed monthly pricing
  • Own Llama & Mistral models
  • Hybrid dedicated + serverless

Use case

AI search over newsroom content

Company segment

Large enterprise
