Llama
Maverick has landed. Let Together AI satisfy your need for speed.
The full lineup of Llama 4 and 3 models, enabling custom enterprise AI, with complete model ownership.

Get Started in Minutes
Deploy Llama models with just a few lines of code. Switch from closed models to Llama instantly with OpenAI-compatible endpoints.
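For example, pointing the standard OpenAI Python client at Together's endpoint is a two-line change. A minimal sketch (the model identifier shown is an assumption; check the model library for the exact string):

```python
# Minimal sketch: reuse the standard OpenAI client against Together's
# OpenAI-compatible endpoint. Only api_key and base_url change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TOGETHER_API_KEY",          # from your Together account
    base_url="https://api.together.xyz/v1",   # Together's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize the Llama 4 lineup in one sentence."}],
)
print(response.choices[0].message.content)
```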
Why Llama on Together AI?
Closed models lock you in.
Take the Llama leap to open source.
Class-leading open models for multimodal reasoning, long-context understanding, and efficient enterprise deployment.
Unmatched Performance
Outperforms GPT-4o and Gemini 2.0 Flash across key benchmarks, including MMMU (73.4) and MMLU-Pro (80.5)
Llama 4 Maverick beats GPT-4o on key benchmarks
Built for Efficiency
Mixture-of-experts routing activates only 17B of Maverick's 400B parameters per token, delivering low latency and pricing starting at $0.19 per 1M tokens.
ELO 1417 on LMArena at breakthrough pricing
Full Model Control
Download the weights or call the API—deploy on Together’s cloud or on-prem. No vendor lock-in.
Complete data & model ownership vs closed models
Meet the Whole Llama Herd
From cost-effective reasoning to massive-scale multimodal understanding, choose the Llama model that fits your needs.

A real* llama herd seen hanging out near the Golden Gate Bridge.
Breakthrough Technical Innovations
Llama 4 introduces game-changing architectural advances that redefine what's possible with open-source AI.
Mixture of Experts (MoE)
First Llama models with MoE architecture. Only activates a fraction of parameters per token, delivering higher quality at lower computational cost.
Maverick: 17B active of 400B total parameters
Native Multimodality
Early fusion architecture seamlessly integrates text and vision tokens into a unified model backbone, jointly pre-trained on text, image, and video data.
Training: Up to 48 images in pre-training
iRoPE Architecture
Revolutionary interleaved attention layers without positional embeddings, enabling industry-leading 10M token context length with superior generalization.
Scout: 10M context window, 75x longer than GPT-4o
MetaP Training
Novel training technique for reliably setting critical model hyper-parameters, enabling efficient FP8 precision training at scale.
Efficiency: 390 TFLOPs/GPU on 32K GPUs
Advanced Distillation
Novel distillation from 288B parameter Llama 4 Behemoth teacher model, with dynamic weighting of soft and hard targets through training.
Teacher: Outperforms GPT-4.5, Claude Sonnet 3.7, Gemini 2.0 Pro
Massive Scale
Trained on 30+ trillion tokens (2x Llama 3) across 200 languages, including 100+ with over 1B tokens each. 10x more multilingual tokens than Llama 3.
Scale: 30T tokens, 200 languages
Deploy on Together AI
Access Llama models through Together's optimized inference platform.
Serverless Endpoints
Pay-per-token pricing with automatic scaling. Perfect for getting started or variable workloads.
Best for:
Prototyping and development
Variable or unpredictable traffic
Cost optimization for low volume
Getting started quickly
Llama 4 Scout:
$0.18/1M tokens
Llama 4 Maverick:
$0.27/1M tokens

On-Demand Dedicated
Dedicated GPU capacity with guaranteed performance. No rate limits. Built for production.
Best for:
Production applications
Extended model library access
Predictable latency requirements
Enterprise SLA needs
Llama 4 Scout:
$0.45/minute (8x H100)
Llama 4 Maverick:
$0.45/minute (8x H100)

Monthly Reserved
Committed GPU capacity, enterprise features and volume discounts. Optimized for scale.
Best for:
High-volume committed usage
Enterprise security requirements
Priority hardware access
Maximum cost efficiency
Reserved GPU pricing:
Starting $0.98/hr
Volume Discounts:
Up to 40% savings
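For a rough sense of when each tier pays off, here is a back-of-the-envelope sketch using the prices listed above; the monthly token volume and always-on utilization are illustrative assumptions, not recommendations:

```python
# Back-of-the-envelope cost comparison using the prices listed above.
# Monthly token volume and hours of use are illustrative assumptions.
MAVERICK_SERVERLESS = 0.27 / 1_000_000   # $ per token (serverless, listed above)
DEDICATED_PER_MIN = 0.45                 # $ per minute (8x H100, listed above)

monthly_tokens = 500_000_000             # assumed workload: 500M tokens/month
dedicated_hours = 24 * 30                # assumed always-on dedicated endpoint

serverless_cost = monthly_tokens * MAVERICK_SERVERLESS
dedicated_cost = dedicated_hours * 60 * DEDICATED_PER_MIN

print(f"Serverless: ${serverless_cost:,.0f}/month")   # $135/month
print(f"Dedicated:  ${dedicated_cost:,.0f}/month")    # $19,440/month
```

At the assumed volume, serverless is far cheaper; dedicated and reserved capacity pay off once sustained throughput keeps the GPUs busy.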
Enterprise-Grade Security
Your data and models remain fully under your control with industry-leading security standards.
SOC 2 Type II
Comprehensive security controls audited by third parties.
HIPAA Compliant
Healthcare-grade data protection for sensitive workloads.
Model Ownership
You own your fine-tuned models and can deploy anywhere.
Data Privacy
Your data never trains our models or leaves your control.
Real Performance Benchmarks
See how Llama 4 models stack up against the competition on actual benchmarks that matter.
Trusted by Industry Leaders
See how companies are using Llama models to transform their AI applications.
"Our endeavor is to deliver exceptional customer experience at all times. Together AI has been our long standing partner and with Together Inference Engine 2.0 and Together Turbo models, we have been able to provide high quality, fast, and accurate support that our customers demand at tremendous scale."
Rinshul Chandra
COO, Food Delivery, Zomato

"Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards. As an AI-forward publication, we look forward to expanding our collaboration with Together AI for larger-scale in-house efforts."
Vineet Khosla
CTO, The Washington Post

"We've been thoroughly impressed with the Together Enterprise Platform. It has delivered a 2x reduction in latency (time to first token) and cut our costs by approximately a third. These improvements allow us to launch AI-powered features and deliver lightning-fast experiences faster than ever before."
Caiming Xiong
VP, Salesforce AI Research
Try Llama Models Now - Free
Experience the performance difference in Together Chat.
Frequently Asked Questions
How do Llama 4 models compare to GPT-4o and other frontier models?
Llama 4 Maverick beats GPT-4o and Gemini 2.0 Flash across key benchmarks including MMMU (73.4 vs 69.1), LiveCodeBench (43.4 vs 32.3), and image understanding. It's competitive with the much larger DeepSeek V3.1 on coding and reasoning while using less than half the active parameters.
What makes the Mixture of Experts (MoE) architecture special?
Llama 4 models use alternating dense and MoE layers for inference efficiency. Each token activates only a fraction of total parameters (17B of 400B for Maverick), dramatically improving inference efficiency while maintaining quality. This enables single H100 deployment for Scout with Int4 quantization and exceptional performance-to-cost ratios.
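For intuition, here is a minimal sketch of top-k expert routing of the kind described above. The dimensions, expert count, and k are illustrative and not Llama 4's actual configuration:

```python
# Minimal sketch of top-k mixture-of-experts routing (illustrative sizes,
# not Llama 4's actual configuration). Each token is routed to only k experts,
# so only a fraction of total parameters is active per token.
import numpy as np

d_model, n_experts, k = 64, 8, 2
rng = np.random.default_rng(0)

# One tiny feed-forward "expert" per slot; a router scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):                        # x: (tokens, d_model)
    logits = x @ router                  # router scores per expert
    top = np.argsort(-logits, axis=-1)[:, :k]   # pick the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the k selected experts' scores
        w = np.exp(logits[t, top[t]] - logits[t, top[t]].max())
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])   # only k of n_experts run
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)           # (4, 64): same shape, sparse compute
```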
How does the 10M context window work in practice?
Llama 4 Scout's iRoPE architecture with interleaved attention layers enables true 10M token processing. This means you can process entire codebases, multiple research papers, or extensive user histories in a single request.
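For example, a whole repository can be concatenated into a single prompt. A minimal sketch (the model identifier and project path are assumptions for illustration):

```python
# Sketch: feeding an entire codebase into one request via Scout's long context.
# Model ID and repo path are illustrative assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="YOUR_TOGETHER_API_KEY",
                base_url="https://api.together.xyz/v1")

# Concatenate every Python file in the project into a single prompt.
code = "\n\n".join(
    f"# FILE: {p}\n{p.read_text()}" for p in Path("my_project").rglob("*.py")
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model ID
    messages=[{"role": "user",
               "content": f"Review this codebase for bugs:\n\n{code}"}],
)
print(response.choices[0].message.content)
```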
What is native multimodality and why does it matter?
Unlike models that bolt on vision capabilities later, Llama 4 uses early fusion to jointly pre-train text and vision tokens in a unified backbone. This enables superior image understanding, grounding, and the ability to reason across multiple images simultaneously. Models were pre-trained on up to 48 images.
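Through the OpenAI-compatible chat format, multiple images can be passed alongside text in a single request. A minimal sketch (the model identifier and image URLs are assumptions for illustration):

```python
# Sketch: multi-image understanding via the OpenAI-compatible chat format.
# Model ID and image URLs are illustrative assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_TOGETHER_API_KEY",
                base_url="https://api.together.xyz/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What changed between these two screenshots?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```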
How does Llama 4 Behemoth compare to other frontier models?
Llama 4 Behemoth (288B active parameters, ~2T total) outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks like MATH-500 and GPQA Diamond. It serves as the teacher model for distilling Scout and Maverick, enabling their exceptional performance at much smaller sizes.
What deployment options are available for Llama 4 models?
Llama 4 Scout fits on a single H100 GPU with quantization, while Maverick fits on a single H100 host or can use distributed inference. Both models support serverless endpoints, dedicated deployments, VPC hosting, and on-premise deployment. You maintain full model ownership and can migrate freely between providers.