Llama
Maverick has landed. Let Together AI satisfy your need for speed.
The full lineup of Llama 4 and 3 models, enabling custom enterprise AI, with complete model ownership.

Get Started in Minutes
Deploy Llama models with just a few lines of code. Switch from closed models to Llama instantly with OpenAI-compatible endpoints.
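For example, pointing the standard OpenAI Python client at Together's endpoint is a two-line change. A minimal sketch (the model identifier shown is an assumption; check the model library for the exact string):

```python
# Minimal sketch: reuse the standard OpenAI client against Together's
# OpenAI-compatible endpoint. Only api_key and base_url change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TOGETHER_API_KEY",          # from your Together account
    base_url="https://api.together.xyz/v1",   # Together's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize the Llama 4 lineup in one sentence."}],
)
print(response.choices[0].message.content)
```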
Why Llama on Together AI?
Closed models lock you in.
Take the Llama leap to open source.
Class-leading open models for multimodal reasoning, long-context understanding, and efficient enterprise deployment.
Unmatched Performance
Outperforms GPT-4o and Gemini 2.0 Flash across key benchmarks, including MMMU (73.4) and MMLU-Pro (80.5)
Llama 4 Maverick beats GPT-4o on key benchmarks
Built for Efficiency
Mixture-of-experts routing activates only 17B of Maverick's 400B parameters per token, delivering low latency and pricing starting at $0.19 per 1M tokens.
ELO 1417 on LMArena at breakthrough pricing
Full Model Control
Download the weights or call the API—deploy on Together’s cloud or on-prem. No vendor lock-in.
Complete data & model ownership vs closed models
Meet the Whole Llama Herd
From cost-effective reasoning to massive-scale multimodal understanding, choose the Llama model that fits your needs.

A real* llama herd seen hanging out near the Golden Gate Bridge.
Breakthrough Technical Innovations
Llama 4 introduces game-changing architectural advances that redefine what's possible with open-source AI.
Mixture of Experts (MoE)
First Llama models with MoE architecture. Only activates a fraction of parameters per token, delivering higher quality at lower computational cost.
Maverick: 17B active of 400B total parameters
Native Multimodality
Early fusion architecture seamlessly integrates text and vision tokens into a unified model backbone, jointly pre-trained on text, image, and video data.
Training: Up to 48 images in pre-training
iRoPE Architecture
Revolutionary interleaved attention layers without positional embeddings, enabling industry-leading 10M token context length with superior generalization.
Scout: 10M context window, 75x longer than GPT-4o
MetaP Training
Novel training technique for reliably setting critical model hyper-parameters, enabling efficient FP8 precision training at scale.
Efficiency: 390 TFLOPs/GPU on 32K GPUs
Advanced Distillation
Novel distillation from 288B parameter Llama 4 Behemoth teacher model, with dynamic weighting of soft and hard targets through training.
Teacher: Outperforms GPT-4.5, Claude Sonnet 3.7, Gemini 2.0 Pro
Massive Scale
Trained on 30+ trillion tokens (2x Llama 3) across 200 languages, including 100+ with over 1B tokens each. 10x more multilingual tokens than Llama 3.
Scale: 30T tokens, 200 languages
Deploy on Together AI
Access Llama models through Together's optimized inference platform.
Serverless Endpoints
Pay-per-token pricing with automatic scaling. Perfect for getting started or variable workloads.
Best for:
Prototyping and development
Variable or unpredictable traffic
Cost optimization for low volume
Getting started quickly
Llama 4 Scout:
$0.18/1M tokens
Llama 4 Maverick:
$0.27/1M tokens

On-Demand Dedicated
Dedicated GPU capacity with guaranteed performance. No rate limits. Built for production.
Best for:
Production applications
Extended model library access
Predictable latency requirements
Enterprise SLA needs
Llama 4 Scout:
$0.45/minute (8x H100)
Llama 4 Maverick:
$0.45/minute (8x H100)

Monthly Reserved
Committed GPU capacity, enterprise features and volume discounts. Optimized for scale.
Best for:
High-volume committed usage
Enterprise security requirements
Priority hardware access
Maximum cost efficiency
Reserved GPU pricing:
Starting $0.98/hr
Volume Discounts:
Up to 40% savings
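For a rough sense of when each tier pays off, here is a back-of-the-envelope sketch using the prices listed above; the monthly token volume and always-on utilization are illustrative assumptions, not recommendations:

```python
# Back-of-the-envelope cost comparison using the prices listed above.
# Monthly token volume and hours of use are illustrative assumptions.
MAVERICK_SERVERLESS = 0.27 / 1_000_000   # $ per token (serverless, listed above)
DEDICATED_PER_MIN = 0.45                 # $ per minute (8x H100, listed above)

monthly_tokens = 500_000_000             # assumed workload: 500M tokens/month
dedicated_hours = 24 * 30                # assumed always-on dedicated endpoint

serverless_cost = monthly_tokens * MAVERICK_SERVERLESS
dedicated_cost = dedicated_hours * 60 * DEDICATED_PER_MIN

print(f"Serverless: ${serverless_cost:,.0f}/month")   # $135/month
print(f"Dedicated:  ${dedicated_cost:,.0f}/month")    # $19,440/month
```

At the assumed volume, serverless is far cheaper; dedicated and reserved capacity pay off once sustained throughput keeps the GPUs busy.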
Enterprise-Grade Security
Your data and models remain fully under your control with industry-leading security standards.
SOC 2 Type II
Comprehensive security controls audited by third parties.
HIPAA Compliant
Healthcare-grade data protection for sensitive workloads.
Model Ownership
You own your fine-tuned models and can deploy anywhere.
Data Privacy
Your data never trains our models or leaves your control.
Real Performance Benchmarks
See how Llama 4 models stack up against the competition on actual benchmarks that matter.
Trusted by Industry Leaders
See how companies are using Llama models to transform their AI applications.
"Our endeavor is to deliver exceptional customer experience at all times. Together AI has been our long standing partner and with Together Inference Engine 2.0 and Together Turbo models, we have been able to provide high quality, fast, and accurate support that our customers demand at tremendous scale."
Rinshul Chandra
COO, Food Delivery, Zomato

"Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards. As an AI-forward publication, we look forward to expanding our collaboration with Together AI for larger-scale in-house efforts."
Vineet Khosla
CTO, The Washington Post

"We've been thoroughly impressed with the Together Enterprise Platform. It has delivered a 2x reduction in latency (time to first token) and cut our costs by approximately a third. These improvements allow us to launch AI-powered features and deliver lightning-fast experiences faster than ever before."
Caiming Xiong
VP, Salesforce AI Research
Try Llama Models Now - Free
Experience the performance difference in Together Chat.
Frequently Asked Questions
How do Llama 4 models compare to GPT-4o and other frontier models?
Llama 4 Maverick beats GPT-4o and Gemini 2.0 Flash across key benchmarks including MMMU (73.4 vs 69.1), LiveCodeBench (43.4 vs 32.3), and image understanding. It's competitive with the much larger DeepSeek V3.1 on coding and reasoning while using less than half the active parameters.
What makes the Mixture of Experts (MoE) architecture special?
Llama 4 models use alternating dense and MoE layers for inference efficiency. Each token activates only a fraction of total parameters (17B of 400B for Maverick), dramatically improving inference efficiency while maintaining quality. This enables single H100 deployment for Scout with Int4 quantization and exceptional performance-to-cost ratios.
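For intuition, here is a minimal sketch of top-k expert routing of the kind described above. The dimensions, expert count, and k are illustrative and not Llama 4's actual configuration:

```python
# Minimal sketch of top-k mixture-of-experts routing (illustrative sizes,
# not Llama 4's actual configuration). Each token is routed to only k experts,
# so only a fraction of total parameters is active per token.
import numpy as np

d_model, n_experts, k = 64, 8, 2
rng = np.random.default_rng(0)

# One tiny feed-forward "expert" per slot; a router scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):                        # x: (tokens, d_model)
    logits = x @ router                  # router scores per expert
    top = np.argsort(-logits, axis=-1)[:, :k]   # pick the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the k selected experts' scores
        w = np.exp(logits[t, top[t]] - logits[t, top[t]].max())
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])   # only k of n_experts run
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)           # (4, 64): same shape, sparse compute
```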
How does the 10M context window work in practice?
Llama 4 Scout's iRoPE architecture with interleaved attention layers enables true 10M token processing. This means you can process entire codebases, multiple research papers, or extensive user histories in a single request.
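For example, a whole repository can be concatenated into a single prompt. A minimal sketch (the model identifier and project path are assumptions for illustration):

```python
# Sketch: feeding an entire codebase into one request via Scout's long context.
# Model ID and repo path are illustrative assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="YOUR_TOGETHER_API_KEY",
                base_url="https://api.together.xyz/v1")

# Concatenate every Python file in the project into a single prompt.
code = "\n\n".join(
    f"# FILE: {p}\n{p.read_text()}" for p in Path("my_project").rglob("*.py")
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model ID
    messages=[{"role": "user",
               "content": f"Review this codebase for bugs:\n\n{code}"}],
)
print(response.choices[0].message.content)
```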
What is native multimodality and why does it matter?
Unlike models that bolt on vision capabilities later, Llama 4 uses early fusion to jointly pre-train text and vision tokens in a unified backbone. This enables superior image understanding, grounding, and the ability to reason across multiple images simultaneously. Models were pre-trained on up to 48 images.
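Through the OpenAI-compatible chat format, multiple images can be passed alongside text in a single request. A minimal sketch (the model identifier and image URLs are assumptions for illustration):

```python
# Sketch: multi-image understanding via the OpenAI-compatible chat format.
# Model ID and image URLs are illustrative assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_TOGETHER_API_KEY",
                base_url="https://api.together.xyz/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What changed between these two screenshots?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```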
How does Llama 4 Behemoth compare to other frontier models?
Llama 4 Behemoth (288B active parameters, ~2T total) outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks like MATH-500 and GPQA Diamond. It serves as the teacher model for distilling Scout and Maverick, enabling their exceptional performance at much smaller sizes.
What deployment options are available for Llama 4 models?
Llama 4 Scout fits on a single H100 GPU with quantization, while Maverick fits on a single H100 host or can use distributed inference. Both models support serverless endpoints, dedicated deployments, VPC hosting, and on-premise deployment. You maintain full model ownership and can migrate freely between providers.