
Models / LLAMA

Llama

Maverick has landed. Let Together AI satisfy your need for speed.

The full lineup of Llama 4 and 3 models, enabling custom enterprise AI, with complete model ownership.

Get Started in Minutes

Deploy Llama models with just a few lines of code. Switch from closed models to Llama instantly, with OpenAI-compatible endpoints.

# Install the Together AI library
pip install together

# Get started with Llama 4 Maverick
from together import Together

client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": "Explain quantum computing in simple terms"
        }
    ]
)

print(response.choices[0].message.content)

View API Docs

Why Llama on Together AI?

Closed models lock you in.

Take the Llama leap to open source.

Class-leading open models for multimodal reasoning, long-context understanding, and efficient enterprise deployment.

Meet the Whole Llama Herd

From cost-effective reasoning to massive-scale multimodal understanding, choose the Llama model that fits your needs.

Llama 4 Maverick

Industry-leading multimodal intelligence

  • 17B

    Active Params

  • 400B

    Total Params

  • 128

    Experts (MoE)

  • 1417

    LMArena Elo

Key Strengths:

  • Beats GPT-4o and Gemini 2.0 Flash on key benchmarks

  • Superior image understanding and grounding

  • Best-in-class performance to cost ratio

  • Competitive with DeepSeek V3 at half the parameters

Llama 4 Scout

Class-leading long-context model

  • 17B

    Active Params

  • 109B

    Total Params

  • 16

    Experts (MoE)

  • 10M

    Context Length

Key Strengths:

  • Industry-leading 10M token context window

  • Superior image grounding and understanding

  • Fits on single H100 GPU with quantization

  • Multi-document analysis & entire codebase processing

Llama 3.3

Enhanced reasoning & mathematics

  • 70B

    Parameters

  • 128K

    Context

  • 405B-class

    Performance

  • FP8

    Turbo Speed

Key Strengths:

  • Complex mathematical reasoning

  • Advanced instruction following

  • High-accuracy text generation

  • Fast production deployments

Llama 3 8B Reference

Uncompressed baseline model

  • 8B

    Parameters

  • 8K

    Context

  • BF16

    Trained

  • FP16

    Quantized

Key Strengths:

  • Perfect for research and experimentation

  • Maximum accuracy requirements

  • Benchmark comparisons

  • Fine-tuning base model

Llama 3 70B Reference

High-precision baseline

  • 70B

    Parameters

  • 8K

    Context

  • BF16

    Trained

  • FP16

    Quantized

Key Strengths:

  • Perfect for academic research applications

  • Highest quality requirements

  • Performance benchmarking

  • Advanced fine-tuning projects

Llama 3.1 8B

High-efficiency performance

  • 8B

    Parameters

  • 128K

    Context

  • 73.0

    MMLU

  • 84.5

    GSM8K

Key Strengths:

  • Perfect for high-volume chatbots

  • Content classification

  • Text summarization

  • Cost-sensitive applications

Llama 3.1 70B

Balanced performance and efficiency

  • 70B

    Parameters

  • 128K

    Context

  • 86.0

    MMLU

  • FP8

    Quantized

Key Strengths:

  • Perfect for enterprise chat applications

  • Document processing & analysis

  • API integration & automation

  • High-volume production workload

Llama 3.1 405B

Maximum intelligence and capability

  • 405B

    Parameters

  • 128K

    Context

  • 88.6

    MMLU

  • FP8

    Quantized

Key Strengths:

  • Perfect for complex research and analysis

  • Advanced mathematical reasoning

  • Multi-step problem solving

  • Academic and scientific applications

Llama 3.2 11B Vision

Efficient multimodal processing

  • 11B

    Parameters

  • 128K

    Context

  • <1s

    image processing

  • 50.7

    MMMU

Key Strengths:

  • Perfect for image understanding and analysis, chart and diagram interpretation

  • Cost-effective vision applications

  • Document visual processing

Llama 3.2 90B Vision

Advanced Visual Intelligence

  • 90B

    Parameters

  • 128K

    Context

  • 78.1

    VQA

  • 85.5

    ChartQA

Key Strengths:

  • Perfect for complex visual reasoning tasks

  • Detailed image analysis

  • Technical diagram understanding

A real* llama herd seen hanging out near the Golden Gate Bridge.

Breakthrough Technical Innovations

Llama 4 introduces game-changing architectural advances that redefine what's possible with open-source AI.

  • Mixture of Experts (MoE)

    First Llama models with MoE architecture. Only activates a fraction of parameters per token, delivering higher quality at lower computational cost.

    Maverick: 17B active of 400B total parameters

  • Native Multimodality

    Early fusion architecture seamlessly integrates text and vision tokens into a unified model backbone, jointly pre-trained on text, image, and video data.

    Training: Up to 48 images in pre-training

  • iRoPE Architecture

    Revolutionary interleaved attention layers without positional embeddings, enabling industry-leading 10M token context length with superior generalization.

    Scout: 10M context window, 75x longer than GPT-4o

  • MetaP Training

    Novel training technique for reliably setting critical model hyper-parameters, enabling efficient FP8 precision training at scale.

    Efficiency: 390 TFLOPs/GPU on 32K GPUs

  • Advanced Distillation

    Novel distillation from the 288B-active-parameter Llama 4 Behemoth teacher model, with dynamic weighting of soft and hard targets through training.

    Teacher: Outperforms GPT-4.5, Claude Sonnet 3.7, Gemini 2.0 Pro

  • Massive Scale

    Trained on 30+ trillion tokens (2x Llama 3), with 200 languages including 100+ with over 1B tokens each. 10x more multilingual tokens than Llama 3.

    Scale: 30T tokens, 200 languages

Deploy on Together AI

Access Llama models through Together's optimized inference platform.

  • Serverless Endpoints

    Pay-per-token pricing with automatic scaling. Perfect for getting started or variable workloads.

    Best for:

    • Prototyping and development

    • Variable or unpredictable traffic

    • Cost optimization for low volume

    • Getting started quickly

    Llama 4 Scout:

    $0.18/1M tokens

    Llama 4 Maverick:

    $0.27/1M tokens

  • On-Demand Dedicated

    Dedicated GPU capacity with guaranteed performance. No rate limits. Built for production.

    Best for:

    • Production applications

    • Extended model library access

    • Predictable latency requirements

    • Enterprise SLA needs

    Llama 4 Scout:

    $0.45/minute (8x H100)

    Llama 4 Maverick:

    $0.45/minute (8x H100)

  • Monthly Reserved

    Committed GPU capacity, enterprise features and volume discounts. Optimized for scale.

    Best for:

    • High-volume committed usage

    • Enterprise security requirements

    • Priority hardware access

    • Maximum cost efficiency

    Reserved GPU pricing:
    Starting at $0.98/hr

    Volume Discounts:

    Up to 40% savings
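The serverless rates above translate into rough monthly costs like this (a minimal sketch; the per-1M-token rates are the ones quoted on this page, and actual billing may meter input and output tokens at different rates):

```python
# Per-1M-token serverless rates quoted above (USD).
RATES_PER_M = {
    "Llama 4 Scout": 0.18,
    "Llama 4 Maverick": 0.27,
}

def serverless_cost(model: str, tokens: int) -> float:
    """Estimated USD cost for `tokens` total tokens on a serverless endpoint."""
    return RATES_PER_M[model] * tokens / 1_000_000

# e.g. 50M tokens per month on Maverick:
print(f"${serverless_cost('Llama 4 Maverick', 50_000_000):.2f}")  # prints $13.50
```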

Enterprise-Grade Security

Your data and models remain fully under your control with industry-leading security standards.

  • SOC 2 Type II

    Comprehensive security controls audited by third parties.

  • HIPAA Compliant

    Healthcare-grade data protection for sensitive workloads.

  • Model Ownership

    You own your fine-tuned models and can deploy anywhere.

  • Data Privacy

    Your data never trains our models or leaves your control.

Real Performance Benchmarks

See how Llama 4 models stack up against the competition on actual benchmarks that matter.

| Model | MMMU (Image Reasoning) | LiveCodeBench (Coding) | MMLU Pro (Knowledge) | Cost per 1M tokens |
| --- | --- | --- | --- | --- |
| Llama 4 Maverick | 73.4 | 43.4 | 80.5 | $0.27 |
| Gemini 2.0 Flash | 71.7 | 34.5 | N/A | $0.17 |
| GPT-4o | 69.1 | 32.3 | N/A | $4.38 (output) |
| DeepSeek V3.1 | N/A | 45.8/49.2 | 81.2 | $0.48 |

| Model | MMMU (Image Reasoning) | LiveCodeBench (Coding) | MMLU Pro (Knowledge) | Context Window |
| --- | --- | --- | --- | --- |
| Llama 4 Scout | 69.4 | 32.8 | 74.3 | 10M tokens |
| Mistral 3.1 (24B) | 62.8 | N/A | 66.8 | 128K tokens |
| GPT-4o | 64.9 | N/A | 67.5 | 128K tokens |
| DeepSeek V3.1 | 68.0 | 28.9 | 71.6 | 1M tokens |

Trusted by Industry Leaders

See how companies are using Llama models to transform their AI applications.

  • "Our endeavor is to deliver exceptional customer experience at all times. Together AI has been our long-standing partner and with Together Inference Engine 2.0 and Together Turbo models, we have been able to provide high quality, fast, and accurate support that our customers demand at tremendous scale."

    Rinshul Chandra

    COO, Food Delivery, Zomato

  • "Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards. As an AI-forward publication, we look forward to expanding our collaboration with Together AI for larger-scale in-house efforts."

    Vineet Khosla

    CTO, The Washington Post

  • "We've been thoroughly impressed with the Together Enterprise Platform. It has delivered a 2x reduction in latency (time to first token) and cut our costs by approximately a third. These improvements allow us to launch AI-powered features and deliver lightning-fast experiences faster than ever before."

    Caiming Xiong

    VP, Salesforce AI Research

Try Llama Models Now - Free

Experience the performance difference in Together Chat.

Frequently Asked Questions

How do Llama 4 models compare to GPT-4o and other frontier models?

Llama 4 Maverick beats GPT-4o and Gemini 2.0 Flash across key benchmarks including MMMU (73.4 vs 69.1), LiveCodeBench (43.4 vs 32.3), and image understanding. It's competitive with the much larger DeepSeek V3.1 on coding and reasoning while using less than half the active parameters.

What makes the Mixture of Experts (MoE) architecture special?

Llama 4 models use alternating dense and MoE layers. For each token, only a fraction of the total parameters is activated (17B of 400B for Maverick), dramatically improving inference efficiency while maintaining quality. This enables single-H100 deployment for Scout with Int4 quantization and an exceptional performance-to-cost ratio.
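The routing idea can be illustrated with a toy top-k gate (a simplified sketch, not Llama 4's actual router; four toy experts stand in for Maverick's 128):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, k=1):
    """Pick the top-k experts for one token and renormalize their weights.

    Only the chosen experts run a forward pass, so per-token compute
    scales with k rather than with the total number of experts.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

# One token, four toy experts: the gate sends it to expert 1.
print(route_token([0.1, 2.0, -1.0, 0.5], k=1))  # prints [(1, 1.0)]
```

This mirrors why a 400B-total model can serve tokens at roughly the cost of a 17B dense model: only the selected experts (plus the shared layers) do any work per token.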

How does the 10M context window work in practice?

Llama 4 Scout's iRoPE architecture with interleaved attention layers enables true 10M token processing. This means you can process entire codebases, multiple research papers, or extensive user histories in a single request.
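In practice that means packing many files into a single prompt. A hypothetical sketch with two inlined placeholder files (a real run would read files from disk and send the message via `client.chat.completions.create`, as in the snippet at the top of this page):

```python
# Placeholder "codebase" inlined for illustration; real code would be
# read from disk, e.g. via pathlib.Path.rglob.
files = {
    "app/main.py": "from app.db import connect\n\ndef run():\n    return connect()\n",
    "app/db.py": "def connect():\n    return 'ok'\n",
}

def pack_codebase(files: dict) -> str:
    """Join source files into one prompt string, tagging each chunk with its path."""
    return "\n\n".join(
        f"# file: {path}\n{text}" for path, text in sorted(files.items())
    )

# A single long-context request: the whole codebase rides in one message.
message = {
    "role": "user",
    "content": "Summarize the architecture of this codebase:\n\n" + pack_codebase(files),
}
print(len(message["content"]), "characters in one request")
```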

What is native multimodality and why does it matter?

Unlike models that bolt on vision capabilities later, Llama 4 uses early fusion to jointly pre-train text and vision tokens in a unified backbone. This enables superior image understanding, grounding, and the ability to reason across multiple images simultaneously. Models were pre-trained on up to 48 images.
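The multimodal interface follows the OpenAI-style message format, where `content` is a list mixing text and image parts. A minimal sketch (the image URL and question are placeholders; pass the list as `messages` to `client.chat.completions.create` with a Llama 4 model):

```python
# OpenAI-style multimodal message: text and image parts in one user turn.
# The URL below is a placeholder, not a real asset.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What architecture does this diagram show?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/system-diagram.png"},
            },
        ],
    }
]
print(messages[0]["content"][0]["text"])
```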

How does Llama 4 Behemoth compare to other frontier models?

Llama 4 Behemoth (288B active parameters, ~2T total) outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks like MATH-500 and GPQA Diamond. It serves as the teacher model for distilling Scout and Maverick, enabling their exceptional performance at much smaller sizes.

What deployment options are available for Llama 4 models?

Llama 4 Scout fits on a single H100 GPU with quantization, while Maverick fits on a single H100 host or can use distributed inference. Both models support serverless endpoints, dedicated deployments, VPC hosting, and on-premise deployment. You maintain full model ownership and can migrate freely between providers.
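Because the endpoints speak the OpenAI wire format, migrating an existing OpenAI-SDK integration is mostly a configuration change. A sketch of the two values that change (the model id is the Maverick id used earlier on this page; the commented-out call assumes the OpenAI Python SDK and a `TOGETHER_API_KEY` environment variable):

```python
# Point an OpenAI-compatible client at Together's endpoint.
BASE_URL = "https://api.together.xyz/v1"
MODEL = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"

# With the OpenAI Python SDK, the switch looks like:
#   import os
#   from openai import OpenAI
#   client = OpenAI(base_url=BASE_URL, api_key=os.environ["TOGETHER_API_KEY"])
#   client.chat.completions.create(model=MODEL, messages=[...])
print(BASE_URL, MODEL)
```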