Meta

Deploy Llama 4 Maverick and Scout on Together AI. Frontier multimodal performance, 10M token context, and 80%+ cost savings versus GPT-4o.

Why Meta on Together AI?

Designed for production workloads that need consistent performance and operational control.

Open source freedom, enterprise grade

Full model ownership — download the weights, deploy on Together AI’s cloud, or run on-premises. Your data never trains our models and never leaves your control.

Frontier multimodal performance

Llama 4 Maverick beats GPT-4o and Gemini 2.0 Flash on key benchmarks at just $0.27/1M tokens — an 80%+ cost reduction versus closed-source alternatives.

Built for scale, ready for enterprise

SOC 2 Type II certified, HIPAA compliant, with dedicated endpoints, monthly reserved capacity, and up to 40% savings at volume.

Meet the Meta family

Explore top-performing models across text, image, video, code, and voice.

Chat

  • Llama 4 Maverick
  • Llama 4 Scout
  • Llama 3.3 70B Instruct Turbo Free (free)
  • Llama 3.3 70B
  • Llama 3.1 405B
  • Llama 3.1 70B
  • Llama 3.1 8B
  • Llama 3.2 3B Instruct Turbo
  • Llama 3 70B Instruct Reference
  • Llama 3 8B Instruct Lite
  • LLaMA-2
  • LLaMA-2 Chat (13B)
  • LLaMA-2 Chat (7B)
  • NIM Llama 3.3 70B Instruct
  • NIM Llama 3.3 Nemotron Super 49B v1
  • NIM Llama 3.1 Nemotron 70B Instruct
  • NIM Llama 3.1 70B Instruct
  • NIM Llama 3.1 8B Instruct
  • NIM Mistral-NeMo 12B Instruct
  • NIM Mixtral 8x7B Instruct v0.1
  • NIM Mixtral 8x22B Instruct v0.1

Vision

  • NIM Llama 3.2 11B Vision Instruct
  • NIM Llama 3.2 90B Vision Instruct

Moderation

  • Llama Guard 4 12B (new)
  • Llama Guard 3 11B Vision Turbo
  • Llama Guard 3 8B
  • Llama Guard 2 8B
  • Llama Guard (7B)
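The Llama Guard models above classify prompts and responses rather than answer them. Here is a minimal moderation sketch using Together's Python SDK, assuming the TOGETHER_API_KEY environment variable is set; the model ID string and the exact verdict format are assumptions to verify against the model page:

    # pip install together
    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    # Llama Guard returns a verdict, not an answer: "safe", or "unsafe"
    # followed by the violated category codes (e.g. "unsafe\nS2").
    verdict = client.chat.completions.create(
        model="meta-llama/Llama-Guard-4-12B",  # assumed model ID
        messages=[{"role": "user", "content": "Tell me how to hot-wire a car."}],
    )
    print(verdict.choices[0].message.content.strip())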

Breakthrough technical innovations

Explore the architectural advances that make Meta's latest models shine.

  • Mixture of Experts (MoE)

    Llama 4 is Meta's first MoE generation. Maverick activates only 17B of its 400B total parameters per token, routing each token to one of 128 experts plus a shared expert that always runs. Sparse activation cuts inference cost while preserving quality (see the routing sketch after this list).

  • iRoPE Architecture

    Interleaved attention layers without positional embeddings, paired with inference-time temperature scaling of attention, improve length generalization and underpin Scout's 10M-token context window.

  • Native Multimodality with Early Fusion

    Text and vision tokens are fused into a single model backbone from the start, enabling joint pre-training on large volumes of text, image, and video data.

  • MetaP Training

    A technique for reliably setting per-layer hyperparameters such as learning rates and initialization scales, with choices that transfer across batch sizes, model widths, depths, and training token budgets.

  • FP8 Precision Training

    Llama 4 is pre-trained in FP8 mixed precision, raising per-GPU training throughput without sacrificing model quality.

  • Codistillation from Behemoth

    Maverick is codistilled from Llama 4 Behemoth, a roughly 288B active-parameter teacher model, using a distillation loss that dynamically weights soft and hard targets during training.
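To make the expert-routing idea above concrete, here is a toy top-1 routed MoE layer with a shared expert, written in PyTorch. It is an illustrative sketch only, not Meta's implementation; the dimensions, expert count, and feed-forward shape are arbitrary assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoELayer(nn.Module):
        """Toy top-1 routed MoE layer with an always-on shared expert."""
        def __init__(self, d_model=64, n_experts=8):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts, bias=False)
            ffn = lambda: nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                nn.Linear(4 * d_model, d_model))
            self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
            self.shared = ffn()

        def forward(self, x):                         # x: [n_tokens, d_model]
            gate = F.softmax(self.router(x), dim=-1)  # routing probabilities
            weight, idx = gate.max(dim=-1)            # top-1 expert per token
            out = self.shared(x)                      # shared expert sees every token
            for e, expert in enumerate(self.experts):
                sel = idx == e                        # tokens routed to expert e
                if sel.any():                         # only that expert's weights run
                    out[sel] = out[sel] + weight[sel, None] * expert(x[sel])
            return out

    tokens = torch.randn(10, 64)
    print(SparseMoELayer()(tokens).shape)             # torch.Size([10, 64])

Only the selected expert's feed-forward weights execute for each token, which is why a 400B-parameter model can serve traffic at the cost profile of a much smaller dense model.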

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads
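A single serverless request needs nothing but an API key. Below is a minimal sketch using Together's Python SDK; the Llama 4 Maverick model ID string is an assumption to check against the models page:

    # pip install together
    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",  # assumed ID
        messages=[{"role": "user",
                   "content": "Give me three taglines for a hiking app."}],
    )
    print(response.choices[0].message.content)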

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation
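Batch jobs take a JSONL file with one request per line, each tagged with a custom_id so results can be matched back to their source records. Here is a sketch of building such a file; the Scout model ID and the exact JSONL schema are assumptions to verify against the Batch API docs:

    import json

    # One chat-completion request per line; custom_id links each result
    # in the output file back to its input record.
    with open("batch_input.jsonl", "w") as f:
        for i, doc in enumerate(["doc one ...", "doc two ...", "doc three ..."]):
            request = {
                "custom_id": f"doc-{i}",
                "body": {
                    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed ID
                    "messages": [{"role": "user",
                                  "content": f"Classify the topic of: {doc}"}],
                    "max_tokens": 16,
                },
            }
            f.write(json.dumps(request) + "\n")
    # Upload the file via the Batch API, then poll the job until it completes.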

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines