Llama 4 Scout

State-of-the-art 109B-parameter model with 17B active parameters and a large context window, excelling at multi-document analysis, codebase reasoning, and personalized tasks.

About model

Together AI offers day-one support for the new Llama 4 multilingual vision models, which can analyze multiple images and answer queries about them.

Register for a Together AI account to get an API key. New accounts come with free credits to start. Then install the Together AI library for your preferred language.
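Once you have a key, it can help to fail fast when it is missing rather than hit an authentication error mid-request. The helper below is a minimal sketch using only the standard library; the function name `load_api_key` is illustrative and not part of the Together AI library.

```python
import os


def load_api_key(env=None, var_name="TOGETHER_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before creating a client."
        )
    return key
```

The returned value can then be passed to the client, for example `Together(api_key=load_api_key())`.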

Performance benchmarks

Model | AIME 2025 | GPQA Diamond | HLE | LiveCodeBench | MATH500 | SWE-bench verified

51.8%

Related open-source models

Competitor closed-source models

Claude Opus 4.6: 90.5%, 34.2%, 78.7%
OpenAI o3: 83.3%, 24.9%, 99.2%, 62.3%
OpenAI o1: 76.8%, 96.4%, 48.9%
GPT-4o: 49.2%, 2.7%, 32.3%, 89.3%, 31.0%

  • How to use model

    Input

        
          from together import Together
    
          client = Together()  # API key via api_key param or TOGETHER_API_KEY env var
    
          # Query image with Llama 4 Scout model
          response = client.chat.completions.create(
            model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What can you see in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"}}
                ]
            }]
          )
    
          print(response.choices[0].message.content)
        
    

    Output

        
          The image depicts a serene landscape of Yosemite National Park, featuring a river flowing through a valley surrounded by towering cliffs and lush greenery.
    
          *   **River:**
              *   The river is calm and peaceful, with clear water that reflects the surrounding scenery.
              *   It flows gently from the bottom-left corner to the center-right of the image.
              *   The riverbank is lined with rocks and grasses, adding to the natural beauty of the scene.
          *   **Cliffs:**
              *   The cliffs are massive and imposing, rising steeply from the valley floor.
              *   They are composed of light-colored rock, possibly granite, and feature vertical striations.
              *   The cliffs are covered in trees and shrubs, which adds to their rugged charm.
          *   **Trees and Vegetation:**
              *   The valley is densely forested, with tall trees growing along the riverbanks and on the cliffsides.
              *   The trees are a mix of evergreen and deciduous species, with some displaying vibrant green foliage.
              *   Grasses and shrubs grow in the foreground, adding texture and color to the scene.
          *   **Sky:**
              *   The sky is a brilliant blue, with only a few white clouds scattered across it.
              *   The sun appears to be shining from the right side of the image, casting a warm glow over the scene.
    
          In summary, the image presents a breathtaking view of Yosemite National Park, showcasing the natural beauty of the valley and its surroundings. The calm river, towering cliffs, and lush vegetation all contribute to a sense of serenity and wonder.
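The example above passes a public image URL. For a local file, one common approach with OpenAI-compatible vision endpoints is to embed the image as a base64 data URL in the same `image_url` field. The sketch below assumes this endpoint accepts data URLs, which should be verified against the current API documentation; `image_part_from_bytes` is a hypothetical helper name.

```python
import base64


def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Build an image_url content part that embeds the image as a base64 data URL.

    Assumption: the endpoint accepts data URLs wherever it accepts https URLs.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }
```

The returned dict can take the place of the `image_url` entry in the `content` list, with `data` read from disk via `open(path, "rb").read()`.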
    
        
    

    Function Calling

    Input

        
          import os
          import json
          import openai
    
          client = openai.OpenAI(
              base_url="https://api.together.xyz/v1",
              api_key=os.environ["TOGETHER_API_KEY"],
          )
    
          tools = [
            {
              "type": "function",
              "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                  "type": "object",
                  "properties": {
                    "location": {
                      "type": "string",
                      "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                      "type": "string",
                      "enum": [
                        "celsius",
                        "fahrenheit"
                      ]
                    }
                  }
                }
              }
            }
          ]
    
          messages = [
              {"role": "system", "content": "You are a helpful assistant that can access external functions. The responses from these function calls will be appended to this dialogue. Please provide responses based on the information from these function calls."},
              {"role": "user", "content": "What is the current temperature of New York, San Francisco and Chicago?"}
          ]
              
          response = client.chat.completions.create(
              model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
              messages=messages,
              tools=tools,
              tool_choice="auto",
          )
    
          print(json.dumps(response.choices[0].message.model_dump()['tool_calls'], indent=2))
        
    

    Output

        
          [
            {
              "id": "call_1p75qwks0etzfy1g6noxvsgs",
              "function": {
                "arguments": "{\"location\":\"New York, NY\",\"unit\":\"fahrenheit\"}",
                "name": "get_current_weather"
              },
              "type": "function"
            },
            {
              "id": "call_aqjfgn65d0c280fjd3pbzpc6",
              "function": {
                "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"fahrenheit\"}",
                "name": "get_current_weather"
              },
              "type": "function"
            },
            {
              "id": "call_rsg8muko8hymb4brkycu3dm5",
              "function": {
                "arguments": "{\"location\":\"Chicago, IL\",\"unit\":\"fahrenheit\"}",
                "name": "get_current_weather"
              },
              "type": "function"
            }
          ]
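The model only proposes the calls; your code still has to run them and send the results back. The sketch below shows one way to do that, following the OpenAI-compatible convention of replying with one `tool` message per `tool_call_id`. The `get_current_weather` body is a hypothetical stand-in that returns no real data, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names.

```python
import json


def get_current_weather(location, unit="fahrenheit"):
    """Hypothetical stand-in: a real implementation would query a weather service."""
    return {"location": location, "unit": unit, "temperature": None}


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return one `tool` message per call."""
    tool_messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        # `arguments` arrives as a JSON-encoded string, not a dict.
        arguments = json.loads(call["function"]["arguments"])
        result = registry[name](**arguments)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return tool_messages
```

To continue the conversation, append the assistant message that contains the tool calls to `messages`, extend it with the output of `run_tool_calls(...)`, and call `client.chat.completions.create` again so the model can compose a final answer.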
        
    

    Query models with multiple images

    This model currently accepts up to 5 images in a single input.

    Input

        
          # Multi-modal message with multiple images (reuses the `client` created in an earlier example)
          response = client.chat.completions.create(
              model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
              messages=[{
                  "role": "user",
                  "content": [
                      {
                          "type": "text",
                          "text": "Compare these two images."
                      },
                      {
                          "type": "image_url",
                          "image_url": {
                              "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
                          }
                      },
                      {
                          "type": "image_url",
                          "image_url": {
                              "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png"
                          }
                      }
                  ]
              }]
          )
          print(response.choices[0].message.content)
    
        
    

    Output

        
          The first image is a collage of multiple identical landscape photos showing a natural scene with rocks, trees, and a stream under a blue sky. The second image is a screenshot of a mobile app interface, specifically the navigation menu of the Canva app, which includes icons for Home, DMs (Direct Messages), Activity, Later, Canvases, and More.
    
          ### Comparison:
          1. **Content**:
            - The first image focuses on a natural landscape.
            - The second image shows a digital interface from an app.
    
          2. **Purpose**:
            - The first image could be used for showcasing nature, design elements in graphic work, or as a background.
            - The second image represents the functionality and layout of the Canva app's navigation system.
    
          3. **Visual Style**:
            - The first image has vibrant colors and realistic textures typical of outdoor photography.
            - The second image uses flat design icons with a simple color palette suited for user interface design.
    
          4. **Context**:
            - The first image is likely intended for artistic or environmental contexts.
            - The second image is relevant to digital design and app usability discussions.
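When the image list is built dynamically, it may be worth checking the 5-image limit mentioned earlier before sending the request. The helper below is a sketch of that check; `build_multi_image_message` and `MAX_IMAGES` are illustrative names, and the limit value is taken from this page rather than queried from the API.

```python
MAX_IMAGES = 5  # limit stated on this page; may change over time


def build_multi_image_message(prompt, image_urls, max_images=MAX_IMAGES):
    """Return a single user message with one text part and N image parts."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    if len(image_urls) > max_images:
        raise ValueError(
            f"got {len(image_urls)} images, but at most {max_images} are supported"
        )
    content = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return {"role": "user", "content": content}
```

The result can be passed as `messages=[build_multi_image_message("Compare these two images.", urls)]`.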
        
    
  • Model card

    • Model String: meta-llama/Llama-4-Scout-17B-16E-Instruct
    • Specs:
      • 17B active parameters (109B total)
      • 16-expert MoE architecture
      • 327,680-token context length (will be increased to 10M)
      • Support for 12 languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese
      • Multimodal capabilities (text + images)
      • Support for function calling
    • Best for: Multi-document analysis, codebase reasoning, and personalized tasks
    • Knowledge Cutoff: August 2024
  • Applications & use cases

    • Multi-document summarization for legal/financial analysis: Analyze multiple legal contracts or financial statements simultaneously, identifying key terms, inconsistencies, and patterns across documents to generate comprehensive summaries and risk assessments.
    • Personalized task automation using years of user data: Create tailored automation workflows by analyzing an individual's historical data patterns, communication style, and preferences, enabling highly personalized digital assistants that adapt to specific user needs.
    • Efficient image parsing for multimodal applications: Process and understand image content in conjunction with text to power applications like visual search, content moderation, and accessibility features that require understanding the relationship between visual and textual elements.
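For the multi-document use case, one simple pattern is to concatenate the documents into a single prompt with clear delimiters and check the result against the context window before sending it. The sketch below assumes the 327,680-token context length from the model card and a rough 4-characters-per-token heuristic, which is an approximation and not the model's tokenizer. All function names are illustrative.

```python
CONTEXT_LENGTH = 327_680  # context length listed in the model card
CHARS_PER_TOKEN = 4       # rough heuristic (assumption), not the real tokenizer


def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def build_multidoc_messages(documents, question, reserve_for_output=4_096):
    """Pack titled documents and a question into one user message.

    `documents` maps a title to its text. Raises ValueError if the estimated
    prompt size would not leave `reserve_for_output` tokens for the reply.
    """
    sections = [
        f"### Document: {title}\n{body.strip()}"
        for title, body in documents.items()
    ]
    prompt = "\n\n".join(sections) + f"\n\n### Task\n{question}"
    budget = CONTEXT_LENGTH - reserve_for_output
    needed = estimate_tokens(prompt)
    if needed > budget:
        raise ValueError(f"prompt needs ~{needed} tokens, budget is {budget}")
    return [{"role": "user", "content": prompt}]
```

For production use, replacing the heuristic with a count from the model's actual tokenizer would give a tighter bound.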
  • Model provider
    Meta
  • Type
    Chat
    Vision
  • Main use cases
    Chat
    Function Calling
    Vision
  • Features
    Function Calling
  • Fine tuning
    Supported
  • Deployment
    On-Demand Dedicated
    Monthly Reserved
  • Parameters
    109B
  • Context length
    1M
  • Input price
    $0.18 / 1M tokens
  • Output price
    $0.59 / 1M tokens

  • Input modalities
    Text
    Image
  • Output modalities
    Text
  • Released
    April 2, 2025
  • Last updated
    February 5, 2026
  • Quantization level
    FP16
  • Category
    Chat
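The per-token prices listed above translate into a request cost with simple arithmetic. The sketch below hard-codes the two listed prices as constants; `estimate_cost_usd` is an illustrative name, and actual billing may differ (for example, in how image inputs are counted).

```python
INPUT_PRICE_PER_M = 0.18   # USD per 1M input tokens, from the listing above
OUTPUT_PRICE_PER_M = 0.59  # USD per 1M output tokens, from the listing above


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the USD cost of a request from its token counts."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000
```

For example, a request with 200,000 input tokens and 2,000 output tokens comes to roughly $0.037 under these prices.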