Together AI launches Llama 3.2 APIs for vision, lightweight models & Llama Stack: powering rapid development of multimodal agentic apps

Today marks a major milestone in open source AI with the launch of Llama 3.2 vision and lightweight models, and the release of Llama Stack. We are thrilled to partner with Meta to integrate these models, and to be one of Llama Stack’s first API providers.
Here’s the TLDR on what we’re launching today:
- Free Llama 3.2 11B Vision Model - Developers can now use Llama 3.2's 11B vision model for free through our Llama-Vision-Free endpoint. It’s a powerful way to experiment with multimodal AI without any upfront cost 🎉
- Vision Models (11B, 90B): Together Turbo endpoints for Llama 3.2 vision models provide exceptional speed and accuracy, optimized for tasks like image captioning, visual question answering, and image-text retrieval. Perfect for high-demand production applications with scalable, enterprise-ready performance.
- Lightweight Model (3B): Designed for faster inference with reduced resource consumption, the Together Turbo endpoint for the 3B model is ideal for applications requiring high performance at lower cost, without sacrificing speed or accuracy.
- New Llama Stack APIs on Together AI - Together AI is one of the first API providers for Llama Stack, which standardizes the components required for building agentic, retrieval-augmented generation (RAG), and conversational applications. We encourage you to explore the Llama Stack repo on GitHub and integrate Meta’s example apps using Together AI’s APIs to accelerate your AI development.
- Napkins.dev demo app - We’re excited to showcase Napkins.dev, an open source demo app that uses Llama 3.2 vision models to generate code from wireframes, sketches, or screenshots. This tool demonstrates how quickly and easily Llama 3.2 can be used to bring app ideas from concept to code.
{{custom-cta-1}}
Both long-established technology companies and startups use Llama on Together AI:
“At Mozilla, we appreciate how quickly we were able to get up and running with Together AI and Llama. Together AI’s inference engine’s performance is much faster than other Llama providers’, the OpenAI-compatible integration is straightforward and well-documented, and the cost is very reasonable. As Mozilla has been committed to open source since its inception, it's particularly exciting for us to build with companies and models that share our commitment to open innovation and research.” - Javaun Moradi, Sr. Manager, Innovation Studio at Mozilla
“Millions of software engineers use Blackbox's coding agents to transform the way they build and ship products today. We've been working with Together AI for the past 6 months and using Llama for synthetic data generation. Together AI's product and infrastructure is world class and the support from the team is exceptional!” - Robert Rizk, Co-Founder and COO, Blackbox
The Together Enterprise Platform offers infrastructure control, data privacy, and model ownership, empowering businesses with the most stringent requirements to deploy Llama models in Together Cloud, a VPC, or on-prem with confidence.
🦙 Explore the Full Range of Llama 3.2 Models
Llama 3.2 offers a versatile range of models designed for both multimodal image-and-text processing and lightweight applications. Llama 3.2 models support a 128K context length, images up to 1120x1120 pixels, and multiple languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Whether you’re looking to experiment with the free 11B model or need enterprise-grade performance with the 90B model, our endpoints provide the tools to build AI applications that meet your needs.
🆓 Llama-Vision-Free (image + text)
Free through the end of the year, this high-quality 11B model endpoint is ideal for development, experimentation, and personal non-commercial applications. It provides a free and easy way for developers to explore multimodal AI capabilities. Try Llama-Vision-Free now →
👁️ Llama-3.2-11B-Vision-Instruct-Turbo (image + text)
Optimized for multimodal use, the 11B model endpoint strikes a balance between performance and cost, making it a great fit for production applications such as image captioning and visual search. Try Llama-3.2-11B-Vision-Instruct-Turbo now →
🔍 Llama-3.2-90B-Vision-Instruct-Turbo (image + text)
The most accurate and reliable option, the 90B model endpoint is designed for high-stakes enterprise use cases, delivering superior performance in precision-demanding tasks like healthcare imaging, legal document analysis, and financial reporting. Try Llama-3.2-90B-Vision-Instruct-Turbo now →
⚡ Llama-3.2-3B-Instruct-Turbo (text only)
A versatile model endpoint ideal for agentic applications, offering the speed and efficiency needed for real-time AI agents while being lightweight enough for certain edge or mobile environments when required. Try Llama-3.2-3B-Instruct-Turbo now →
Try any of these models on our playground now, or contact our team to discuss your enterprise deployment needs.
👓 Advancing Multimodal AI in the Enterprise, Atop an Open Source Foundation
The Llama 3.2 vision models (11B and 90B parameters) offer powerful multimodal capabilities for image and text processing. When paired with the Together Platform – including the new enterprise capabilities we announced last week – the combination has the potential to unlock powerful real-world use cases like:
Multimodal Use Cases
- Interactive Agents: Build AI agents that respond to both text and image inputs, providing a richer user experience.
- Image Captioning: Generate high-quality image descriptions for e-commerce, content creation, and digital accessibility.
- Visual Search: Allow users to search via images, enhancing search efficiency in e-commerce and retail.
- Document Intelligence: Analyze documents with both text and visuals, such as legal contracts and financial reports.
Industry-Specific Applications
Together AI’s Llama 3.2 endpoints unlock new opportunities across industries:
- Healthcare: Accelerate medical image analysis, improving diagnostic accuracy and patient care.
- Retail & E-Commerce: Revolutionize shopping experiences with image and text-based searches and personalized recommendations.
- Finance & Legal: Speed up workflows by analyzing graphical and textual content, optimizing contract reviews and audits.
- Education & Training: Create interactive educational tools that process both text and visuals, enhancing engagement.
Example Multimodal Prompts
{{carousel}}
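If you prefer to start from code, here is a minimal sketch of one such multimodal prompt using our OpenAI-compatible chat format. The image URL is a placeholder, and it assumes the together Python package is installed and TOGETHER_API_KEY is set in your environment:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Ask the 11B vision model to caption a product photo (placeholder URL)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write a one-sentence caption for this product photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/product.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```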
🤖 Building agentic systems with Llama Stack and Together AI
The Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market.
We’re excited to announce that Together AI is one of the first Llama Stack API providers.
Using Llama Stack with Together AI as your API provider enables the rapid creation of agentic systems and conversational apps that utilize retrieval-augmented generation (RAG). Together AI’s Llama Stack distribution is coming soon with these endpoints:
- Llama Stack Inference API (Llama 3.1 + 3.2 with Together AI)
- Llama Stack Safety APIs (LlamaGuard 3.1 + 3.2 with Together AI)
- Llama Stack Memory APIs (integration with a vector database)
- Llama Stack Agent API (using the three APIs above)
Meta has examples in the Llama Stack Apps repo. We can’t wait to see what you build!
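To give a flavor of the developer experience, here is a hedged sketch using the llama-stack-client Python package against a running Llama Stack distribution configured with Together AI as its inference provider. The base URL, model identifier, and exact method names are assumptions and may differ across Llama Stack versions:

```python
from llama_stack_client import LlamaStackClient

# Point the client at a running Llama Stack distribution
# (assumed to be listening locally on port 5000)
client = LlamaStackClient(base_url="http://localhost:5000")

# Call the Llama Stack Inference API; the model identifier
# shown here is illustrative and may vary by distribution
response = client.inference.chat_completion(
    model="Llama3.2-11B-Vision-Instruct",
    messages=[{"role": "user", "content": "Outline a RAG pipeline in three steps."}],
)
print(response)
```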
👨‍💻 A New Example App: Generate Code from Sketches, Wireframes, and Screenshots with napkins.dev
napkins.dev is an open source example app that uses the Llama 3.2 vision models to generate code from images 🤯
napkins.dev takes a sketch, wireframe, or screenshot as input and generates React code for it using the Llama 3.2 90B vision model.
Try it for free at https://www.napkins.dev, or check out the GitHub repo to see how it works or to run your own version.
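Under the hood, the core pattern is simple: send the screenshot alongside a code-generation instruction to the 90B vision model. Here is a simplified sketch of that pattern; the prompt wording and file path are illustrative, not napkins.dev's actual implementation:

```python
import base64

from together import Together

client = Together()

# Encode a local screenshot as a data URL (file path is illustrative)
with open("wireframe.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate a single-file React component that reproduces this wireframe."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```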

🏁 Get Started with Llama 3.2 and Llama Stack on Together AI
At Together AI, we believe open source LLMs are the practical choice of foundation model, especially for enterprises and startups where infrastructure control, model ownership, and cost savings are critical. With Llama on Together AI, businesses can own and customize their models while maintaining high performance, without the data privacy concerns or high costs inherent in closed platforms. Enterprises will soon be able to fine-tune Llama 3.2 vision models on Together AI, further customizing them for specific tasks while benefiting from the model ownership and portability that open source models such as Llama provide.
We invite you to explore Llama 3.2 on our playground:
- Llama-Vision-Free (image + text)
- Llama-3.2-11B-Vision-Instruct-Turbo (image + text)
- Llama-3.2-90B-Vision-Instruct-Turbo (image + text)
- Llama-3.2-3B-Instruct-Turbo (text only)
Or use our Python SDK to quickly integrate Llama models into your applications:
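The snippet below is a minimal sketch: it assumes the together package is installed (pip install together), a TOGETHER_API_KEY environment variable is set, and uses the 3B Turbo endpoint listed above:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Fast, lightweight text generation with the 3B Turbo endpoint
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize Llama 3.2 in two sentences."}],
)
print(response.choices[0].message.content)
```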
With Llama 3.2 and Together AI as an API provider for Llama Stack, it’s never been easier to build, fine-tune, and scale multimodal AI applications tailored to your use case. Contact us to discuss your enterprise AI needs.
Try Llama 3.2 today
Multimodal capabilities for image and text processing, free with the Llama-Vision-Free model.