Pre-Training

Salesforce, Zoom, InVideo Train Faster with Together AI Turbocharged with NVIDIA Blackwell

April 24, 2025

By Together AI

At NVIDIA GTC, we shared our bold plans to scale out thousands of NVIDIA Blackwell GPUs as an NVIDIA Cloud Partner, launch self-serve Instant GPU Clusters, and offer seamless deployment of NVIDIA NIM microservices from build.nvidia.com.

As we’ve brought NVIDIA Blackwell systems online over these past several weeks, we’ve invited a number of pioneering AI companies to take a free test drive of NVIDIA Blackwell on Together AI. Today, we’re excited to share what Zoom, Salesforce, and InVideo discovered when they tried NVIDIA Blackwell infrastructure, turbocharged by the Together Training Stack and Kernel Collection.

The Results

In collaboration with NVIDIA, we have worked hand in hand with customers, helping them step up to NVIDIA HGX™ B200 to accelerate both training and inference workloads. Below, we highlight some of the results across AI-native companies and tech-forward enterprise customers. This work would not have been possible without Together AI’s expertise in AI research and systems optimization, combined with NVIDIA’s cutting-edge accelerated computing platform.

Salesforce: Bringing Agentforce to life

Salesforce leverages Together AI across the entire AI journey: from training to fine-tuning to inference of the models that deliver Agentforce. Salesforce Research is at the cutting edge of innovation in building agentic frameworks and was keen to experiment with NVIDIA Blackwell GPUs to accelerate its training pipelines.

Training a Mistral 24B model, Salesforce saw a 2x improvement in training speed when upgrading from NVIDIA HGX H200 to HGX B200. This enables Salesforce to rapidly train a variety of models and accelerate the integration of research results into Agentforce, enhancing product velocity.


Zoom: Accelerating the amazing Zoom AI Companion

1.2 million people use Zoom AI Companion, which features AI-powered tools such as real-time transcription, meeting summaries, and phone call analysis. Zoom has partnered with Together AI to leverage our research and deliver accelerated performance when training the models powering various AI Companion features. Recently, they took it a step further by trying out Together GPU Clusters accelerated by NVIDIA HGX B200.

Out of the box, Zoom saw a 1.9x improvement in training speed over previous-generation NVIDIA Hopper GPUs. The teams look forward to taking it a step further by profiling for additional optimizations.

InVideo: Bringing ideas to life through video

InVideo has generated millions of videos, helping its users tell stories like never before through its generative video foundation model. Considering some of the intricacies around current software stack support on NVIDIA Blackwell, the team was initially uncertain that they would see the gains needed to take the leap onto the new architecture.

However, during initial tests on NVIDIA HGX B200, InVideo immediately saw a 25% improvement over NVIDIA HGX H200 when running a training job. Then, in partnership with our researchers, the team made further optimizations and more than doubled this improvement, making the step up to the NVIDIA Blackwell platform even more appealing. Performance gains of this magnitude are largely unheard of for modalities beyond text at this time and speak to the expertise of the teams involved. We share some of those optimizations later in this blog.

The Together Training Stack

The Together AI research team has custom-built a training container that gives developers the best representation of the hardware’s capabilities and potential. This container features a co-optimized Llama 3 70B golden model, achieving state-of-the-art (SOTA) Model FLOPS Utilization (MFU).

The stack includes Together AI researchers’ tools for debugging and running diagnostics at scale across many nodes and thousands of processes. These tools deliver:

  • Comprehensive MFU benchmarks at various levels (e.g., GEMM, ThunderKittens (TK)-based attention kernels)
  • Full bandwidth benchmarking toolkits
  • Collective communication diagnostics toolkits for performance analysis and debugging
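As a rough illustration of how an MFU figure is derived from such benchmarks, the sketch below uses the common 6 × parameters × tokens approximation for dense-transformer training FLOPs; the model size, throughput, and per-GPU peak-FLOPS numbers are purely illustrative assumptions, not measured results or official specs.

```python
def training_mfu(num_params: float, tokens_per_sec: float,
                 peak_flops_per_gpu: float, num_gpus: int) -> float:
    """Approximate Model FLOPS Utilization (MFU) for dense-transformer training.

    Uses the standard ~6 * N FLOPs-per-token estimate (forward + backward).
    """
    achieved = 6.0 * num_params * tokens_per_sec   # FLOPs/s actually sustained
    peak = peak_flops_per_gpu * num_gpus           # theoretical cluster peak
    return achieved / peak

# Illustrative numbers only: a 70B-parameter model on one 8-GPU node,
# assuming a hypothetical 2.25e15 peak FLOPs per GPU.
mfu = training_mfu(num_params=70e9, tokens_per_sec=18_000,
                   peak_flops_per_gpu=2.25e15, num_gpus=8)
print(f"MFU = {mfu:.1%}")  # prints "MFU = 42.0%"
```

Benchmarking at each level of the stack (GEMM, attention kernels, end-to-end) shows where achieved FLOPs are lost relative to this ceiling.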

Getting the most out of the hardware 

Price-performance is widely considered the most important metric for GPU cloud infrastructure. Together AI specializes in delivering higher tokens per second per node and higher overall MFU than other providers on the same hardware. This section covers some of the optimizations we’ve found on the NVIDIA Blackwell platform.

FlashAttention

FlashAttention-3 is a key optimization that speeds up attention computation in LLM training and inference. It requires specific memory access patterns and Tensor Core optimizations that are now supported for the Blackwell architecture in the latest version of NVIDIA cuDNN.

This support includes FP8 FlashAttention optimized for Blackwell, using Blackwell’s FP8 precision and decompression engines for 4x higher throughput vs. FP16 on H100. By fusing several training operations into single kernels, we significantly reduce training bottlenecks and gain performance advantages. cuDNN’s FP8 FlashAttention matches FA3’s FP16 performance while using 50% less memory.

Computation / Graph Optimization

Another key optimization we leverage is enabling torch.compile for graph-level performance improvements: it compiles PyTorch models into optimized NVIDIA CUDA graphs, reducing Python overhead and kernel-launch latency. We work closely with customers, examining their end-to-end profiles, identifying critical segments, and adjusting the model to make the best use of torch.compile.

Parallelism Optimization

By tuning Distributed Data Parallel (DDP) settings and overlapping device-to-device (D2D) copies with computation, we’re able to overlap gradient synchronization with the backward pass. We leverage CUDA streams to overlap D2D transfers (e.g., GPU-to-GPU sharding) with compute. Combined with an overall reduction in communication overhead and logging costs, this improves end-to-end system throughput.
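A minimal sketch of the DDP knobs involved is below. The bucket size and model are illustrative values, and the snippet runs single-process on CPU with the gloo backend purely so it is self-contained; real training uses one process per GPU with the nccl backend.

```python
import os
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group on CPU (gloo) just to make the sketch runnable.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

# bucket_cap_mb controls gradient-bucket size: smaller buckets launch
# all-reduce earlier, overlapping communication with the backward pass;
# gradient_as_bucket_view avoids an extra gradient copy.
ddp_model = DDP(model, bucket_cap_mb=16, gradient_as_bucket_view=True)

loss = ddp_model(torch.randn(8, 64)).sum()
loss.backward()  # gradient all-reduce is issued bucket-by-bucket here
dist.destroy_process_group()
```

Tuning the bucket size trades fewer, larger collectives against earlier overlap; the right value depends on the model's gradient sizes and the interconnect.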

Get started with Together GPU Clusters accelerated by the NVIDIA Blackwell platform

Together AI was recently recognized as a ClusterMAX™ Gold provider by SemiAnalysis, a leading independent research and analysis company specializing in the Semiconductor and AI industries.

Outside of strong GPU price-performance, Together AI shines in its overall GPU Cluster offering:

  • Infrastructure and Security
  • Technical Expertise and Support
    • Deep research expertise on GPU performance
    • Strong technical collaboration with NVIDIA
  • Business Model
    • Flexible Consumption Models, including new self-service Instant GPU Clusters
    • GPU Availability across current and next-gen hardware needs 

If you are interested in a free test drive of Together GPU Clusters accelerated by the NVIDIA Blackwell platform, please contact us. And if you’d like to try our new Instant GPU Clusters, with self-service provisioning, please request access at together.ai/instant.

Request Access to Together GPU Clusters accelerated by NVIDIA Blackwell GPUs

Top-Tier NVIDIA hardware: NVIDIA GB200 NVL72 and HGX B200
