GPU Clusters

Published 3/10/2026

New in Together GPU Clusters: Autoscaling, observability, and self-healing

Autoscaling, access control, full-stack observability, and self-healing operations — built in

AI infrastructure has quietly become production infrastructure. Teams are no longer experimenting with a handful of GPUs. A single-node prototype can quickly evolve into a distributed training workload spanning hundreds of accelerators. Inference systems serving real users can experience unpredictable traffic spikes. And as clusters become shared environments, the operational bar changes for everyone — from ML researchers to enterprise platform engineers.

But as this infrastructure scales, manual management becomes a liability. Static provisioning is inefficient and expensive. Permission management turns brittle. Observability gaps obscure performance bottlenecks, and when GPU hardware fails — as it inevitably does — a single unstable node can derail hours of training time.

Today, we’re introducing major enterprise enhancements to Together GPU Clusters (formerly Instant Clusters). We are integrating autoscaling, Role-Based Access Control (RBAC), full-stack observability, self-serve node repair, and active health checks directly into the core cluster experience — giving teams the elasticity of a virtualized stack with the performance profile of bare metal.

Autoscaling: Elastic capacity without overprovisioning

Instead of statically allocating GPU capacity for peak load, you enable autoscaling and let the cluster expand or contract based on real-time resource needs.

Autoscaling is powered by the Kubernetes Cluster Autoscaler, which monitors for GPU-constrained workloads (pending pods) from distributed training or bursty inference traffic. When demand spikes, additional nodes are automatically brought online. When demand subsides, capacity scales down.

The outcome is straightforward: you maintain performance under load without paying for idle GPU nodes. This makes Together GPU Clusters well suited for both long-running training jobs and variable inference workloads. To learn more about this feature, visit our documentation.
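
For a sense of the signal the autoscaler reacts to, here is a minimal sketch using the official Kubernetes Python client to list pods that are stuck Pending while requesting GPUs, which is exactly the condition that triggers a scale-up. It assumes a reachable kubeconfig; `nvidia.com/gpu` is the standard resource name exposed by the NVIDIA device plugin.

```python
# Minimal sketch: find Pending pods that request GPUs, the condition
# that drives a Cluster Autoscaler scale-up. Assumes a local kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    for container in pod.spec.containers:
        requests_ = container.resources.requests if container.resources else None
        if requests_ and requests_.get("nvidia.com/gpu"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"is Pending and requests {requests_['nvidia.com/gpu']} GPU(s)")
```

When pods like these cannot be satisfied by any existing node, the autoscaler provisions additional GPU nodes; once capacity sits idle, it is scaled back down.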

Active health checks, deeper acceptance testing, and self-serve node repair: Reduce MTTR for failures

Hardware instability is not a hypothetical risk in large GPU fleets — it is an operational reality. For distributed training workloads, a single node failure can invalidate an entire job run.

Together GPU Clusters now includes self-serve active health checks. Before spinning up a massive training job, users can trigger tests ranging from basic DCGM Diag 3 to multi-node NCCL or InfiniBand write bandwidth tests directly from the UI, receiving pass/fail results with detailed outputs.

Running these deep checks before spinning up a large training job preserves workload continuity and reduces wasted compute cycles.
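
The UI runs these checks for you; as a rough manual equivalent, the basic diagnostic can be reproduced on a node with NVIDIA's `dcgmi` CLI. A minimal sketch, assuming DCGM is installed and its host engine is running on the node:

```python
# Rough manual equivalent of the basic health check: run a DCGM level-3
# diagnostic and report pass/fail. Assumes the dcgmi CLI (part of NVIDIA
# DCGM) is installed and the host engine is running on this node.
import subprocess

result = subprocess.run(
    ["dcgmi", "diag", "-r", "3"],  # -r 3 = the long diagnostic level
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode == 0:
    print("Node passed the DCGM level-3 diagnostic")
else:
    print("Node FAILED: consider repairing before scheduling training jobs")
```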

If a node fails, users can execute a self-repair in three clicks. The control plane will automatically cordon, drain, and recreate the node on a new or existing host, bringing the cluster back to a healthy state within minutes. Acceptance tests now run automatically during provisioning, and clusters are not marked Ready until they pass. See the complete list of acceptance tests in our documentation.
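
The control plane automates the repair end to end. For intuition, the cordon and drain steps look roughly like the following sketch with the Kubernetes Python client; the node name is a placeholder, and the final step of recreating the node on a healthy host is handled by the platform rather than shown here.

```python
# Sketch of the cordon + drain steps the control plane automates during
# self-repair. NODE is a hypothetical unhealthy node; recreating it on a
# new host (the final step) is handled by the platform and not shown.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
NODE = "gpu-node-7"  # placeholder node name

# Cordon: mark the node unschedulable so no new pods land on it.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# Drain: evict every evictable pod still running on the node.
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(
            name=pod.metadata.name, namespace=pod.metadata.namespace))
    v1.create_namespaced_pod_eviction(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=eviction)
```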

Role-Based Access Control: Structured multi-team governance

As clusters move from experimentation to shared infrastructure, access control becomes foundational. In Together GPU Clusters, "Projects" now define the collaboration and isolation boundaries for teams, with clusters and storage volumes strictly scoped to each project.

Administrators can enforce structured access controls aligned with enterprise governance. By default, projects include two roles:

  1. Admin: Full read/write access to the control plane (create/delete clusters) and sudo access for the Slurm cluster.
  2. Member: Write access to the data plane (access to GPU worker nodes and running workloads).

This clean split allows platform engineers to lock down infrastructure provisioning while giving research and application teams the freedom to run workloads safely within their boundaries. 

You can manage your project membership and user roles from within the cloud console by navigating to Settings > GPU Cluster Projects.

To learn more about this feature, visit our documentation.

Full-stack observability (private preview)

Every Together GPU Cluster project now includes a dedicated Grafana instance with pre-built dashboards, accessible directly from the cluster details page.

Telemetry spans the full stack:

  • GPU utilization: DCGM metrics provide direct insight into accelerator health and performance.
  • Networking: InfiniBand and NIC-level telemetry expose throughput and bandwidth patterns.
  • Storage & orchestration: I/O performance metrics surface hidden bottlenecks, while Kubernetes telemetry provides visibility into orchestration health and resource allocation.

Telemetry is available as soon as the cluster is provisioned. For platform teams, this accelerates debugging and performance tuning. For finance and operations teams, it improves capacity planning and cost efficiency.
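
Dashboards are the primary interface, but the same telemetry can also be pulled programmatically. A minimal sketch, assuming the project's Grafana is backed by a Prometheus-compatible endpoint (the URL below is a placeholder) that exposes standard DCGM exporter metrics such as `DCGM_FI_DEV_GPU_UTIL`:

```python
# Sketch: query per-GPU utilization from a Prometheus-compatible endpoint.
# PROM_URL is a placeholder; DCGM_FI_DEV_GPU_UTIL is the standard GPU
# utilization metric exported by the NVIDIA DCGM exporter.
import requests

PROM_URL = "https://prometheus.example.internal"  # hypothetical endpoint
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL)"},
    timeout=10,
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]
    value = float(sample["value"][1])
    print(f"host={labels.get('Hostname', '?')} gpu={labels.get('gpu', '?')} "
          f"util={value:.0f}%")
```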

This feature is in private preview; please contact support or your account team to get access to your Grafana instance.

Move from experimental to operational

With autoscaling, RBAC, observability, and turnkey health checks and remediation integrated into the platform, Together GPU Clusters moves beyond raw GPU provisioning into production-ready, fully managed infrastructure.

This gives teams the confidence to run large-scale distributed training jobs without worrying that hardware failures will cascade into lost compute time. It also provides tailored value across the organization:

  • Platform engineers can safely support multiple internal stakeholders within shared environments.
  • Operators can pinpoint networking or storage bottlenecks before they degrade model performance.
  • Finance teams can align GPU spend more closely with actual utilization patterns.

Most importantly, organizations can move from experimental AI systems to operational AI platforms — without stitching together third-party tools or building internal control planes from scratch.

Getting started

These capabilities are available today within Together GPU Clusters.

To get started, sign up at Together AI and spin up your cluster.
