A practitioner's guide to testing and running large GPU clusters for training generative AI models
Training generative AI models requires clusters of expensive cutting-edge hardware: H100 GPUs and fast storage wired together in multi-network topologies involving Infiniband links, switches, transceivers and Ethernet connections. An increasing number of HPC and AI cloud services now offer these specialized clusters, but they demand substantial capital commitments, and not all clusters are created equal.
Due to the complexity of this novel hardware, clusters often contain misassembled, misconfigured, or dead-on-arrival (DoA) components that may inadvertently be passed on to customers. Operating under high thermal loads, these clusters are prone to frequent component failures.
To mitigate the risk of low-performance clusters, we employ a process called 'acceptance testing.' For companies training generative AI models, this is not merely a procedural step. As we push the boundaries of AI capabilities, ensuring that our hardware infrastructure—particularly GPU clusters—meets the highest standards of reliability and performance becomes increasingly critical.
This article outlines the acceptance testing process we've developed at Together AI, which we've successfully implemented across clusters containing thousands of GPUs.
Introduction to GPU Cluster Testing
The reliability of GPU clusters varies dramatically, ranging from minor issues to critical failures. Even industry giants like Meta have reported significant hardware challenges. During a 54-day training run of their Llama 3.1 model, “GPU issues were the largest category, accounting for 58.7% of all unexpected issues”.
At Together AI, we serve many AI startups and Fortune 500 companies on their mission-critical AI infrastructure needs, and similar hardware issues are often what blocks our customers. These challenges prompted us to develop a robust validation framework for assessing and ensuring hardware quality before deployment to our cloud service.
As a result, we've created a systematic approach to acceptance testing, designed to guarantee reliability for our end customers as we expand our globally distributed cloud service. We share this framework with our providers before cluster delivery and employ this repeatable process to verify quality and performance prior to cluster acceptance.
The Process of Testing Clusters at Together AI
The overarching goal of acceptance testing is to guarantee that the hardware infrastructure not only meets the specified requirements but also delivers the reliability and performance necessary for demanding AI/ML workloads. This process aids in optimizing operational efficiency and plays a crucial role in maintaining the trust and satisfaction of customers who rely on these computational resources.
A key concept in Together AI's acceptance testing is the hierarchical structure of the tests. We test hierarchically to ensure the issues can be pinpointed accurately, starting from basic functionality and gradually moving to more complex integrations and performance evaluations.
1. Preparation and Configuration
The initial phase involves configuring the new hardware within the GPU cluster environment. This setup mimics the end-use scenario, allowing for a comprehensive evaluation of the hardware's performance in an operational context. At a high level, we prepare a cluster by:
- Installing NVIDIA drivers
- Installing OFED drivers (for Infiniband)
- Installing CUDA
- Installing NCCL
- Installing HPCX
- Configuring SLURM cluster
- Configuring PCI settings for performance
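As a minimal sketch of a preflight check for the list above, the snippet below looks for one command-line tool per component on the PATH. The component-to-tool mapping is an assumption chosen for illustration (it is not from our tooling), and libraries such as NCCL and HPC-X, which have no single canonical executable, are not covered.

```python
import shutil

# Assumed mapping of component -> a CLI tool that normally ships with it.
EXPECTED_TOOLS = {
    "NVIDIA driver": "nvidia-smi",
    "OFED": "ofed_info",
    "CUDA toolkit": "nvcc",
    "Slurm": "sinfo",
}


def missing_components(tools=EXPECTED_TOOLS, which=shutil.which):
    """Return the sorted names of components whose tool is not on PATH.

    `which` is injectable so the check can be exercised without the tools.
    """
    return sorted(name for name, exe in tools.items() if which(exe) is None)
```

Calling `missing_components()` on a freshly prepared node should return an empty list; anything else points at an installation step to revisit before testing starts.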
Once we have all the dependencies installed, we begin validating the cluster by stress testing and benchmarking every subsystem and component individually. Each phase of testing builds on the previous one, culminating in a reference task tailored to our customer’s use case (e.g., a model build), so we know that the cluster is ready for training.
2. GPU Validation
One of the first subsystems to validate is the GPUs. We start by checking that the GPU type and count match what’s expected. This can catch simple problems like NVML driver mismatch errors, or the “GPU fell off the bus” errors that some have experienced. The number and type of GPUs can be checked quickly: a machine with 8x H100, for example, should report eight H100 devices.
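One way to automate this count-and-type check is to parse the device listing printed by `nvidia-smi -L`. The sketch below assumes each device appears on a line of the form `GPU <index>: <model name> (UUID: ...)`; the function names are ours, for illustration only.

```python
import re
from collections import Counter

# Assumed line format: "GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ")


def count_gpus(listing):
    """Count GPUs per model name in the text printed by `nvidia-smi -L`."""
    models = Counter()
    for line in listing.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            models[match.group(2)] += 1
    return models


def gpus_match(listing, model_substring, expected_count):
    """True if exactly `expected_count` GPUs are listed and all match the model."""
    counts = count_gpus(listing)
    matching = sum(n for model, n in counts.items() if model_substring in model)
    return matching == expected_count and sum(counts.values()) == expected_count
```

On a live node the listing could be captured with `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout` and passed to `gpus_match(listing, "H100", 8)`.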
The heart of GPU validation lies in the stress testing. For this, we utilize DCGM Diagnostics from NVIDIA as well as gpu-burn. DCGM performs a number of tests, including measuring power consumption and temperature while the GPU is under load. If any of the DCGM test cases fail, we know we’ve likely got a problem. We generally run DCGM with Apptainer: we first build the SIF file, and then run the DCGM diagnostics from it.
Another great tool for stress testing GPUs is gpu-burn. With gpu-burn we can run a long stress test, which ensures that even under sustained heavy load we don’t start to see memory errors or other failures. We expect the GPUs to handle sustained stress similar to the load they will be under when training. gpu-burn can also be run with Apptainer.
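A long gpu-burn run is easier to act on if its final summary is checked automatically. The sketch below assumes the summary contains one line per device of the form `GPU <index>: OK` or `GPU <index>: FAULTY`; treat that format as an assumption and verify it against the gpu-burn build you use.

```python
import re

# Assumed summary line format: "GPU 3: OK" or "GPU 3: FAULTY".
_RESULT = re.compile(r"GPU (\d+): (OK|FAULTY)")


def faulty_gpus(gpu_burn_output):
    """Return the sorted indices of GPUs reported as FAULTY."""
    return sorted(
        int(index)
        for index, status in _RESULT.findall(gpu_burn_output)
        if status == "FAULTY"
    )
```

An empty list means no device was flagged; any index returned identifies a GPU to investigate or replace.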
3. NVLink and NVSwitch Validation
If we have a positive result when evaluating each of the GPUs individually, then we need to make sure the GPUs can work together on a single machine. There are two main tools for this: NCCL tests and nvbandwidth.
NCCL tests can test GPU to GPU communication over NVLink when run on a single machine, and we should see that for large message sizes, we approach the unidirectional performance of NVLink. If the performance is lower than expected, or we get some errors, we can quickly diagnose problems like a bad NVSwitch or down NVLinks. Similarly the nvbandwidth tool will measure copy performance from GPU to GPU.
The nvbandwidth tool can be built from source from the NVIDIA repository https://github.com/NVIDIA/nvbandwidth and then run with default arguments. It runs a large number of tests, many of which report the speed of GPU to GPU memcpy.
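To turn the GPU to GPU copy results into a pass/fail signal, the output can be parsed and compared against a floor. This sketch assumes each result is printed as a matrix whose data rows start with the source GPU index, followed by one bandwidth value in GB/s per destination GPU, with `N/A` on the diagonal. Both the format and the threshold used in the usage note are assumptions, not values from our test suite.

```python
def parse_bandwidth_matrix(text):
    """Parse rows like ' 0   N/A  276.07' into {(src, dst): GB/s}."""
    bandwidth = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        cells = parts[1:]
        # Header rows hold integer column indices; data cells are decimals or N/A.
        if not all(cell == "N/A" or "." in cell for cell in cells):
            continue
        for dst, cell in enumerate(cells):
            if cell != "N/A":
                bandwidth[(int(parts[0]), dst)] = float(cell)
    return bandwidth


def slow_pairs(bandwidth, min_gb_per_s):
    """Return the (src, dst) pairs whose copy bandwidth is below the floor."""
    return sorted(pair for pair, value in bandwidth.items() if value < min_gb_per_s)
```

Calling `slow_pairs(parse_bandwidth_matrix(output), 250.0)` would then list the GPU pairs worth a closer look, for example when an NVLink is down.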
4. Network Validation
If the GPUs within a machine are able to communicate at full NVLink bandwidth, we’ll proceed to validating the network configuration to enable full speed distributed training. Most training clusters are built with Infiniband or RoCE networking fabrics to enable extremely fast communication between GPUs on different machines.
In order to test an Infiniband fabric, we use standard tools like ibping, ib_read_bw, ib_write_bw to test that latency and throughput are as expected.
For machine learning workloads, we are very interested in making sure that GPUDirect RDMA is working optimally, and for this we again use NCCL tests, similar to validating NVLink. This time we include multiple nodes in the NCCL test, starting with 2 nodes and going all the way up to the entire cluster. Generally we are looking for the all_reduce_perf test to show bandwidth around 92% of the theoretical maximum of the fabric, so around 370 GB/s on a 400 GB/s fabric.
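The target above is simple arithmetic: 0.92 × 400 GB/s = 368 GB/s, which rounds to the quoted figure of around 370 GB/s. A small helper, with names of our own choosing, makes the same check reusable for other fabric speeds:

```python
def busbw_target(fabric_gb_per_s, fraction=0.92):
    """Bus bandwidth (GB/s) we would like all_reduce_perf to reach."""
    return fabric_gb_per_s * fraction


def meets_target(measured_gb_per_s, fabric_gb_per_s, fraction=0.92):
    """True if the measured bus bandwidth reaches the target fraction."""
    return measured_gb_per_s >= busbw_target(fabric_gb_per_s, fraction)
```

The 92% fraction is the rule of thumb from the text rather than a hard specification, so it is left as a parameter.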
NCCL tests help identify numerous issues: by validating the entire cluster, we can find individual nodes, leaf switches, or spine switches that may be having an issue. We algorithmically test progressively larger groups, from individual nodes, to pairs of nodes, to groups of nodes, to the entire cluster, in order to quickly determine whether there are any faults with Infiniband or GPUDirect RDMA. The most common failure mode for NCCL tests is running slower than expected, for example when something is not right on the Infiniband fabric. A good result on a whole-cluster NCCL test is a good sign that the cluster will perform well running distributed training workloads.
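One possible way to generate the node groups for such a sweep is to double the group size at every level. This is only an illustrative scheme under that assumption, not necessarily the grouping we use in production, and the hostnames in the usage note are hypothetical.

```python
def node_groups(nodes):
    """Yield (group_size, groups) from single nodes up to the whole cluster.

    The group size doubles at each level; the last group of a level may be
    smaller when the node count is not a power of two.
    """
    nodes = list(nodes)
    size = 1
    while size < len(nodes):
        yield size, [nodes[i:i + size] for i in range(0, len(nodes), size)]
        size *= 2
    yield len(nodes), [nodes]
```

For `["n1", "n2", "n3", "n4"]` this yields four single-node groups, then two pairs, then the full cluster. If a level fails while the level below it passed, the fault is likely in the links or switches that join the groups.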
The nccl-tests repo should be compiled from source from the NVIDIA repo: https://github.com/NVIDIA/nccl-tests. We generally run this with Slurm, so we can easily control which hosts to schedule the job on. When running NCCL tests via a Slurm script, some of the NCCL environment variables will need to be adjusted to match the configuration on the machine.
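As a rough sketch of what such a job can look like, the helper below renders a Slurm batch script that launches `all_reduce_perf` with one rank per GPU. The job name, binary path, message-size range and default environment are assumptions for illustration. Variables such as `NCCL_IB_HCA` or `NCCL_SOCKET_IFNAME` usually need site-specific values, and depending on how Slurm and MPI are set up, `srun` may need an extra option such as `--mpi=pmix`.

```python
def render_sbatch(nodes, gpus_per_node=8, binary="./build/all_reduce_perf", env=None):
    """Render a Slurm batch script that runs all_reduce_perf on `nodes` nodes."""
    env = env if env is not None else {"NCCL_DEBUG": "INFO"}
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=nccl-all-reduce",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        "#SBATCH --exclusive",
        "",
    ]
    # Site-specific NCCL settings are exported before the launch line.
    lines += [f"export {key}={value}" for key, value in env.items()]
    # Sweep message sizes from 8 bytes to 8 GB, doubling each step, 1 GPU per rank.
    lines += ["", f"srun {binary} -b 8 -e 8G -f 2 -g 1"]
    return "\n".join(lines) + "\n"
```

Writing the result of `render_sbatch(2)` to a file and submitting it with `sbatch` covers the two-node case; increasing `nodes` step by step scales the same test up to the full cluster.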
It is also important to validate Ethernet networks, for which we use iperf3.
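When iperf3 is run with its JSON output option (`-J`), the measured throughput can be extracted programmatically. The sketch below assumes a TCP test, where the report carries the receiver-side rate under `end.sum_received.bits_per_second`; the function name is ours.

```python
import json


def received_gbps(iperf3_json):
    """Return receiver-side throughput in Gbit/s from `iperf3 -J` output."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9
```

Comparing this value with the nominal link speed of each Ethernet interface gives a quick indication of misconfigured or degraded links.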
5. Storage Validation
Storage performance is usually very important for machine learning workloads as well, so another crucial test measures it. There are many different storage configurations, each with its own performance characteristics.
To measure storage performance we use fio, which is a very flexible tool for I/O benchmarks. With fio, we build jobs that measure different scenarios like random reads, random writes, sustained reads, or sustained writes, at various block sizes. A typical job, for example, tests the read bandwidth.
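fio can emit its results as JSON (`--output-format=json`), which makes the read bandwidth easy to extract and compare across runs. The sketch below assumes the report layout of recent fio versions, where each entry in `jobs` carries a `read.bw_bytes` field in bytes per second. The example command in the comment uses real fio options, but its values are illustrative only.

```python
import json

# Example invocation (values are assumptions, adjust to the storage under test):
#   fio --name=seqread --rw=read --bs=1M --size=10G --direct=1 \
#       --numjobs=8 --group_reporting --output-format=json


def read_bandwidth_gb_per_s(fio_json):
    """Return the total read bandwidth in GB/s across all reported jobs."""
    report = json.loads(fio_json)
    total_bytes_per_s = sum(job["read"]["bw_bytes"] for job in report["jobs"])
    return total_bytes_per_s / 1e9
```

Summing over `jobs` gives the aggregate figure whether or not `--group_reporting` collapsed the jobs into a single entry.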
6. Model Build
The last phase of our acceptance testing is to run a collection of reference tasks, tailored to the use case of our customers, to ensure that they can achieve expected end-to-end performance. This phase is crucial for validating the operational integrity and performance efficiency of the GPU clusters under real-world conditions.
One popular reference task is to build a model with off-the-shelf frameworks such as PyTorch’s Fully Sharded Data Parallel (FSDP). For customers who are interested in training models at 1-10B scale, we often train a Llama-3 8B architecture and scale up its pretraining to 16 nodes, using FSDP as our distributed training backend and standard publicly available pre-training datasets for our train and validation splits.
During the training process, we monitor training throughput (tokens per second), model FLOPs utilization (MFU), GPU utilization, and network communication latencies for standard collective communications like all-reduce, in addition to a myriad of other profiling metrics available in PyTorch’s profiler.
Through this exercise we verify our cluster is able to achieve reasonable MFU performance and communication efficiency for models in the 1 to 10 billion parameter range.
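MFU can be estimated from training throughput with the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (this ignores attention FLOPs, so it slightly understates the work done). The sketch below uses that approximation; the peak figure of 989 TFLOPS per GPU (dense BF16 on an H100 SXM), the 128-GPU count (16 nodes × 8 GPUs) and the throughput in the usage note are assumptions for illustration, not measured results.

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Estimate model FLOPs utilization using the 6 * params * tokens rule."""
    achieved_flops_per_s = 6.0 * params * tokens_per_second
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)
```

For instance, `mfu(8e9, 1.0e6, 128, 989e12)` evaluates to roughly 0.38, meaning a hypothetical run at one million tokens per second would use about 38% of the assumed peak.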
7. Observability
After testing, it’s important to ensure that we are continuously monitoring for hardware failures. At Together AI, our brand is trusted by enterprises because we are able to continuously monitor 24x7 and react in case there’s a hardware failure. Acceptance testing (phases 1-6) is just the first step, but with large clusters, it’s inevitable that we need to deal with hardware failures.
To monitor our hardware, we use Telegraf, an open-source, lightweight server agent used to collect system metrics. Telegraf is highly customizable and extensible, allowing us to monitor a wide variety of system metrics to ensure maximum uptime and reliability of the hardware. We collect two types of observability metrics, cluster-level and host-level, gathered from the Telegraf metrics. The host-level metrics include the number of CPUs/GPUs on a node, CPU/GPU usage %, available memory, available disk space, network connectivity, etc.
The cluster-level metrics and dashboards are used to quickly verify cluster-wide health and to help diagnose problems. For example, a drop in the average number of GPUs in the cluster is immediately noticeable on the dashboard.
We can then use host-level metrics to pinpoint the server that has a bad GPU, indicating that a GPU “fell off the bus”.
We gather GPU temperature and power draw metrics as we’ve seen that a single GPU can get too hot and become a straggler, which will slow down an entire training run.
Another interesting metric is DNS lookup errors. We noticed that DNS lookup errors affected training runs, so we quickly added this monitor to all of our infrastructure. We now monitor the ability of our servers to perform DNS lookups to interesting domains like R2 and S3 (for dataset download) and wandb.ai (for training with Weights & Biases).
The DNS lookup error metric is a great example of Telegraf’s customizability: gathering these metrics takes only a simple config for the DNS Query plugin.
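As a hedged illustration, the helper below renders a minimal Telegraf input section for the DNS Query plugin (`inputs.dns_query`) using its `servers`, `domains` and `record_type` options. The resolver address in the usage note is an assumption, and the domains should be replaced with the endpoints your training jobs depend on.

```python
import json


def render_dns_query_input(domains, servers, record_type="A"):
    """Render a Telegraf [[inputs.dns_query]] section as TOML text."""
    # json.dumps of a list of plain ASCII strings is also a valid TOML array.
    return "\n".join([
        "[[inputs.dns_query]]",
        f"  servers = {json.dumps(list(servers))}",
        f"  domains = {json.dumps(list(domains))}",
        f'  record_type = "{record_type}"',
    ]) + "\n"
```

For example, `render_dns_query_input(["wandb.ai"], ["8.8.8.8"])` produces a section that can be dropped into the Telegraf configuration to record lookup results and query times for that domain.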
Conclusion
Acceptance testing is an indispensable practice for AI/ML startups striving to deliver top-tier computational resources. By adopting a comprehensive and structured approach to testing, companies can navigate the complexities of the hardware lottery, ensuring that their infrastructure is stable and reliable, and that it can support the types of workloads they intend to run on the GPUs. We encourage our users to run acceptance testing on the GPU clusters we deliver to them, and to flag any issues they encounter so that we can help troubleshoot.