Brief IA

AWS and NVIDIA: Revolutionizing AI Model Training

🔬 Research · Tom Levy

Key Takeaways
1. AWS integrates advanced NVIDIA GPUs to optimize large-scale AI model training.
2. The new EC2 instances, such as P5 and P6, offer enhanced computing and communication capabilities.
3. The AWS infrastructure relies on open-source software to improve resource management and observability.
💡 Why it matters: This synergy between AWS and NVIDIA accelerates AI development, making models more efficient and accessible.

Full Analysis

AWS and NVIDIA: An Alliance for the AI of Tomorrow

For a long time, "scaling" in foundational models primarily meant one thing: spending more compute on pre-training and watching capabilities improve. This intuition was supported by empirical work such as Kaplan et al. (2020), which reported predictable power-law trends in loss as model parameters, dataset size, and training compute increased. In practice, these trends justified sustained investment in large-scale accelerator capacity and the distributed infrastructure needed to use it effectively.
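As a reminder of what "power-law trend" means here, the relationships reported by Kaplan et al. (2020) have the following general shape when one factor is the bottleneck. The symbols are shorthand introduced for this sketch: L is the loss, X stands for parameters N, dataset size D, or training compute C, and X_c and α_X are constants fitted empirically (their values are omitted here).

```latex
L(X) = \left( \frac{X_c}{X} \right)^{\alpha_X}, \qquad X \in \{N, D, C\}
```

On a log-log plot this is a straight line with slope −α_X, which is why these trends could be extrapolated to plan larger training runs.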

However, the frontier has shifted, and scaling is no longer a single curve. NVIDIA's framing of "three scaling laws" usefully highlights that, beyond pre-training, performance increasingly improves through post-training (for example, supervised fine-tuning (SFT) and reinforcement learning (RL) methods) and through test-time compute ("long thinking," search and verification, multi-sample strategies).

These scaling regimes push the lifecycle of foundational models—pre-training, post-training, and inference—toward converging infrastructure requirements: tightly coupled accelerator computing, high-bandwidth and low-latency networking, and distributed storage. They also raise the importance of orchestration for resource management and observability at the application and hardware levels to maintain cluster health and diagnose large-scale performance pathologies.

Another key trend is the increasing dependence of the foundational model lifecycle on an ecosystem of open-source software (OSS) that encompasses model development frameworks, cluster resource management, and operational tools. At the cluster level, resource management is typically handled by systems such as Slurm and Kubernetes. Model development and distributed training are often implemented in frameworks such as PyTorch and JAX. Monitoring and visualization—i.e., observability—are often carried out using Prometheus for metric collection and Grafana for visualization and alerting, positioned as an operational layer above infrastructure and resource management.

This post is aimed at machine learning engineers and researchers who train and serve foundational models, with a particular focus on workflows built on OSS frameworks. It analyzes how AWS infrastructure (multi-node accelerated computing, high-bandwidth and low-latency networking, distributed shared storage, and associated managed services) interacts with common OSS stacks throughout the foundational model lifecycle. The primary goal is to provide a technical foundation for understanding system bottlenecks and scaling characteristics across pre-training, post-training, and inference. As the introduction to the series, it outlines the overall system architecture and emphasizes the integration points between AWS infrastructure components and the OSS tools that support large-scale distributed training and inference.

AWS Building Blocks

The remainder of this series examines how this layered architecture is realized on AWS, progressing through infrastructure, resource orchestration, the ML software stack, and observability. The following sections provide an overview of each layer.

Infrastructure: Computing, Networking, and Storage

As illustrated, the infrastructure relies on three coupled building blocks: accelerated computing with large device memory, high-bandwidth interconnect for collective communication, and scalable distributed storage for data and checkpoints.

Accelerated computing forms the foundation for pre-training, post-training, and inference of large-scale foundational models. AWS offers several generations of NVIDIA GPUs as part of its Amazon EC2 accelerated computing instances, including the Amazon EC2 P instance family. The P5 instance family includes p5.48xlarge with eight NVIDIA H100 GPUs, p5.4xlarge with a single H100 GPU for smaller-scale workloads, and p5e.48xlarge/p5en.48xlarge variants with NVIDIA H200 GPUs. The P6 family introduces the NVIDIA Blackwell B200 architecture with p6-b200.48xlarge and Blackwell Ultra B300 with p6-b300.48xlarge.

Across these generations, the dominant scaling axes are peak Tensor Core throughput, HBM capacity and bandwidth, and interconnect bandwidth (within and between nodes).

To a first approximation, peak Tensor Core throughput, measured in floating-point operations per second (FLOPS), places these accelerators on a common axis. The table below summarizes peak per-GPU throughput for dense BF16/FP16 and FP8 Tensor operations, along with HBM capacity and HBM bandwidth, using SXM/HGX-class specifications that match NVSwitch/NVLink-based multi-GPU nodes.

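Table: peak dense Tensor throughput per GPU (cell values not recoverable from the source).
  • Columns: GPU (representative variant); BF16/FP16 Tensor peak (dense); FP8 Tensor peak (dense); FP4 Tensor peak (dense)
  • Rows: B200 (HGX, per GPU); B300 (HGX, per GPU)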

Note: NVIDIA product tables often report Tensor throughput with sparsity enabled; this table reports dense throughput. Where only a sparse figure is published, dense throughput is taken as half of the sparse figure, following NVIDIA's guidance for HGX-class platforms. DGX figures are reported at the system level, so per-GPU HBM capacity and bandwidth for B200 are obtained by dividing the DGX totals by eight.
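The two normalizations described in the note can be sketched in a few lines of Python. The function names and the input numbers are hypothetical and were chosen only to make the arithmetic easy to follow; they are not product specifications.

```python
def dense_from_sparse(sparse_peak: float) -> float:
    """Dense Tensor peak, assuming the published sparse figure is exactly 2x dense."""
    return sparse_peak / 2.0


def per_gpu_from_system(system_total: float, gpus_per_system: int = 8) -> float:
    """Per-GPU value from a system-level total (eight GPUs per DGX system here)."""
    return system_total / gpus_per_system


# Placeholder inputs (assumptions for illustration, not datasheet values).
example_sparse_pflops = 4.0
example_system_hbm_gb = 1600.0

print(dense_from_sparse(example_sparse_pflops))    # 2.0
print(per_gpu_from_system(example_system_hbm_gb))  # 200.0
```

Keeping both conversions explicit makes it easier to check that every row of a comparison table uses the same basis (dense, per GPU).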

As models scale, step time is often dominated by collective communication and memory movement rather than raw compute throughput, which motivates explicit accounting for bandwidth when scaling.

For multi-GPU instances, GPU communication spans two regimes. Scale-up (internal scaling over NVLink/NVSwitch) provides high-bandwidth, low-latency GPU-to-GPU connectivity within a node, allowing collectives such as all-reduce and all-gather to execute without traversing the host networking stack. Scale-out (external scaling over EFA) provides an OS-bypass network between nodes, which AWS uses as a foundational element of Amazon EC2 UltraClusters, where communication-heavy collectives span thousands of instances.
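To see why the two regimes matter for step time, a rough cost model can help. The sketch below uses the standard ring all-reduce estimate, in which each of the N ranks sends and receives 2(N − 1) chunks of size S/N. It is a simplification that ignores overlap with compute, topology, and protocol overheads, and the bandwidth figures are hypothetical placeholders rather than instance specifications.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    num_ranks: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Estimate ring all-reduce time for one message.

    bandwidth_bytes_per_s is the per-rank send bandwidth to its ring neighbor.
    """
    if num_ranks < 2:
        return 0.0  # nothing to reduce across
    steps = 2 * (num_ranks - 1)  # reduce-scatter phase + all-gather phase
    chunk_bytes = message_bytes / num_ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Hypothetical per-rank bandwidths, for illustration only.
SCALE_UP_BW = 400e9   # bytes/s inside a node (assumed)
SCALE_OUT_BW = 50e9   # bytes/s between nodes (assumed)

gradient_bytes = 10e9  # 10 GB of gradients (assumed)
t_up = ring_allreduce_seconds(gradient_bytes, 8, SCALE_UP_BW)
t_out = ring_allreduce_seconds(gradient_bytes, 8, SCALE_OUT_BW)
print(f"scale-up: {t_up:.4f} s, scale-out: {t_out:.4f} s")
```

Under this model the time is inversely proportional to bandwidth, so the same collective takes several times longer once it has to cross the slower link.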

The following table summarizes key specifications for these types of instances:

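Table: interconnect bandwidth per instance type (cell values not recoverable from the source).
  • Columns: NVLink BW (aggregated); EFA BW (aggregated)
  • Rows: p6-b200.48xlarge; p6-b300.48xlarge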

Note: EFA bandwidth is converted from Gbps to GB/s (÷8) for consistency with other bandwidth metrics; see the networking specifications for EC2 accelerated computing instances. NVLink and EFA bandwidth figures are presented as aggregated values per instance rather than per link.
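The unit conversion in the note is a division by eight (bits to bytes). A minimal sketch, with a hypothetical function name and an illustrative input value:

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


# Illustrative input: an aggregate network bandwidth of 3200 Gbps.
print(gbps_to_gigabytes_per_s(3200.0))  # 400.0
```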

The Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 that provides remote direct memory access (RDMA) capability with operating-system bypass, built on the Scalable Reliable Datagram (SRD) protocol. Because applications communicate directly with the network device through the Libfabric API instead of going through the operating system kernel, EFA reduces latency and improves throughput for collective operations in distributed training.

Several generations of EFA are available across different instance families. Amazon EC2 P5 and P5e instances are equipped with EFA version 2 (EFAv2). EFA version 3 (EFAv3), provided on P5en instances, reduces packet latency by approximately 35% compared to EFAv2. EFA version 4 (EFAv4), available on P6 instances, offers an additional 18% improvement in collective communication performance compared to EFAv3.

At scale, both distributed training (streaming training corpora and writing multi-terabyte checkpoints) and large-scale inference (staging weights and managing KV cache growth) motivate a multi-tiered storage hierarchy: local NVMe SSD for hot data, Lustre for high-throughput shared access, and Amazon S3 for durable persistence.

In the main multi-GPU instances of this series, local NVMe is provided as instance storage (ephemeral) with a raw capacity of 30.72 TB (8 × 3.84 TB NVMe SSD).

Lustre is an open-source, POSIX-compliant distributed file system widely used in high-performance computing (HPC) to provide a shared namespace with high aggregate throughput across many clients. Amazon FSx for Lustre offers Lustre as a fully managed parallel file system capable of multiple terabytes per second of throughput, millions of IOPS, and sub-millisecond latencies. Data Repository Associations integrate the file system with Amazon S3, supporting lazy loading of training datasets and automatic export of checkpoints for durability.
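A back-of-the-envelope estimate shows why checkpoint traffic pushes toward a high-throughput shared tier. The sketch below assumes mixed-precision training with an Adam-style optimizer at 14 bytes per parameter (2 for BF16 weights, 4 for FP32 master weights, 4 + 4 for the two FP32 optimizer moments). That byte count, the model size, and the throughput figure are all assumptions for illustration; real checkpoint layouts vary by framework.

```python
BYTES_PER_PARAM_ASSUMED = 14  # BF16 weights + FP32 master copy + two FP32 Adam moments


def checkpoint_bytes(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_ASSUMED) -> int:
    """Approximate size of a full training checkpoint in bytes."""
    return num_params * bytes_per_param


def write_seconds(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Lower bound on write time, assuming the file system sustains the given throughput."""
    return size_bytes / throughput_bytes_per_s


params = 70_000_000_000        # hypothetical 70B-parameter model
size = checkpoint_bytes(params)
seconds = write_seconds(size, 100e9)  # hypothetical 100 GB/s aggregate throughput
print(f"checkpoint = {size / 1e12:.2f} TB, write time >= {seconds:.1f} s")
```

The same arithmetic, applied to larger models or more frequent checkpoints, indicates how much aggregate file system throughput is needed to keep checkpoint stalls short relative to step time.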

At the cluster scale, these instances are deployed in Amazon EC2 UltraClusters, which provision thousands of accelerated instances as a single tightly placed cluster within an availability zone and interconnect them using a non-blocking petabit-scale network.

For workloads with high communication intensity per step (for example, expert parallelism in MoE models, where all-to-all token dispatch spans many GPUs), the size of the NVLink domain can become a first-order constraint. As an extension of the scale-up (internal scaling) axis, a larger NVLink domain reduces how often performance-critical communication must leave the NVLink fabric.

Amazon EC2 UltraServers extend the NVLink domain beyond a single EC2 instance by connecting multiple component instances via a dedicated accelerator interconnect. AWS reports that P6e-GB200 UltraServers are built on the NVIDIA GB200 NVL72 platform and expose up to 72 Blackwell GPUs and 13.4 TB of aggregated HBM3e within one NVLink domain.
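The effect of NVLink domain size on an expert-parallel all-to-all can be illustrated with a small sketch. It assumes ranks of one expert-parallel group are placed contiguously and aligned to domain boundaries, and it simply counts what fraction of a rank's peers sit inside the same NVLink domain. The function names and the group size of 64 are hypothetical; the 72-GPU and 13.4 TB figures come from the paragraph above, with TB taken as 10^12 bytes.

```python
def fraction_of_peers_in_domain(ep_group_size: int, nvlink_domain_size: int) -> float:
    """Share of a rank's expert-parallel peers reachable without leaving NVLink.

    Assumes contiguous, domain-aligned placement of the group's ranks.
    """
    if ep_group_size < 2:
        return 1.0  # no peers, so nothing leaves the domain
    local_peers = min(ep_group_size, nvlink_domain_size) - 1
    return local_peers / (ep_group_size - 1)


def hbm_per_gpu_gb(total_hbm_tb: float, num_gpus: int) -> float:
    """Average HBM per GPU in GB, given an aggregate in TB (decimal units)."""
    return total_hbm_tb * 1e12 / num_gpus / 1e9


ep_group = 64  # hypothetical expert-parallel group size
print(fraction_of_peers_in_domain(ep_group, 8))    # 8-GPU NVLink domain
print(fraction_of_peers_in_domain(ep_group, 72))   # 72-GPU NVLink domain
print(round(hbm_per_gpu_gb(13.4, 72), 1))          # average HBM per GPU in GB
```

With an 8-GPU domain most peers of a 64-way group are reached over the inter-node network, whereas a 72-GPU domain keeps the whole group on NVLink under these placement assumptions.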
