HPC on AWS

6 Cloud Services and 8 Critical Best Practices

How Does AWS Enable High Performance Computing (HPC)?

High performance computing (HPC) differs from everyday computing in speed and processing power. An HPC system contains the same elements as a regular desktop computer, but at a scale that delivers massive processing power. Today, most HPC workloads run on massively distributed clusters of small machines rather than monolithic supercomputers.

Each machine in an HPC cluster, known as a node, typically contains multiple processors, each with two to four cores. A small HPC cluster might have four nodes with 16 cores, although most organizations are likely to use clusters containing 16-64 nodes and 64-256 cores.
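The sizing figures above are simple multiplication, which the following minimal sketch makes explicit. The layout of two dual-core processors per node is an assumption chosen because it reproduces the numbers quoted above; real nodes vary.

```python
def total_cores(nodes: int, processors_per_node: int, cores_per_processor: int) -> int:
    """Return the total core count of a cluster made of identical nodes."""
    return nodes * processors_per_node * cores_per_processor


# Assumed layout: every node has two processors with two cores each.
small_cluster = total_cores(nodes=4, processors_per_node=2, cores_per_processor=2)
mid_cluster = total_cores(nodes=16, processors_per_node=2, cores_per_processor=2)
large_cluster = total_cores(nodes=64, processors_per_node=2, cores_per_processor=2)

print(small_cluster, mid_cluster, large_cluster)
```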

Amazon Web Services (AWS) provides a variety of services that support HPC scenarios. These include Amazon EC2, which is commonly used to run massively parallel workloads, and specialized services such as AWS ParallelCluster and Amazon FSx for Lustre, which were purpose-built for HPC environments.

6 AWS High Performance Computing Services

Below are the most common AWS services used to support HPC deployments.

Amazon EC2

Amazon Elastic Compute Cloud (EC2) offers over 400 instance types, giving you a wide range of options for HPC compute. It supports all major CPU vendors (Intel, AMD, and Arm); Windows, Linux, and macOS operating systems; graphics processing units (GPUs); and up to 400 Gbps networking for fast interconnection within HPC clusters.

Pricing: offers several pricing models, including On-Demand pricing, Spot Instances, and long-term Reserved Instances.

Use cases for HPC: large-scale HPC applications, macOS workloads, cloud-native applications with HPC capabilities, and machine learning projects.

Elastic Fabric Adapter

Elastic Fabric Adapter (EFA) is a network interface for EC2 instances that accelerates inter-node communication in AWS. It provides an operating system (OS) bypass hardware interface, so applications built on libraries such as the Message Passing Interface (MPI) and the NVIDIA Collective Communications Library (NCCL) can talk to the network device directly, scaling to thousands of CPUs and GPUs. This lets you set up the equivalent of large-scale on-premises HPC clusters in the cloud.

Pricing: EFA can be enabled on supported EC2 instance types at no extra cost.

Use cases for HPC: fluid dynamics computations that require very large scale with a large number of tunable parameters, large-scale weather modeling, and large-scale deep learning models built using frameworks like TensorFlow, PyTorch, and Caffe2.

AWS ParallelCluster

AWS ParallelCluster is an open source cluster management tool designed for deploying and managing HPC clusters in the cloud. It uses a simple text-based configuration file to model and provision HPC resources in a fully automated manner. You can build HPC clusters using multiple EC2 instance types, and manage job submission and scheduling via AWS Batch or the open source Slurm scheduler.
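To give a feel for the text-based configuration, the sketch below builds a cluster definition as a Python dictionary and prints it as JSON, which is also valid YAML. The key names are modeled on the ParallelCluster version 3 configuration format and should be checked against the current documentation. The region, subnet ID, key pair name, and instance types are placeholder assumptions, and `build_cluster_config` is a hypothetical helper.

```python
import json


def build_cluster_config(subnet_id: str, key_name: str, max_nodes: int = 16) -> dict:
    """Build a Slurm cluster definition (key names assumed from ParallelCluster 3)."""
    return {
        "Region": "us-east-1",  # placeholder region
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "c5.xlarge",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {
                    "Name": "compute",
                    "ComputeResources": [
                        {
                            "Name": "c5n18xl",
                            "InstanceType": "c5n.18xlarge",
                            "MinCount": 0,  # scale to zero when idle
                            "MaxCount": max_nodes,
                            "Efa": {"Enabled": True},
                        }
                    ],
                    "Networking": {
                        "SubnetIds": [subnet_id],
                        "PlacementGroup": {"Enabled": True},
                    },
                }
            ],
        },
    }


# Placeholder identifiers; replace with values from your own account.
config = build_cluster_config("subnet-0123456789abcdef0", "my-key-pair")
print(json.dumps(config, indent=2))
```

A file with this content would typically be passed to the `pcluster create-cluster` command through its `--cluster-configuration` option, together with a `--cluster-name`.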

Pricing: Offered at no extra cost—you pay only for AWS resources that run your clusters.

Use cases for HPC: production HPC workloads that need specific compute, storage, and networking resources, and rapid prototyping of HPC clusters without requiring custom scripting or complex configurations.

Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed file system for huge-scale compute workloads. It is based on Lustre, a popular, open source, high-performance file system that supports sub-millisecond latency and throughputs up to hundreds of GB/s. The service integrates with Amazon S3, meaning your workloads can pull data from S3 to a high-performance Lustre-based file system.
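As an illustration of the S3 integration, the sketch below assembles the parameters for creating a scratch Lustre file system that imports objects from an S3 bucket. The parameter names follow the `create_file_system` call of the boto3 `fsx` client. The bucket name, subnet ID, the `SCRATCH_2` deployment type, and the 1,200 GiB capacity are assumptions for the example, and `lustre_request` is a hypothetical helper. Nothing is sent to AWS unless you call `create_lustre_file_system`, which needs boto3 and credentials.

```python
def lustre_request(subnet_id: str, s3_import_path: str, capacity_gib: int = 1200) -> dict:
    """Keyword arguments for fsx.create_file_system (scratch Lustre linked to S3)."""
    if not s3_import_path.startswith("s3://"):
        raise ValueError("s3_import_path must look like s3://bucket/prefix")
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,  # in GiB
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            "DeploymentType": "SCRATCH_2",
            "ImportPath": s3_import_path,  # objects here appear as files in Lustre
        },
    }


def create_lustre_file_system(request: dict) -> str:
    """Send the request and return the new file system ID (requires boto3)."""
    import boto3  # third-party; imported here so the sketch runs without it

    fsx = boto3.client("fsx")
    return fsx.create_file_system(**request)["FileSystem"]["FileSystemId"]


# Placeholder identifiers; replace with values from your own account.
request = lustre_request("subnet-0123456789abcdef0", "s3://my-hpc-input-data")
print(request)
```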

Pricing: based on storage volume, performance tiers, and number of operations performed on the data.

Use cases for HPC: HPC workloads like oil and gas discovery and genome analysis which need to process very large datasets, and machine learning models with massive training data.

AWS Batch

AWS Batch lets you run a large number of batch computing jobs on AWS resources. It automatically provisions the optimal compute and memory resources according to the specific requirements of each batch job. You can submit up to hundreds of thousands of batch jobs and have AWS Batch execute them dynamically on Amazon EC2 (including Spot Instances) and AWS Fargate.
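One common way to submit many similar jobs is an array job, where a single submission fans out into numbered child jobs. The sketch below splits a large task count into array job requests shaped for the `submit_job` call of the boto3 `batch` client. The array size range of 2 to 10,000 is an assumption based on the documented limits at the time of writing, and the queue and job definition names are placeholders. `build_array_job_requests` is a hypothetical helper.

```python
def build_array_job_requests(job_name: str, job_queue: str, job_definition: str,
                             total_tasks: int, max_array_size: int = 10_000) -> list:
    """Split total_tasks into submit_job keyword-argument dicts."""
    requests = []
    remaining = total_tasks
    part = 0
    while remaining > 0:
        size = min(remaining, max_array_size)
        request = {
            "jobName": f"{job_name}-{part}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
        }
        # Array jobs need at least two children; a single leftover task
        # is submitted as an ordinary job instead.
        if size >= 2:
            request["arrayProperties"] = {"size": size}
        requests.append(request)
        remaining -= size
        part += 1
    return requests


def submit_all(requests: list) -> list:
    """Submit every request and return the job IDs (requires boto3)."""
    import boto3  # third-party; imported here so the sketch runs without it

    batch = boto3.client("batch")
    return [batch.submit_job(**request)["jobId"] for request in requests]


# Placeholder names; 25,000 tasks become three array jobs.
requests = build_array_job_requests("risk-sim", "hpc-queue", "risk-sim-def", 25_000)
print([r["arrayProperties"]["size"] for r in requests])
```

Each child job can read the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable to decide which slice of the input it should process.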

Pricing: Offered at no charge—you pay only for the Amazon resources used to run your batch jobs.

Use cases for HPC: financial services HPC jobs in fields like pricing and risk management; life sciences projects such as genomics, clinical modeling, and molecular dynamics.

NICE DCV

NICE DCV is a remote display protocol for teams running workloads on AWS. Instead of running expensive workstations on-premises, users can run graphics-intensive applications on EC2 instances and stream the user interface to a thin client machine. The protocol secures transmissions with AES-256 encryption and supports clients with up to four monitors operating at 4K resolution.

Pricing: NICE DCV is offered at no charge for any resources running on EC2.

Use cases for HPC: 3D graphics visualization in fields like life sciences, design and engineering; enabling browser-based access to HPC applications; and building remote applications for control and visualization of HPC workloads.

8 Best Practices for Running HPC on AWS

Here are some best practices to help you make the most of AWS for HPC:

Use a Placement Group

Cluster placement groups are logical groupings of instances in the same Availability Zone (AZ). You should use a cluster placement group for applications that can benefit from high network throughput or low network latency, especially when most network traffic is limited to the instances within the group.
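A minimal sketch of this practice with boto3 follows. It builds the parameters for `create_placement_group` and `run_instances` on the `ec2` client. The AMI ID, group name, and node count are placeholder assumptions, and the helper functions are hypothetical. Setting `MinCount` equal to `MaxCount` asks EC2 to launch all nodes in one request or none, which improves the odds of the instances being placed close together.

```python
def placement_group_request(group_name: str) -> dict:
    """Keyword arguments for ec2.create_placement_group."""
    return {"GroupName": group_name, "Strategy": "cluster"}


def launch_request(ami_id: str, instance_type: str, count: int, group_name: str) -> dict:
    """Keyword arguments for ec2.run_instances targeting the placement group."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,  # all-or-nothing launch
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }


def launch_cluster(ami_id: str, instance_type: str, count: int, group_name: str) -> dict:
    """Create the group and launch into it (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the sketch runs without it

    ec2 = boto3.client("ec2")
    ec2.create_placement_group(**placement_group_request(group_name))
    return ec2.run_instances(**launch_request(ami_id, instance_type, count, group_name))


# Placeholder identifiers; replace with values from your own account.
pg_request = placement_group_request("hpc-cluster-pg")
run_request = launch_request("ami-0123456789abcdef0", "c5n.18xlarge", 4, "hpc-cluster-pg")
print(pg_request)
print(run_request)
```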

Avoid Hyper-Threading (HT)

Intel Hyper-Threading (HT) Technology allows multiple threads to run simultaneously on the same Intel Xeon CPU core, with each thread exposed as a vCPU on an EC2 instance. For HPC jobs, hyper-threading can reduce performance, especially if the job is dominated by floating-point calculations: the two threads on each core share the same floating-point unit and can block each other. Make sure HT is disabled on your EC2 instances.
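One way to disable hyper-threading is at launch time, through the `CpuOptions` parameter of the EC2 `run_instances` call. The sketch below derives that parameter from an instance's vCPU count. The figure of 72 vCPUs for c5n.18xlarge comes from the published instance specifications, and `cpu_options_without_hyperthreading` is a hypothetical helper.

```python
def cpu_options_without_hyperthreading(vcpus: int, threads_per_core: int = 2) -> dict:
    """Return a CpuOptions value that exposes one thread per physical core."""
    if vcpus <= 0 or vcpus % threads_per_core != 0:
        raise ValueError("vcpus must be a positive multiple of threads_per_core")
    return {
        "CoreCount": vcpus // threads_per_core,  # keep every physical core
        "ThreadsPerCore": 1,                     # one thread per core disables HT
    }


# c5n.18xlarge exposes 72 vCPUs, which is 36 physical cores with two threads each.
cpu_options = cpu_options_without_hyperthreading(72)
print(cpu_options)

# The value would be passed along with the other launch parameters, for example:
# ec2.run_instances(..., InstanceType="c5n.18xlarge", CpuOptions=cpu_options)
```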

Use a Fast Inter-Node Connection

EC2 P3dn and C5n instances can provide network throughput of up to 100 Gbps. They also offer a higher packets-per-second ceiling, which benefits simulations, applications that rely on fast communication, data lakes, and memory caches.

Use Amazon EC2 P3 or G3 Instances with GPU for Graphics Tasks

P3 instances provide up to eight NVIDIA V100 Tensor Core GPUs for HPC and machine learning (ML) applications, with network throughput of up to 100 Gbps on the P3dn variant. G3 instances use NVIDIA Tesla M60 GPUs, each with 8 GiB of GPU memory and 2,048 parallel processing cores.

Related content: Read our guide to HPC GPU

Use the Elastic Fabric Adapter (EFA)

This network interface for EC2 instances lets customers run HPC applications that require intensive communication between instances.
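EFA is requested per network interface when an instance is launched. The sketch below builds one entry for the `NetworkInterfaces` parameter of the EC2 `run_instances` call, using the `InterfaceType` value `efa`. The subnet and security group IDs are placeholders, and `efa_network_interface` is a hypothetical helper. The instance type must support EFA, and the security group is expected to allow traffic between its own members.

```python
def efa_network_interface(subnet_id: str, security_group_ids: list,
                          device_index: int = 0) -> dict:
    """Return one NetworkInterfaces entry that attaches an EFA to the instance."""
    if not security_group_ids:
        raise ValueError("at least one security group is required")
    return {
        "DeviceIndex": device_index,
        "SubnetId": subnet_id,
        "Groups": list(security_group_ids),
        "InterfaceType": "efa",
    }


# Placeholder identifiers; replace with values from your own account.
efa_interface = efa_network_interface(
    "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"]
)
print(efa_interface)

# The entry would be passed along with the other launch parameters, for example:
# ec2.run_instances(..., InstanceType="c5n.18xlarge", NetworkInterfaces=[efa_interface])
```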

Manage Clusters with AWS ParallelCluster

This open-source tool is fully maintained and supported, making it easy for researchers and IT admins to manage HPC clusters in AWS.

Process Data with a Parallel File System

You can use Amazon FSx for Lustre for ML and HPC workloads. This fully managed service lets you launch a Lustre file system that processes massive volumes of data at a throughput of hundreds of GB/s and millions of IOPS, with sub-millisecond latencies.

Choose the Right Instance Type

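Use the largest available instance size in the family that suits your workload, for example the c5n.18xlarge compute-optimized instance or the m5.24xlarge general purpose instance. Larger instances keep more processes on the same node, which reduces the amount of communication that has to cross the network. Use compute-optimized instances for CPU-bound applications that require high-performance processors.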
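To see how instance size affects cluster size, the sketch below estimates how many instances a job needs when it runs one process per physical core. The vCPU counts (72 for c5n.18xlarge, 96 for m5.24xlarge) come from the published instance specifications. The assumption of two threads per core, the job size of 1,000 processes, and the `instances_needed` helper are all illustrative.

```python
import math

# vCPU counts from the published EC2 instance specifications.
VCPUS = {"c5n.18xlarge": 72, "m5.24xlarge": 96}


def instances_needed(processes: int, instance_type: str, threads_per_core: int = 2) -> int:
    """Instances required to give each process its own physical core."""
    physical_cores = VCPUS[instance_type] // threads_per_core
    return math.ceil(processes / physical_cores)


# A hypothetical job with 1,000 processes (for example, MPI ranks).
c5n_count = instances_needed(1000, "c5n.18xlarge")
m5_count = instances_needed(1000, "m5.24xlarge")
print(c5n_count, m5_count)
```

Fewer, larger nodes mean fewer network hops between processes, although the right choice still depends on memory per core and price.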

HPC on AWS with Run:AI

Run:AI automates resource management and orchestration for HPC clusters that use GPU hardware in AWS, other public clouds, and on-premises. With Run:AI, you can automatically run as many compute-intensive workloads as needed.

Here are some of the capabilities you gain when using Run:AI:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies HPC infrastructure, helping teams accelerate their productivity and conserve costs by running more jobs on fewer resources.

Learn more about the Run:AI GPU virtualization platform.