Question 1

How Does AWS Enable High Performance Computing (HPC)?

Accepted Answer

High performance computing (HPC) differs from everyday computing in speed and processing power. An HPC system contains the same elements as a regular desktop computer, but combines them at a scale that delivers massive processing power. Today, most HPC workloads run on massively distributed clusters of small machines rather than monolithic supercomputers.

Each machine in an HPC cluster, known as a node, typically contains multiple processors, each with two to four cores. A small HPC cluster might have four nodes with 16 cores in total, although most organizations are likely to use clusters containing 16-64 nodes and 64-256 cores.
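The cluster sizes above follow from simple multiplication. Here is a minimal sketch of that arithmetic, assuming identical nodes with two processors of two cores each (an illustrative layout that happens to match the quoted totals, not one the figures prescribe):

```python
def total_cores(nodes: int, processors_per_node: int, cores_per_processor: int) -> int:
    """Return the total core count of a cluster built from identical nodes."""
    return nodes * processors_per_node * cores_per_processor


# Small cluster from the text: 4 nodes, 16 cores in total.
# Assumed layout: 2 processors per node, 2 cores per processor.
print(total_cores(nodes=4, processors_per_node=2, cores_per_processor=2))  # 16

# Typical range from the text: 16-64 nodes with the same assumed layout.
print(total_cores(16, 2, 2), total_cores(64, 2, 2))  # 64 256
```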

Amazon Web Services (AWS) provides a variety of services that support HPC scenarios. These include Amazon EC2, which is commonly used to run massively parallel workloads, and specialized services like AWS ParallelCluster and Amazon FSx for Lustre, which were purpose-built for HPC environments.

Question 2

6 AWS High Performance Computing Services

Accepted Answer

Below are the most common AWS services used to support HPC deployments.

Amazon EC2
Amazon Elastic Compute Cloud (EC2) offers over 400 instance types, giving you a wide range of options for HPC compute instances. It supports all major CPU vendors (Intel, AMD, and Arm), Windows, Linux, and macOS, graphics processing units (GPUs), and up to 400 Gbps networking for fast interconnection within HPC clusters.

Pricing: offers several pricing models, including On-Demand pricing, Spot Instances, and long-term Reserved Instances.

Use cases for HPC: large-scale HPC applications, macOS workloads, cloud-native applications with HPC capabilities, and machine learning projects.

Elastic Fabric Adapter
Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that accelerates inter-node communication. It provides an operating system bypass capability that lets applications built on libraries like the Message Passing Interface (MPI) and the NVIDIA Collective Communications Library (NCCL) communicate directly with the network hardware, scaling up to thousands of CPUs and GPUs. This lets you set up the equivalent of large-scale on-premises HPC clusters in the cloud.

Pricing: EFA is available on supported EC2 instance types at no extra cost.

Use cases for HPC: fluid dynamics computations that require very large scale with a large number of tunable parameters, large-scale weather modeling, and large-scale deep learning models built using frameworks like TensorFlow, PyTorch, and Caffe2.
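To make the EFA setup more concrete, the sketch below builds the request parameters for launching EC2 instances with an EFA network interface. The helper names `efa_launch_params` and `launch`, and all resource IDs, are hypothetical placeholders. The parameter keys follow the boto3 `run_instances` call as I understand it, so check them against the current documentation before relying on this.

```python
def efa_launch_params(ami_id, instance_type, subnet_id, security_group_id, count):
    """Build run_instances parameters that attach an EFA as the primary interface."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,  # must be an EFA-capable instance type
        "MinCount": count,
        "MaxCount": count,
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "InterfaceType": "efa",  # request an EFA instead of a standard ENI
                "SubnetId": subnet_id,
                "Groups": [security_group_id],
            }
        ],
    }


def launch(params, region):
    """Send the request to EC2. Needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("ec2", region_name=region).run_instances(**params)


# Placeholder IDs for illustration only.
params = efa_launch_params("ami-EXAMPLE", "c5n.18xlarge", "subnet-EXAMPLE", "sg-EXAMPLE", 4)
print(params["NetworkInterfaces"][0]["InterfaceType"])  # efa
```

Keeping the parameter builder separate from the API call means the request can be inspected or unit-tested without touching an AWS account.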

AWS ParallelCluster
AWS ParallelCluster is a service especially designed for management of HPC clusters in the cloud. It provides a simple text-based configuration to model and provision HPC resources in a fully automated manner. You can build HPC clusters using multiple EC2 instance types, and manage job submission and scheduling via AWS Batch or the open source Slurm scheduler.

Pricing: Offered at no extra cost—you pay only for AWS resources that run your clusters.

Use cases for HPC: production HPC workloads that need specific compute, storage, and networking resources, and rapid prototyping of HPC clusters without requiring custom scripting or complex configurations.
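As a rough illustration of what a text-based cluster definition can contain, the sketch below models a minimal Slurm cluster as a Python dictionary. The key names mirror the AWS ParallelCluster version 3 configuration schema as I understand it. The helper name `cluster_config`, the instance types, and the subnet and key names are placeholders. The real configuration file is typically written in YAML, and the JSON printed here is only for inspection.

```python
import json


def cluster_config(region, subnet_id, key_name, compute_type, max_nodes):
    """Return a dict mirroring a minimal Slurm cluster definition (assumed schema)."""
    return {
        "Region": region,
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "t3.medium",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {
                    "Name": "compute",
                    "ComputeResources": [
                        {
                            "Name": "nodes",
                            "InstanceType": compute_type,
                            "MinCount": 0,  # scale down to zero when idle
                            "MaxCount": max_nodes,
                        }
                    ],
                    "Networking": {"SubnetIds": [subnet_id]},
                }
            ],
        },
    }


# Placeholder values for illustration only.
config = cluster_config("us-east-1", "subnet-EXAMPLE", "my-key", "c5n.18xlarge", 16)
print(json.dumps(config, indent=2))
```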

Amazon FSx for Lustre
Amazon FSx for Lustre is a fully managed file system for huge-scale compute workloads. It is based on Lustre, a popular, open source, high-performance file system that supports sub-millisecond latency and throughputs up to hundreds of GB/s. The service integrates with Amazon S3, meaning your workloads can pull data from S3 to a high-performance Lustre-based file system.

Pricing: based on storage volume, performance tiers, and number of operations performed on the data.

Use cases for HPC: HPC workloads like oil and gas discovery and genome analysis which need to process very large datasets, and machine learning models with massive training data.
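The S3 integration can be sketched as the request parameters for a Lustre file system that imports from a bucket. This is a sketch under assumptions: the helper name `lustre_filesystem_params` and the bucket and subnet names are hypothetical, and the keys follow the boto3 FSx `create_file_system` call as I understand it, using a scratch deployment type.

```python
def lustre_filesystem_params(bucket, subnet_id, capacity_gib=1200):
    """Build create_file_system parameters for a Lustre file system linked to S3."""
    if not bucket:
        raise ValueError("bucket name must not be empty")
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,  # in GiB
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            "DeploymentType": "SCRATCH_2",
            # Objects in this bucket become visible as files in the file system.
            "ImportPath": f"s3://{bucket}",
        },
    }


# Placeholder names for illustration only.
params = lustre_filesystem_params("my-genomics-data", "subnet-EXAMPLE")
print(params["LustreConfiguration"]["ImportPath"])  # s3://my-genomics-data
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("fsx").create_file_system(**params)`.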

AWS Batch
AWS Batch lets you run a large number of batch computing jobs on AWS resources. It automatically provisions the optimal compute and memory resources according to the specific requirements of each batch job. AWS Batch lets you specify up to hundreds of thousands of batch jobs and execute them dynamically across AWS compute options like EC2, Fargate, and Spot Instances.

Pricing: Offered at no charge—you pay only for the Amazon resources used to run your batch jobs.

Use cases for HPC: financial services HPC jobs in fields like pricing and risk management; life sciences projects such as genomics, clinical modeling, and molecular dynamics.
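One common way to fan out many similar jobs is an array job, where a single submission spawns many child jobs. The sketch below builds such a request. The helper name `array_job_params` and the queue and job definition names are hypothetical. The keys follow the boto3 Batch `submit_job` call as I understand it, and the 2 to 10,000 size range is the documented array job limit at the time of writing.

```python
MAX_ARRAY_SIZE = 10_000  # assumed upper limit for one array job


def array_job_params(name, queue, definition, size):
    """Build submit_job parameters for an array job with `size` child jobs."""
    if not 2 <= size <= MAX_ARRAY_SIZE:
        raise ValueError(f"array size must be between 2 and {MAX_ARRAY_SIZE}")
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        # Each child job can read its index from AWS_BATCH_JOB_ARRAY_INDEX.
        "arrayProperties": {"size": size},
    }


# Placeholder names for illustration only.
params = array_job_params("risk-simulation", "hpc-queue", "risk-sim-def", 5000)
print(params["arrayProperties"]["size"])  # 5000
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("batch").submit_job(**params)`. Workloads larger than one array job allows would need several submissions.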

NICE DCV
NICE DCV is a remote desktop service for teams running workloads on AWS. Instead of running expensive workstations on-premises, the service enables users to run graphics-intensive applications directly in EC2, and stream the user interface to a thin client machine. The protocol secures transmissions with AES-256 encryption and supports clients with up to 4 monitors operating at 4K resolution.

Pricing: NICE DCV is offered at no charge for any resources running on EC2.

Use cases for HPC: 3D graphics visualization in fields like life sciences, design and engineering; enabling browser-based access to HPC applications; and building remote applications for control and visualization of HPC workloads.
