NVIDIA HGX

4 Key Features, Architecture, and Use Cases

What Is NVIDIA HGX A100?

The NVIDIA HGX A100 is a computing platform featuring the latest generation of A100 80GB GPUs. A fully configured 16-GPU HGX A100 system offers up to 1.3 terabytes (TB) of GPU memory. It is designed to accelerate workloads like high-performance computing (HPC), AI, and data analytics. For example, it could be used for genomic sequencing in HPC, real-time speech recognition in AI, or fraud detection in data analytics.

HGX A100 integrates NVIDIA's most advanced GPU architecture, Ampere, with high-speed memory and high-bandwidth interconnect technology. The HGX A100 combines computing power with efficiency and versatility. Using NVIDIA’s innovative multi-instance GPU (MIG) technology, each GPU can be partitioned into as many as seven independent GPU instances to simultaneously handle varied workloads. This means organizations can increase utilization and get more out of their investment in the platform.
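To make the MIG idea concrete, here is a minimal planning sketch. The profile names and sizes follow NVIDIA's published A100 80GB MIG profiles (e.g. `1g.10gb` means one compute slice and 10 GB of memory), but treat them as illustrative; on real hardware the available profiles and placement rules should be checked with `nvidia-smi mig -lgip`.

```python
# Illustrative MIG capacity planning for a single A100 80GB GPU.
# Profile names/sizes mirror NVIDIA's published A100 80GB profiles;
# verify against `nvidia-smi mig -lgip` on real hardware.
MIG_PROFILES = {
    "1g.10gb": {"compute_slices": 1, "memory_gb": 10},
    "2g.20gb": {"compute_slices": 2, "memory_gb": 20},
    "3g.40gb": {"compute_slices": 3, "memory_gb": 40},
    "7g.80gb": {"compute_slices": 7, "memory_gb": 80},
}
TOTAL_SLICES = 7  # an A100 exposes at most 7 compute slices

def fits(requested: list) -> bool:
    """Check whether a list of MIG profiles fits on one A100 80GB
    by compute-slice count (real placement rules are stricter)."""
    used = sum(MIG_PROFILES[p]["compute_slices"] for p in requested)
    return used <= TOTAL_SLICES

# Seven independent 1g.10gb instances use all 7 slices:
print(fits(["1g.10gb"] * 7))                     # True
print(fits(["3g.40gb", "3g.40gb", "2g.20gb"]))   # False: 8 slices > 7
```

This is only a capacity check; actual instance creation is done through `nvidia-smi` or NVML, and the driver enforces additional placement constraints.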

Note: In this article, we focus on NVIDIA HGX with the A100 GPU, which is intended primarily for AI and HPC workloads. There is a newer version of the HGX platform, based on the NVIDIA H100 GPU, which offers higher performance for the same classes of workloads. Refer to the data sheet for HGX H100.

This is part of a series of articles about NVIDIA A100.


NVIDIA HGX A100 Features

1. Scalable GPU Architecture

At the heart of the NVIDIA HGX A100 is its scalable GPU architecture. It is built on the NVIDIA Ampere architecture, which, according to NVIDIA, represents the biggest generational performance improvement in the company's history, and it is designed to scale efficiently to the world's largest AI and high-performance computing workloads.

The Ampere architecture introduces several groundbreaking features. For instance, its third-generation Tensor Cores provide significant acceleration for AI workloads, with up to 10X higher performance than the previous architecture. It also features a new streaming multiprocessor (SM) design that NVIDIA states delivers twice the FP32 throughput (the number of single-precision floating point operations performed per second) of its predecessor.

2. High-Bandwidth Memory (HBM)

Another important feature of the HGX A100 is its use of high-bandwidth memory (HBM). HBM is a memory architecture that allows much higher bandwidth than traditional DDR memory. The NVIDIA HGX A100 uses HBM2e memory, delivering a world-first 2 terabytes per second (TB/s) of memory bandwidth. This massive bandwidth is critical for AI and high-performance computing workloads, which often require moving vast amounts of data quickly.
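A back-of-the-envelope calculation shows why this matters. The sketch below compares the time to stream an 80 GB working set once at the platform's ~2 TB/s aggregate HBM2e bandwidth versus a single DDR4-3200 channel (~25.6 GB/s, used here purely as an illustrative point of comparison; both figures are peak rates, not sustained throughput):

```python
# Back-of-the-envelope: time to stream an 80 GB working set once.
# Both bandwidths are peak figures; real sustained throughput is lower.
HBM2E_BW_GBPS = 2000   # ~2 TB/s aggregate on HGX A100 (per NVIDIA)
DDR4_BW_GBPS = 25.6    # one DDR4-3200 channel, illustrative only

def stream_time_s(data_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to read a data set once at a given peak bandwidth."""
    return data_gb / bandwidth_gbps

print(f"HBM2e: {stream_time_s(80, HBM2E_BW_GBPS):.3f} s")   # 0.040 s
print(f"DDR4 : {stream_time_s(80, DDR4_BW_GBPS):.3f} s")    # 3.125 s
```

For bandwidth-bound workloads such as large matrix multiplies or data-frame scans, this roughly 80X gap in peak bandwidth translates directly into wall-clock time.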

3. Integrated AI Software Stack

NVIDIA HGX A100 is not just a hardware product; it includes software as well. The platform comes with an integrated AI software stack, featuring CUDA, cuDNN, and TensorRT. These software components are optimized to deliver maximum performance for AI workloads.

CUDA is a parallel computing platform and API model that allows developers to use NVIDIA GPUs for general purpose processing. cuDNN is a GPU-accelerated library for deep neural networks, providing primitives for the design of deep neural network models. TensorRT is a high-performance deep learning inference optimizer and runtime.

4. Customization and Configuration Options

The NVIDIA HGX A100 offers a high degree of customization and configuration options. With its MIG technology, it can be partitioned into as many as seven GPU instances, each with its own high-bandwidth memory, cache, and compute cores.

In addition, NVIDIA offers different HGX A100 configurations to suit different computational demands. Organizations can choose from 4-GPU, 8-GPU, or 16-GPU systems, each offering varying degrees of computational power, memory, and bandwidth.
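Assuming A100 80GB modules throughout, the aggregate GPU memory of these configurations can be tallied directly (compute and interconnect bandwidth scale with GPU count as well, but memory is the simplest to show):

```python
# Aggregate GPU memory for the HGX A100 configurations mentioned
# above, assuming A100 80GB modules throughout.
GPU_MEMORY_GB = 80

def total_memory_tb(num_gpus: int) -> float:
    """Aggregate GPU memory in TB for an HGX A100 configuration."""
    return num_gpus * GPU_MEMORY_GB / 1000

for n in (4, 8, 16):
    print(f"{n:2d}-GPU HGX A100: {total_memory_tb(n):.2f} TB of GPU memory")
# The 16-GPU total (1.28 TB) corresponds to the "up to 1.3 TB"
# figure NVIDIA quotes for the platform.
```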

NVIDIA HGX A100 Architecture

GPU Architecture

The NVIDIA HGX A100, featuring A100 80GB GPUs, is based on NVIDIA's Ampere architecture, which represents the state of the art in GPU performance. It incorporates the third generation of Tensor Cores, which are specialized hardware for AI workloads, along with third-generation NVLink and second-generation NVSwitch technologies, which together enable high-speed GPU-to-GPU interconnect.

The NVIDIA HGX A100 is designed to scale from an individual GPU to large data centers with thousands of GPUs. This scalability is a critical feature as it allows organizations to start small and scale their infrastructure as their requirements grow. The architecture is also specifically designed to support AI, HPC, and analytics workloads.

Tensor Cores and Capabilities

Tensor Cores are a unique feature of NVIDIA GPUs that accelerate matrix operations, which are at the heart of AI and deep learning computations. The NVIDIA HGX A100 incorporates the third generation of Tensor Cores, which provide significant performance improvements over previous generations. These Tensor Cores are capable of delivering up to 312 teraflops of FP16 (half-precision floating point) performance, or 156 teraflops of TF32 (TensorFloat-32) performance.
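These peak rates give a theoretical lower bound on compute time. As a sketch, the standard 2·m·n·k FLOP count for a matrix multiply, divided by the quoted FP16 peak, bounds how fast a large matmul could possibly run (real kernels achieve only a fraction of peak, so actual times are higher):

```python
# Theoretical best-case time for a large matrix multiply at the
# quoted FP16 Tensor Core peak. Real kernels reach only a fraction
# of peak throughput, so treat this strictly as a lower bound.
FP16_PEAK_TFLOPS = 312

def matmul_flops(m: int, n: int, k: int) -> int:
    """An (m x k) @ (k x n) multiply costs about 2*m*n*k FLOPs."""
    return 2 * m * n * k

flops = matmul_flops(8192, 8192, 8192)
lower_bound_ms = flops / (FP16_PEAK_TFLOPS * 1e12) * 1e3
print(f"{flops / 1e12:.2f} TFLOPs -> at least {lower_bound_ms:.2f} ms at peak")
```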

In addition to raw performance, the Tensor Cores in the NVIDIA HGX A100 offer advanced capabilities that help to accelerate AI workloads. These include mixed-precision computing, which allows for higher performance and accuracy in AI computations, and sparsity, which reduces the computational load of AI models by focusing on the relevant data in sparse datasets.

NVIDIA DGX Software

NVIDIA DGX systems come with a comprehensive suite of software tools designed to help you get the most out of the system. This includes the DGX software stack, a set of software tools for managing and monitoring the system, optimizing workloads, and more.

The DGX software stack includes tools for managing and monitoring the system's resources, optimizing workload performance, and providing a secure and stable operating environment. It also includes pre-integrated and optimized AI software libraries and frameworks, making it easy to develop and train AI models.

What Is NVIDIA HGX-2?

It is worthwhile to briefly discuss the predecessor of the HGX A100, known as NVIDIA HGX-2. The HGX-2 is a powerful platform designed for AI and HPC workloads, which combines 16 Tesla V100 32GB GPUs into a single high-performance computing system.

Like the NVIDIA HGX A100, the HGX-2 is based on NVIDIA's GPU architecture and incorporates Tensor Cores along with NVLink and NVSwitch technologies. While the HGX-2 was a significant step forward in AI and HPC computing, the NVIDIA HGX A100 takes it a step further, offering higher performance, better scalability, and tighter software integration.

NVIDIA HGX vs. DGX

Alongside HGX A100 and HGX-2, NVIDIA also offers the DGX line of integrated systems, which are pre-built, fully integrated HPC solutions that come with NVIDIA’s powerful GPUs and high-speed interconnects. It’s important to understand how the HGX A100 compares to the DGX offering.

DGX systems, such as the DGX A100, are designed to be easy to deploy and use. They come with all the necessary hardware and software, including NVIDIA’s GPU-accelerated software stack and deep learning software development kit (SDK). This makes them ideal for organizations that want to start using HPC and AI quickly and easily, without having to worry about building and configuring their own systems.

On the other hand, the HGX A100 is more flexible and customizable. It is a hardware platform that can be used to build custom HPC and AI systems, tailored to the specific needs of an organization. It provides the raw computing power and high-speed interconnects, but it is up to the organization to choose the rest of the hardware and software. This makes the HGX A100 ideal for organizations that have specific, unique requirements for their HPC and AI workloads.

Learn more in our detailed guide to NVIDIA HGX vs. DGX

NVIDIA HGX Use Cases

Cloud Data Centers and Cloud Services

The powerful performance and scalability of the NVIDIA HGX A100 make it an ideal solution for cloud service providers who need to deliver high-performance computing capabilities to their customers.

With the NVIDIA HGX A100, cloud service providers can offer a wide range of services, from AI training and inference to data analytics and scientific simulations. The scalability of the NVIDIA HGX A100 also allows cloud service providers to efficiently manage their infrastructure, scaling up or down as customer demand changes.

AI Research and Development

The NVIDIA HGX A100 is also extensively used in AI research and development. Its powerful Tensor Cores and advanced capabilities make it an ideal platform for developing and training new AI models.

In addition to AI research, the NVIDIA HGX A100 is also used in the development of AI applications. Developers can use the platform to train their AI models and optimize them for deployment, reducing the time to market for new AI applications.

Big Data Analytics

The HGX platform's high-performance computing capabilities enable it to process large volumes of data quickly, making it an ideal solution for big data analytics.

Organizations can leverage the NVIDIA HGX A100 to gain insights from their data, helping them to make informed decisions and drive business growth. The platform's scalability also allows organizations to handle growing data volumes efficiently, making it a future-proof solution for big data analytics.

Scientific Simulations and Computational Tasks

The NVIDIA HGX A100 is also used in scientific simulations and computational tasks. Its high-performance computing capabilities enable it to handle complex simulations and computations, making it an ideal solution for scientific research and engineering applications.

Researchers and engineers can leverage the NVIDIA HGX A100 to accelerate their work, leading to faster discoveries and innovations. The platform's scalability makes it possible to handle larger and more complex simulations and computations, making it a versatile and powerful tool for scientific research and engineering.

NVIDIA HGX with Run:ai

Run:ai automates resource management and orchestration for machine learning infrastructure, including HGX servers and workstations. With Run:ai, you can automatically run as many compute-intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform