NVIDIA DGX GH200

Massive Memory for Large AI Models

What Is NVIDIA DGX GH200?

The NVIDIA DGX GH200 is a high-performance, AI-optimized supercomputer. The system connects 256 NVIDIA Grace Hopper Superchips with NVLink so they can operate as a single giant GPU, providing 144 TB of shared memory, nearly 500X more than the previous-generation NVIDIA DGX A100. This massive memory capacity allows the system to support very large models for workloads such as recommender systems, generative AI, and graph analytics.

The DGX GH200 is part of NVIDIA’s DGX series, a range of AI supercomputers designed to handle the most demanding AI and high-performance computing (HPC) tasks.

This is part of a series of articles about NVIDIA A100.

In this article:

  • DGX GH200 Features
  • DGX GH200 Architecture
  • DGX GH200 Software
  • Key Use Cases for NVIDIA DGX GH200
  • NVIDIA DGX with Run:ai

DGX GH200 Features  

1. Giant Memory for Giant Models

The NVIDIA DGX GH200 is designed to meet the growing demands of AI models that require very large memory. The system leverages the Grace CPU and Hopper GPU, integrated into the Grace Hopper Superchip. The Grace CPU provides up to 480 GB of LPDDR5X memory, while the Hopper GPU offers up to 96 GB of HBM3 memory. Across the 256 Grace Hopper Superchips in the GH200 system, this adds up to a shared memory space of 144 TB, which can dramatically accelerate training and inference for very large models.
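
As a quick sanity check, the headline 144 TB figure follows directly from the per-superchip memory sizes. A back-of-the-envelope sketch in Python (the exact usable capacity depends on the shipped configuration):

# Back-of-the-envelope check of the DGX GH200 shared memory pool.
NUM_SUPERCHIPS = 256
CPU_LPDDR5X_GB = 480  # Grace CPU memory per superchip
GPU_HBM3_GB = 96      # Hopper GPU memory per superchip

total_gb = NUM_SUPERCHIPS * (CPU_LPDDR5X_GB + GPU_HBM3_GB)
print(f"Total shared memory: {total_gb} GB = {total_gb / 1024:.0f} TB")
# -> Total shared memory: 147456 GB = 144 TB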

2. Power-Efficient Computing

The DGX GH200 is designed for power efficiency. Each Grace Hopper Superchip combines a CPU and a GPU in one unit. The Grace CPU uses LPDDR5X memory, which consumes only 1/8 the power of traditional DDR5 system memory while delivering higher memory bandwidth. In addition, because the CPU and GPU are packaged together, the NVLink-C2C interconnect between them consumes 1/5 the power of traditional PCIe while providing as much as 7X more bandwidth.
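
The 7X bandwidth claim is straightforward to sanity-check against public figures: NVLink-C2C provides 900 GB/s of total bandwidth, versus roughly 128 GB/s for a PCIe Gen5 x16 link. A rough comparison (exact PCIe throughput varies with encoding overhead and implementation):

# Rough comparison of NVLink-C2C vs. PCIe Gen5 x16 bandwidth.
NVLINK_C2C_GBPS = 900     # total bidirectional bandwidth
PCIE_GEN5_X16_GBPS = 128  # ~64 GB/s per direction

print(f"NVLink-C2C advantage: ~{NVLINK_C2C_GBPS / PCIE_GEN5_X16_GBPS:.1f}X")
# -> NVLink-C2C advantage: ~7.0X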

3. Integrated and Ready to Run

The NVIDIA DGX GH200 comes pre-installed with all the software and drivers needed to get started, making it easy to set up and start running your workloads.

The DGX GH200 also comes with NVIDIA’s DGX software stack, a comprehensive suite of software tools designed to help you get the most out of your system. This includes tools for managing and monitoring your system, optimizing your workloads, and more.

DGX GH200 Architecture

Grace Hopper Superchips

The NVIDIA Grace Hopper Superchip is an accelerated processor designed for the data center. It combines the Grace CPU, which comes equipped with up to 480 GB of LPDDR5X memory, and the Hopper GPU, which has a memory capacity of up to 96 GB of HBM3.

The Superchip features an NVLink-C2C interconnect with a data transfer speed of 900 GB/s between the CPU and GPU, and a high-speed NVLink network that scales to up to 256 GPUs. Each superchip also provides four PCIe Gen5 x16 interfaces with aggregate speeds of up to 512 GB/s, and 18 NVLink 4 links with total transfer rates of up to 900 GB/s.

Compute InfiniBand Fabric

The NVIDIA DGX GH200 features an advanced compute fabric based on InfiniBand. This high-speed interconnect technology provides a fast and efficient way to transfer data between the system's components, helping to maximize performance and efficiency.

InfiniBand fabric offers many advantages over traditional interconnect technologies. It provides high bandwidth and low latency, making it ideal for data-intensive tasks like AI and HPC workloads. It also supports advanced features like remote direct memory access (RDMA), which allows for direct memory transfers between systems without the need for CPU intervention.
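
In practice, applications rarely program RDMA directly; frameworks reach the fabric through communication libraries such as NCCL, which use RDMA and GPUDirect under the hood. A minimal PyTorch sketch of a cross-GPU all-reduce (the launch details are assumptions; on a real system you would start it with a launcher such as torchrun):

# Minimal distributed all-reduce; the NCCL backend transparently uses
# RDMA/InfiniBand when available.
# Example launch: torchrun --nproc_per_node=2 allreduce.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # reads env vars set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a tensor; all_reduce sums them across the fabric.
    x = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: first element after sum = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()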

Storage Fabric

In addition to its advanced computing fabric, the DGX GH200 also features a high-speed storage fabric. This storage fabric is designed to provide fast, efficient access to the system's storage resources, helping to maximize performance and data throughput.

The storage fabric is built on top of NVMe technology, which provides fast and efficient access to solid-state storage devices. This allows for high-speed data transfers, reducing latency and improving overall system performance.
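
For workloads that stream data from NVMe directly into GPU memory, NVIDIA's GPUDirect Storage (cuFile) path can bypass the CPU bounce buffer entirely. A hedged sketch using the kvikio Python bindings (assumes kvikio and CuPy are installed; the file path is hypothetical):

# Read a file from NVMe straight into GPU memory via kvikio (cuFile bindings).
import cupy as cp
import kvikio

buf = cp.empty(1_000_000, dtype=cp.uint8)   # destination buffer on the GPU
f = kvikio.CuFile("/data/sample.bin", "r")  # hypothetical NVMe-backed file
n = f.read(buf)                             # DMA from storage into GPU memory
f.close()
print(f"read {n} bytes directly into device memory")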

DGX GH200 Software

The NVIDIA DGX GH200 comes with a comprehensive suite of software tools designed to help you get the most out of your system. This includes the DGX software stack, a set of software tools for managing and monitoring your system, optimizing your workloads, and more.

The DGX software stack includes tools for managing and monitoring the system's resources, optimizing workload performance, and providing a secure and stable operating environment. It also includes built-in AI software libraries and frameworks, pre-integrated and optimized, making it easy to develop and train AI models.
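
Because the frameworks come pre-integrated, a first sanity check on a freshly provisioned system can be as simple as confirming that PyTorch sees the GPUs (a minimal sketch; the exact framework versions depend on the installed DGX software image):

# Quick check that the pre-installed stack sees the hardware.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")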

Fabric Management

The DGX GH200 also features an advanced fabric management system. This system is designed to manage and optimize the performance of the system's InfiniBand fabric, helping to ensure high performance and efficiency.

The fabric management system provides a range of capabilities, including fabric monitoring, performance optimization, and fault management. It also supports advanced features like adaptive routing, which allows for dynamic rerouting of data to avoid congestion and maximize performance.

Learn more in our detailed guide to NVIDIA DGX Station

Key Use Cases for NVIDIA DGX GH200

Deep Learning Recommendation Models (DLRM)

One of the primary use cases for the NVIDIA DGX GH200 is running deep learning recommendation models (DLRM). These models are used extensively in industries such as retail, entertainment, and social media, making them an integral part of our digital lives.

The NVIDIA DGX GH200, with its advanced GPU architecture and high-speed interconnects, provides an ideal platform for running DLRM workloads. It offers a high level of parallelism that can handle large-scale recommendation systems, allowing businesses to deliver personalized content and experiences to their users in real time.

Moreover, the Hopper GPUs in the DGX GH200 include dedicated AI acceleration hardware, Tensor Cores and a Transformer Engine, which significantly speeds up the training of these deep learning models. This not only reduces time to market but can also improve the accuracy of predictions.
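
Conceptually, a DLRM pairs large embedding tables for sparse categorical features with an MLP for dense features; the embedding tables are what grow to terabytes in production, which is where the shared memory pool pays off. A toy PyTorch sketch (not NVIDIA's DLRM implementation; all sizes are illustrative):

# Toy recommendation model: embedding tables for sparse features + MLP.
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_users, num_items, dense_dim, embed_dim=32):
        super().__init__()
        # In production DLRMs these tables can reach terabytes.
        self.user_emb = nn.Embedding(num_users, embed_dim)
        self.item_emb = nn.Embedding(num_items, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim + dense_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_ids, item_ids, dense):
        x = torch.cat(
            [self.user_emb(user_ids), self.item_emb(item_ids), dense], dim=-1
        )
        return torch.sigmoid(self.mlp(x))  # predicted click-through probability

model = TinyDLRM(num_users=10_000, num_items=50_000, dense_dim=8)
scores = model(torch.tensor([1, 2]), torch.tensor([3, 4]), torch.randn(2, 8))
print(scores.shape)  # torch.Size([2, 1])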

Advanced HPC Models

HPC models are used in various scientific, engineering, and business applications that require heavy computational power. These include climate modeling, aerospace simulations, genomic sequencing, financial modeling, and more.

The DGX GH200, with its array of powerful GPUs, provides the necessary computational horsepower to run these advanced HPC models. It uses NVIDIA H100 Tensor Core GPUs, integrated into the Grace Hopper Superchips, which are designed to deliver high performance for both AI and HPC workloads. These GPUs are capable of performing trillions of calculations per second, enabling the DGX GH200 to tackle the most demanding HPC tasks.

Furthermore, the DGX GH200 also includes advanced networking capabilities that allow for fast data transfers between the GPUs, reducing the latency and improving the overall performance of the HPC models. This makes it an ideal choice for running large-scale simulations and data-intensive workloads.
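
To give a flavor of how such workloads map onto GPUs, here is a minimal stencil computation, a Jacobi relaxation loop for a 2D heat equation, in PyTorch (a sketch only; real HPC codes use domain-specific solvers and multi-GPU domain decomposition):

# Jacobi relaxation steps for a 2D heat equation on the GPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
grid = torch.rand(1024, 1024, device=device)

def jacobi_step(u):
    # Replace each interior point with the average of its four neighbors.
    out = u.clone()
    out[1:-1, 1:-1] = 0.25 * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
    )
    return out

for _ in range(100):
    grid = jacobi_step(grid)
print("mean temperature:", grid.mean().item())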

Generative AI

Generative AI is an exciting field of artificial intelligence that involves creating new content from scratch. This includes generating images, music, text, and even videos.

With its advanced GPU architecture, the DGX GH200 provides the necessary computational power to train generative models effectively. Its GPUs also include dedicated AI acceleration hardware that can speed up the training process, allowing for faster iterations and improvements.

Moreover, the DGX GH200 also supports popular AI frameworks, which makes it easier to develop and deploy these generative models. This, combined with its high-speed networking and storage capabilities, makes it an excellent platform for running generative AI workloads.
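
As a small illustration of that framework support, generating text with a pre-trained model takes only a few lines with Hugging Face Transformers (assumes the transformers package is installed; gpt2 here stands in for the much larger models you would train or fine-tune on a DGX GH200):

# Minimal text generation with a pre-trained model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=0)  # GPU 0
out = generator("Large shared GPU memory makes it possible to",
                max_new_tokens=30)
print(out[0]["generated_text"])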

Speech AI

Speech AI involves tasks such as speech recognition, synthesis, and understanding, which are critical for building voice assistants and other conversational AI applications.

The DGX GH200, with its powerful GPUs and AI accelerators, provides the necessary computational power to run these speech AI workloads. It supports various AI frameworks, making it easier to develop and deploy these applications.

Moreover, the DGX GH200 also includes advanced networking capabilities that allow for fast data transfers, which is crucial for delivering real-time responses in conversational AI applications. This makes it an ideal choice for running speech AI workloads.
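
Speech recognition follows the same pattern. For example, transcribing an audio file with an open ASR model (a hedged sketch; the model choice and audio path are illustrative assumptions):

# Minimal automatic speech recognition with a pre-trained model.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-tiny", device=0)  # GPU 0
result = asr("speech_sample.wav")  # hypothetical audio file
print(result["text"])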

Natural Language Processing (NLP) and Large Language Models (LLMs)

The NVIDIA DGX GH200 also excels in running natural language processing (NLP) and large language models (LLMs). These are critical for understanding and generating human-like text, which is essential for various applications such as chatbots, translation services, and content generation.

The DGX GH200, with its powerful GPUs and AI accelerators, provides the necessary computational power to train these NLP models and LLMs.

Furthermore, the DGX GH200 also includes high-speed networking and storage capabilities, which are crucial for handling the large datasets typically involved in NLP and LLM tasks.
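
The memory pressure from LLMs is easy to quantify: the weights alone of a model with hundreds of billions of parameters exceed any single GPU's memory, before gradients and optimizer state are counted. A back-of-the-envelope sketch (overheads vary by training setup):

# Rough memory footprint of training a large language model.
params = 530e9       # e.g. a ~530B-parameter model
bytes_per_param = 2  # FP16 weights

weights_tb = params * bytes_per_param / 1e12
# Adam-style mixed-precision training commonly needs ~8x the weight memory
# (weights + gradients + FP32 master weights + optimizer moments).
training_tb = weights_tb * 8

print(f"weights alone: ~{weights_tb:.1f} TB")
print(f"training state: ~{training_tb:.1f} TB (vs. 96 GB of HBM3 per GPU)")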

Learn more in our detailed guide to NVIDIA deep learning GPU

NVIDIA DGX with Run:ai

Run:ai automates resource management and orchestration for machine learning infrastructure, including DGX servers and workstations. With Run:ai, you can automatically run as many compute-intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform