DGX is a line of servers and workstations built by NVIDIA for running large, demanding machine learning and deep learning workloads on GPUs. A single DGX system delivers between 1 and 5 PetaFLOPS of computing power, depending on the model. DGX also provides advanced technology for interconnecting GPUs and parallelizing work across thousands of GPU cores.
Beyond the powerful hardware they provide, DGX systems come out of the box with an optimized operating system and a complete pre-integrated environment for running deep learning projects. They provide a containerized software architecture that lets data scientists easily deploy the deep learning frameworks and management tools they need with minimal setup or configuration.
NVIDIA DGX-1 is an integrated deep learning system built to run demanding deep learning workloads. It provides GPU computing power of 1 PetaFLOPS (1 quadrillion floating-point operations per second).
It provides the following hardware features:
NVIDIA DGX-1 is built on a hardware architecture with several key components designed specifically for large-scale deep learning workloads:
DGX-1 comes with pre-installed, pre-configured deep learning software, enabling practitioners to get up and running quickly.
The software architecture includes:
Containers available for DGX-1 include:
The containerized architecture means that each deep learning framework runs in isolation and can have its own dependencies and its own versions of core libraries. Users can easily deploy the frameworks or tools they need for development. NVIDIA maintains the container images and updates them when new versions of frameworks are released.
Because all applications are managed as containers, the operating system stays clean, and NVIDIA can deliver operating system and driver updates without interference.
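As a rough illustration of the container workflow described above, the following sketch builds (but does not run) a command for starting a framework container with GPU access. The registry path `nvcr.io/nvidia/pytorch`, the `<release-tag>` placeholder, the mount path, and the helper name `ngc_run_command` are assumptions made for this example. The `--gpus` flag assumes a recent Docker release with the NVIDIA container runtime installed; older DGX software releases may use a different launcher.

```python
import shlex


def ngc_run_command(image, tag, workdir="/workspace/project", gpus="all"):
    """Build (without executing) a `docker run` command for a framework container.

    All arguments are illustrative; substitute the image and tag you actually use.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", gpus,                # expose GPUs to the container
        "-v", f"{workdir}:{workdir}",  # mount the project directory
        "-w", workdir,                 # start in the project directory
        f"{image}:{tag}",
    ]


# Hypothetical image and tag, used only to show the shape of the command.
cmd = ngc_run_command("nvcr.io/nvidia/pytorch", "<release-tag>")
print(shlex.join(cmd))
```

Because each framework lives in its own image, switching frameworks or versions is a matter of changing the image and tag, with no change to the host operating system.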
The DGX-2 has a similar architecture to the DGX-1, but offers more computing power. With 16 Tesla V100 GPUs, it delivers 2 PetaFLOPS. According to NVIDIA, in a traditional x86 architecture, training ResNet-50 at the same speed as DGX-2 would require 300 servers with dual Intel Xeon Gold CPUs, which would cost more than $2.7 million.
Hardware features include:
DGX A100 is NVIDIA's third-generation dedicated AI system. It provides 5 PetaFLOPS of computing power in a single system.
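To put the headline numbers for the three generations side by side, the sketch below divides each system's quoted peak by its GPU count. The PetaFLOPS figures and the 16 GPUs of the DGX-2 come from this article; the count of 8 GPUs for DGX-1 and DGX A100 is an assumption added for the example. These are peak figures, so sustained throughput on real workloads will be lower.

```python
# Peak figures quoted in this article, in PetaFLOPS per system.
# GPU counts: 16 for DGX-2 is stated in the text; 8 for DGX-1 and
# DGX A100 is an assumption made for this sketch.
SYSTEMS = {
    "DGX-1": {"gpus": 8, "petaflops": 1.0},
    "DGX-2": {"gpus": 16, "petaflops": 2.0},
    "DGX A100": {"gpus": 8, "petaflops": 5.0},
}


def teraflops_per_gpu(name):
    """Return the implied peak teraFLOPS per GPU for a named system."""
    spec = SYSTEMS[name]
    return spec["petaflops"] * 1000.0 / spec["gpus"]


for name in SYSTEMS:
    print(f"{name}: {teraflops_per_gpu(name):.0f} teraFLOPS per GPU (peak)")
```

Under these assumptions, the per-GPU peak is the same for DGX-1 and DGX-2, so the DGX-2 gain comes from doubling the number of GPUs, while the DGX A100 gain comes from a faster GPU.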
There are two models of the DGX A100, one with 320GB of total GPU memory and the other with 640GB. The following table summarizes the hardware features.
DGX A100 is a powerful system on its own, but a main focus of its design is to enable massive scalability. It can be used to build large AI infrastructure clusters, using the DGX SuperPOD deployment pattern (see below). DGX A100 supports elastic scalability, enabling users to:
Massively parallel GPU workloads depend on very high I/O performance. To this end, DGX A100 provides:
This makes it possible to use DGX A100 for the most demanding deep learning use cases, such as conversational AI and image and video classification. These workloads can be distributed across hundreds or thousands of nodes, as needed.
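A back-of-the-envelope sketch of what scaling out means in practice is shown below. It totals GPUs and peak compute for a cluster of DGX A100 systems and splits a global training batch evenly across all GPUs, as in simple data-parallel training. The 5 PetaFLOPS per system is the figure quoted in this article; 8 GPUs per system, the function names, and the batch size are assumptions made for the example.

```python
GPUS_PER_SYSTEM = 8         # assumption: not stated in this article
PETAFLOPS_PER_SYSTEM = 5.0  # peak figure quoted in this article


def cluster_totals(num_systems):
    """Return total GPU count and aggregate peak PetaFLOPS for a cluster."""
    return {
        "gpus": num_systems * GPUS_PER_SYSTEM,
        "petaflops": num_systems * PETAFLOPS_PER_SYSTEM,
    }


def per_gpu_batch(global_batch, num_systems):
    """Split a global batch evenly across every GPU in the cluster."""
    gpus = num_systems * GPUS_PER_SYSTEM
    if global_batch % gpus != 0:
        raise ValueError("global batch must be divisible by the GPU count")
    return global_batch // gpus


# 140 systems is the SuperPOD upper limit cited in this article.
print(cluster_totals(140))
print(per_gpu_batch(global_batch=35840, num_systems=140))
```

Aggregate peak figures scale linearly in this sketch; real clusters depend on interconnect and storage throughput to approach that scaling.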
Like previous generations, DGX A100 uses a containerized architecture:
The NGC Private Registry provides:
The DGX Station is a lightweight version of the third-generation DGX A100 for developers and small teams. Its Tensor Core architecture enables the A100 GPUs to use mixed-precision multiply-accumulate operations, which significantly accelerates training for large neural networks. There are two DGX Station models, one with 160GB of GPU memory and the other with 320GB. The following table summarizes the hardware features.
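The mixed-precision multiply-accumulate idea mentioned above can be illustrated in plain Python: inputs are stored in half precision (FP16), but the running sum is kept in a wider format (FP32). This is a simplified software model using the standard `struct` module to emulate rounding, not a description of how Tensor Cores are implemented; all function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product where products and the accumulator are all FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc


def dot_mixed_precision(a, b):
    """Dot product with FP16 inputs and an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + to_fp16(x) * to_fp16(y))
    return acc


n = 10_000
a = [0.1] * n
b = [0.1] * n

reference = n * to_fp16(0.1) ** 2  # computed in double precision
fp16_only = dot_fp16_accumulate(a, b)
mixed = dot_mixed_precision(a, b)

print("reference:", reference)
print("FP16 accumulator:", fp16_only)
print("FP32 accumulator:", mixed)
```

With an FP16 accumulator, the running sum eventually grows so large that each new product is rounded away, and the result stalls far below the reference. Keeping the accumulator in FP32 preserves accuracy while the inputs still use the smaller, faster FP16 format.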
NVIDIA DGX SuperPOD is a full-stack computing platform including compute, storage, networking, and tools to support data science pipelines. It comes with an implementation service from NVIDIA, to assist with deployment and ongoing maintenance.
SuperPOD can support up to 140 DGX A100 systems, combined into one AI infrastructure cluster. The cluster’s capabilities are as follows:
The NVIDIA DGX SuperPOD solution includes the following software tools that support the AI cluster, in addition to the AI tooling provided on each DGX A100 system:
Run:AI automates resource management and orchestration for machine learning infrastructure, including DGX servers and workstations. With Run:AI, you can automatically run as many compute-intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.