What is NVIDIA DGX?
DGX is a line of servers and workstations built by NVIDIA to run large, demanding machine learning and deep learning workloads on GPUs. A single DGX system delivers between 1 and 5 PetaFLOPS of computing power, and provides advanced interconnect technology for linking GPUs and enabling massive parallelization across thousands of GPU cores.
Beyond the powerful hardware they provide, DGX systems come out of the box with an optimized operating system and a complete pre-integrated environment for running deep learning projects. They provide a containerized software architecture that lets data scientists easily deploy the deep learning frameworks and management tools they need with minimal setup or configuration.
Related content: read our guide to NVIDIA A100 GPUs
In this article, you will learn about:
- NVIDIA DGX-1
- NVIDIA DGX-2
- NVIDIA DGX A100
- NVIDIA DGX Station A100
- NVIDIA DGX SuperPOD
- NVIDIA DGX with Run:AI
NVIDIA DGX-1—First Generation DGX Server
NVIDIA DGX-1 is an integrated deep learning workstation with a large computing capacity, which can be used to run demanding deep learning workloads. It provides GPU computing power of 1 PetaFLOPS (1 quadrillion floating-point operations per second).
NVIDIA DGX-1 is built on a hardware architecture with several key components designed specifically for large-scale deep learning workloads:
- Physical enclosure—the DGX-1 system takes up three rack units (3U), which house its eight GPUs, two CPUs, power, cooling, networking, and an SSD file system cache.
- Hybrid cube-mesh NVLink network topology—NVLink is a high-bandwidth interconnect that lets GPUs communicate with each other at up to 300 GB/s (about 9x the bandwidth of PCIe 3.0 interconnects).
- Tesla V100 Page Migration Engine—this feature of the Tesla V100 GPU allows fast data sharing within the multi-GPU system, and between GPUs and system RAM.
- Multi-node capabilities—for clusters of DGX machines, DGX-1 provides fast InfiniBand networking and uses GPUDirect RDMA, which provides direct communication between NVIDIA GPUs and remote hosts.
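Interconnect bandwidth matters because multi-GPU training constantly synchronizes gradients across GPUs, typically with a ring all-reduce over NVLink. The following is a minimal pure-Python simulation of that communication pattern (no GPUs involved; the worker and chunk counts are illustrative):

```python
def ring_allreduce(data):
    """Simulate a ring all-reduce: each of n workers holds a vector of n
    chunks; after a reduce-scatter phase and an all-gather phase, every
    worker holds the elementwise sum. Purely illustrative of the traffic
    pattern NVLink accelerates."""
    n = len(data)
    chunks = [list(d) for d in data]

    # Reduce-scatter: after n-1 steps, worker i holds the fully reduced
    # chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, chunks[w][(w - step) % n])
                 for w in range(n)]
        for w, c, val in sends:
            chunks[(w + 1) % n][c] += val  # right neighbor accumulates

    # All-gather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, chunks[w][(w + 1 - step) % n])
                 for w in range(n)]
        for w, c, val in sends:
            chunks[(w + 1) % n][c] = val   # right neighbor copies

    return chunks
```

Each worker sends and receives only one chunk per step, which is why this pattern is bandwidth-bound and benefits directly from the NVLink topology described above.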
DGX-1 comes with pre-installed, pre-configured deep learning software, enabling practitioners to get up and running quickly.
The software architecture includes:
- Optimized, minimal operating systems and drivers
- Applications and SDKs needed for deep learning run in Docker
- NGC Private Registry holds certified versions of all Docker images
- Available containers include popular deep learning frameworks and the NVIDIA DIGITS application for training deep learning models
Containers available for DGX-1 include:
- Deep learning frameworks optimized to run on DGX
- NVIDIA CUDA Toolkit
- NVIDIA DIGITS, which simplifies multi-GPU deep learning training
- Multiple third-party tools
- Frameworks and tools are custom-tuned for high multi-GPU performance
The containerized architecture means that each deep learning framework runs separately and can have its own dependencies, including different versions of core libraries. Users can easily deploy the frameworks or tools they need for development. NVIDIA maintains the container images and updates them as new versions of the frameworks are released.
Because all applications are managed as containers, the operating system stays clean, and NVIDIA can deliver operating system and driver updates without interference.
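As a sketch of what this looks like in practice, the helper below composes a `docker run` command for an NGC framework container. The image name and tag are illustrative examples, not guaranteed current; check the NGC catalog for real tags.

```python
def ngc_run_command(image, tag, workdir="/workspace"):
    """Compose a `docker run` invocation for an NGC framework container.

    Assumes the NVIDIA Container Toolkit is installed on the host, which
    is what makes `--gpus all` work. The image and tag are illustrative.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",           # expose every GPU in the DGX system
        "--ipc=host",              # shared memory for multi-GPU data loaders
        "-v", f"{workdir}:{workdir}",
        f"nvcr.io/nvidia/{image}:{tag}",
    ]

# e.g. a PyTorch container (the tag is a hypothetical example):
cmd = ngc_run_command("pytorch", "24.01-py3")
```

Because the framework, CUDA libraries, and their dependencies all live inside the container, nothing here touches the host operating system.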
NVIDIA DGX-2—Second Generation DGX Server
The DGX-2 has a similar architecture to the DGX-1, but offers more computing power. With 16 Tesla V100 GPUs, it delivers 2 PetaFLOPS. According to NVIDIA, in a traditional x86 architecture, training ResNet-50 at the same speed as DGX-2 would require 300 servers with dual Intel Xeon Gold CPUs, which would cost more than $2.7 million.
Hardware features include:
NVIDIA DGX A100—Third Generation DGX Server
DGX A100 is NVIDIA's third generation dedicated AI system. It provides a massive 5 PetaFLOPS of computing power in one system.
There are two models of the DGX A100, one with 320GB of total GPU memory and the other with 640GB. The following table summarizes the hardware features.
DGX A100 is a powerful system on its own, but a main focus of its design is to enable massive scalability. It can be used to build large AI infrastructure clusters, using the DGX SuperPOD deployment pattern (see below). DGX A100 supports elastic scalability, enabling users to:
- Scale up by connecting up to thousands of DGX A100 systems together
- Scale down by splitting each GPU into up to seven separate GPU instances, each with its own cache, memory, and compute resources, using NVIDIA Multi-Instance GPU (MIG) technology
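The scale-down arithmetic is straightforward: with up to seven MIG instances per A100 GPU, one eight-GPU DGX A100 can be partitioned into as many as 56 isolated GPU instances. A trivial sketch:

```python
def max_mig_instances(num_gpus, slices_per_gpu=7):
    """Upper bound on concurrent GPU instances when every GPU is split
    into the A100 maximum of seven MIG instances."""
    return num_gpus * slices_per_gpu

# A single DGX A100 has eight A100 GPUs:
max_mig_instances(8)  # 56 isolated GPU instances
```

In practice the achievable count depends on the MIG profiles chosen, since larger instance profiles consume more than one slice.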
Massively parallel GPU workloads depend on very high I/O performance. To this end, DGX A100 provides:
- Next-generation NVLink—GPU-to-GPU bandwidth roughly 10x that of PCIe Gen4
- Next-generation NVSwitch—interconnecting all GPUs in the system
- Mellanox ConnectX-6 HDR InfiniBand—8 network adapters, each running at 200 Gb/s
- Magnum IO software SDK—makes it possible to distribute workloads across thousands of GPUs
This makes DGX A100 suitable for the most demanding deep learning use cases, such as conversational AI and image and video classification. These workloads can be distributed across hundreds or thousands of nodes as needed.
Like previous generations, DGX A100 uses a containerized architecture:
- A minimal operating system and driver install
- All applications and SDKs provisioned as containers, accessible through the NGC Private Registry
The NGC Private Registry provides:
- Containers enabling deep learning, machine learning, high performance computing (HPC), and deep learning orchestration with Kubernetes
- Applications with pretrained models, scripts, Kubernetes Helm charts
- Ability to store custom containers, model code, scripts and Helm charts and share them within the organization
NVIDIA DGX Station A100—Third Generation DGX Workstation
The DGX Station is a lightweight version of the third-generation DGX A100, intended for developers and small teams. Its Tensor Core architecture enables its A100 GPUs to perform mixed-precision multiply-accumulate operations, significantly accelerating training of large neural networks. There are two DGX Station models, one with 160GB of GPU memory and the other with 320GB. The following table summarizes the hardware features.
NVIDIA DGX SuperPOD—Multi-Node DGX System
NVIDIA DGX SuperPOD is a full-stack computing platform including compute, storage, networking, and tools to support data science pipelines. It comes with an implementation service from NVIDIA, to assist with deployment and ongoing maintenance.
SuperPOD can support up to 140 DGX A100 systems, combined into one AI infrastructure cluster. The cluster’s capabilities are as follows:
The NVIDIA SuperPOD solution includes the following software tools that support the AI cluster, in addition to the AI tooling provided on each DGX A100 system:
- DGX operating system based on Ubuntu Linux, optimized and tuned specifically for DGX hardware. The DGX OS includes certified GPU drivers, network software, NFS caching, NVIDIA data center GPU management (DCGM), GPU-enabled container runtime, CUDA SDK, and support for NVIDIA GPUDirect.
- Cluster management and orchestration tools—DGX POD management software for workload scheduling, and third-party certified cluster management and orchestration tools such as Run:AI tested to work on DGX POD racks.
- Kubernetes container orchestration—DGX POD management software runs on Kubernetes for fault tolerance and high availability. It provides automated network configuration (DHCP) and automated provisioning and updates of DGX OS and software over the network (PXE).
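On the Kubernetes side, workloads reach DGX GPUs through the NVIDIA device plugin's `nvidia.com/gpu` resource. A minimal pod manifest sketch (the pod name, image tag, and training script are illustrative assumptions, not part of the SuperPOD tooling):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job                               # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative NGC image tag
      command: ["python", "train.py"]           # hypothetical training script
      resources:
        limits:
          nvidia.com/gpu: 2                     # request two full GPUs via the NVIDIA device plugin
```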
NVIDIA DGX with Run:AI
Run:AI automates resource management and orchestration for machine learning infrastructure, including on DGX servers and workstations. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
- Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs,
- Distributed training on multiple GPU nodes to accelerate model training times,
- Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type,
- Visibility into workloads and resource utilization to improve user productivity.
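To make the fair-scheduling idea concrete, here is a toy pass that hands out free GPUs one at a time, always to the team currently holding the fewest. This is only an illustration of fair sharing; it is not Run:AI's actual algorithm, which adds quotas, over-quota weights, and preemption.

```python
def fair_share_schedule(free_gpus, queues):
    """Toy fair-share pass.

    `queues` maps each team to its number of queued 1-GPU jobs. Each free
    GPU goes to the team with the smallest current allocation that still
    has queued work. Illustrative only.
    """
    allocation = {team: 0 for team in queues}
    pending = dict(queues)
    for _ in range(free_gpus):
        candidates = [t for t in pending if pending[t] > 0]
        if not candidates:
            break  # no queued jobs left
        team = min(candidates, key=lambda t: allocation[t])
        allocation[team] += 1
        pending[team] -= 1
    return allocation

fair_share_schedule(8, {"vision": 6, "nlp": 3})
```

With eight free GPUs, the "nlp" team is capped by its own queue (three jobs), and the remaining GPUs flow to "vision" instead of sitting idle, which is the behavior a fair-share scheduler aims for.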
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:ai GPU virtualization platform.