NVIDIA DGX

What is NVIDIA DGX?

DGX is a line of servers and workstations built by NVIDIA that can run large, demanding machine learning and deep learning workloads on GPUs. A single DGX system provides between 1 and 5 PetaFLOPS of computing power. It also provides advanced technology for interlinking GPUs and enabling massive parallelization across thousands of GPU cores.

Beyond the powerful hardware they provide, DGX systems come out of the box with an optimized operating system and a complete pre-integrated environment for running deep learning projects. They provide a containerized software architecture that lets data scientists easily deploy the deep learning frameworks and management tools they need with minimal setup or configuration. 

Related content: read our guide to NVIDIA deep learning GPUs


NVIDIA DGX-1—First Generation DGX Server

NVIDIA DGX-1 is an integrated deep learning server with a large computing capacity, built to run demanding deep learning workloads. It provides GPU computing power of 1 PetaFLOPS (1 quadrillion floating-point operations per second).

It provides the following hardware features:

GPUs: 8x NVIDIA Tesla V100
GPU Memory: 256 GB total
CPU: Dual 20-core Intel Xeon E5-2698 v4, 2.2 GHz
CUDA Cores: 40,960
Tensor Cores: 5,120
Power: 3,500 W
System RAM: 512 GB 2,133 MHz DDR4 RDIMM
Networking: Dual 10 Gb Ethernet
Operating System: Canonical Ubuntu or Red Hat Enterprise Linux (RHEL)

Related content: read our guide to the best GPU for deep learning

Hardware Architecture

NVIDIA DGX-1 is built on a hardware architecture with several key components designed specifically for large-scale deep learning workloads:

  • Physical enclosure—the DGX-1 system takes up three rack units (3U), which include its eight GPUs, two CPUs, power, cooling, networking, and an SSD file system cache.
  • Hybrid cube-mesh NVLink network topology—NVLink is a high-bandwidth interconnect, making it possible for GPUs to communicate with each other with bandwidth of up to 300 GB/s (9x the speed of 3rd generation PCIe interconnects).  
  • Tesla V100 Page Migration Engine—this feature of the Tesla V100 GPU allows fast data sharing within the multi-GPU system, and between GPUs and system RAM.
  • Multi-node capabilities—for clusters of DGX machines, DGX-1 provides fast InfiniBand networking and uses GPUDirect RDMA, which provides direct communication between NVIDIA GPUs and remote hosts. 
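To get a feel for what the NVLink figure above means in practice, the following sketch estimates idealized transfer times. It assumes peak bandwidth is sustained and ignores latency and protocol overhead, so real transfers will be slower. The PCIe figure is not a measured value; it is simply derived from the "9x" comparison above, and the payload size is a hypothetical example.

```python
# Idealized transfer-time estimate based on the bandwidth figures quoted above.
NVLINK_GB_PER_S = 300.0                     # peak NVLink bandwidth from the text
PCIE_GEN3_GB_PER_S = NVLINK_GB_PER_S / 9.0  # implied by the "9x" comparison


def transfer_seconds(payload_gb, bandwidth_gb_per_s):
    """Return the time to move payload_gb at a sustained bandwidth (no overhead)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    payload_gb = 3.0  # hypothetical payload, e.g. a set of gradients to exchange
    print("NVLink:", transfer_seconds(payload_gb, NVLINK_GB_PER_S), "s")
    print("PCIe Gen3:", transfer_seconds(payload_gb, PCIE_GEN3_GB_PER_S), "s")
```

The estimate is only a lower bound on transfer time, but it shows why interconnect bandwidth matters when GPUs exchange data on every training step.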

Software Architecture

DGX-1 comes with pre-installed, pre-configured deep learning software, enabling practitioners to get up and running quickly. 

The software architecture includes:

  • An optimized, minimal operating system and drivers
  • Applications and SDKs needed for deep learning, which run in Docker containers
  • NGC Private Registry, which holds certified versions of all Docker images
  • Containers for popular deep learning frameworks and for the NVIDIA DIGITS application for training deep learning models

Containers available for DGX-1 include:

  • Deep learning frameworks optimized to run on DGX
  • NVIDIA CUDA Toolkit
  • NVIDIA DIGITS, which simplifies multi-GPU deep learning training
  • Multiple third-party tools
  • Frameworks and tools custom-tuned for high multi-GPU performance

The containerized architecture means that each deep learning framework runs separately and can have its own dependencies and its own versions of core libraries. Users can easily deploy the frameworks or tools they need for development. NVIDIA is responsible for maintaining container images, and updates them when new versions of frameworks are released.

Because all applications are managed as containers, the operating system stays clean, and NVIDIA can deliver operating system and driver updates without interference.
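As a rough illustration of the container workflow described above, the sketch below assembles a `docker run` command that exposes GPUs to a framework container. It makes several assumptions that are not in this article: that the host uses a Docker version supporting the `--gpus` flag with the NVIDIA container runtime installed, and that the image comes from NVIDIA's public NGC registry. The image tag is a placeholder and the helper name is hypothetical.

```python
import shlex

# Assumption: an NGC framework image. "<tag>" is a placeholder, not a real tag;
# replace it with a tag listed in the NGC catalog.
EXAMPLE_IMAGE = "nvcr.io/nvidia/pytorch:<tag>"


def build_run_command(image, gpus="all", args=()):
    """Return a `docker run` argument list that exposes GPUs to one container.

    gpus is passed to Docker's --gpus flag ("all" or a count such as "2").
    """
    return ["docker", "run", "--rm", "--gpus", gpus, image, *args]


if __name__ == "__main__":
    # Only prints the command; run it yourself on a machine with GPUs.
    command = build_run_command(EXAMPLE_IMAGE, args=["nvidia-smi"])
    print(shlex.join(command))
```

Because each framework lives in its own image, switching frameworks or versions only means changing the image name, not reinstalling anything on the host.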

NVIDIA DGX-2—Second Generation DGX Server

The DGX-2 has a similar architecture to the DGX-1, but offers more computing power. With 16 Tesla V100 GPUs, it delivers 2 PetaFLOPS. According to NVIDIA, in a traditional x86 architecture, training ResNet-50 at the same speed as DGX-2 would require 300 servers with dual Intel Xeon Gold CPUs, which would cost more than $2.7 million. 

Hardware features include:

GPUs: 16x NVIDIA Tesla V100
GPU Memory: 512 GB total
CPU: Dual 24-core Intel Xeon Platinum 8168, 2.7 GHz
CUDA Cores: 81,920
Tensor Cores: 10,240
Power: 10,000 W
System RAM: 1.5 TB
Networking: 8x 100 Gb/s InfiniBand/Ethernet
Operating System: Canonical Ubuntu or Red Hat Enterprise Linux (RHEL)
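Since DGX-1 and DGX-2 are both built on the Tesla V100, the system-wide totals in the two tables should reduce to the same per-GPU figures. The short sketch below checks this using only numbers copied from the tables above; the dictionary layout and function name are illustrative.

```python
# System-wide totals copied from the DGX-1 and DGX-2 tables above.
SYSTEMS = {
    "DGX-1": {"gpus": 8, "gpu_memory_gb": 256, "cuda_cores": 40_960, "tensor_cores": 5_120},
    "DGX-2": {"gpus": 16, "gpu_memory_gb": 512, "cuda_cores": 81_920, "tensor_cores": 10_240},
}


def per_gpu(spec):
    """Divide each system-wide total by the number of GPUs in the system."""
    count = spec["gpus"]
    return {key: value // count for key, value in spec.items() if key != "gpus"}


if __name__ == "__main__":
    for name, spec in SYSTEMS.items():
        print(name, per_gpu(spec))
```

Both systems come out at the same memory, CUDA core, and Tensor Core count per GPU, so the DGX-2 gains come from doubling the number of GPUs rather than from a different GPU.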

NVIDIA DGX A100—Third Generation DGX Server

DGX A100 is NVIDIA’s third generation dedicated AI system. It provides a massive 5 PetaFLOPS of computing power in one system.

There are two models of the DGX A100, one with 320 GB of total GPU memory and the other with 640 GB. The following table summarizes the hardware features.

GPU: 8x NVIDIA A100
GPU Memory: 320 GB or 640 GB total, depending on model
Power: 6,500 W
CPU: 2x AMD Rome 7742, 128 cores total, 2.25 GHz (boosts up to 3.4 GHz)
System Memory: 1 TB or 2 TB, depending on model
Operating System Storage: 2x 1.92 TB M.2 NVMe drives
Internal Storage: U.2 NVMe drives, 15 TB or 30 TB depending on model
Operating System: Canonical Ubuntu, Red Hat Enterprise Linux, or CentOS

Hardware Architecture

DGX A100 is a powerful system on its own, but a main focus of its design is to enable massive scalability. It can be used to build large AI infrastructure clusters, using the DGX SuperPOD deployment pattern (see below). 

DGX A100 supports elastic scalability, enabling users to:

  • Scale up by connecting up to thousands of DGX A100 systems together
  • Scale down by splitting each GPU into as many as seven separate GPU instances, each with its own cache, memory, and compute resources, using NVIDIA Multi-Instance GPU (MIG) technology
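The two scaling directions above can be combined into a simple capacity estimate. The sketch below counts the maximum number of isolated GPU instances available when every GPU is split with MIG. It uses the eight GPUs per system and seven instances per GPU stated in this article; the function name is illustrative, and real deployments may choose fewer, larger instances.

```python
GPUS_PER_DGX_A100 = 8          # from the hardware table above
MAX_MIG_INSTANCES_PER_GPU = 7  # maximum split described above


def max_gpu_instances(num_systems, instances_per_gpu=MAX_MIG_INSTANCES_PER_GPU):
    """Return how many isolated GPU instances a group of DGX A100 systems can expose."""
    if num_systems < 0:
        raise ValueError("num_systems must not be negative")
    if not 1 <= instances_per_gpu <= MAX_MIG_INSTANCES_PER_GPU:
        raise ValueError("instances_per_gpu must be between 1 and 7")
    return num_systems * GPUS_PER_DGX_A100 * instances_per_gpu


if __name__ == "__main__":
    print("One system, fully split:", max_gpu_instances(1))
    print("Four systems, GPUs left whole:", max_gpu_instances(4, instances_per_gpu=1))
```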

Massively parallel GPU workloads depend on very high I/O performance. To this end, DGX A100 provides:

  • Next generation NVLink—10x faster than 4th generation PCIe
  • NVSwitch and InfiniBand networking—NVSwitch links the GPUs inside the system, and 8 Mellanox ConnectX-6 HDR InfiniBand adapters, each running at 200 Gb/s, connect it to other nodes
  • Magnum IO software SDK—makes it possible to distribute workloads across thousands of GPUs

This makes it possible to use DGX A100 for the most demanding deep learning use cases, such as conversational AI and image and video classification. These workloads can be distributed across hundreds or thousands of nodes, as needed.

Software Architecture

Like previous generations, DGX A100 uses a containerized architecture:

  • A minimal operating system and driver install
  • All applications and SDKs provisioned as containers, accessible through NGC Private Registry

The NGC Private Registry provides:

  • Containers enabling deep learning, machine learning, high performance computing (HPC), and deep learning orchestration with Kubernetes
  • Applications with pretrained models, scripts, Kubernetes Helm charts
  • Ability to store custom containers, model code, scripts and Helm charts and share them within the organization

NVIDIA DGX Station A100—Third Generation DGX Workstation

The DGX Station A100 is a lightweight workstation version of the third-generation DGX A100, aimed at developers and small teams. The Tensor Cores in its A100 GPUs use mixed-precision multiply-accumulate operations to significantly accelerate training of large neural networks.
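To illustrate the idea behind mixed-precision multiply-accumulate, the sketch below emulates it in pure Python: inputs are rounded to half precision (FP16), while the running sum is kept at higher precision. This is only a numerical illustration, not how Tensor Cores are programmed. A Python float (double precision) stands in for the wider hardware accumulator, and the function names are hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def dot_fp16_accumulate(a, b):
    """Dot product where products AND the running sum are rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc


def dot_mixed_precision(a, b):
    """Dot product with FP16 inputs but a wider accumulator (mixed precision)."""
    acc = 0.0  # Python float; stands in for a higher-precision accumulator
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc


if __name__ == "__main__":
    a = [0.1] * 4096
    b = [0.1] * 4096
    print("exact (approx.):  ", 4096 * 0.1 * 0.1)
    print("FP16 accumulator: ", dot_fp16_accumulate(a, b))
    print("mixed precision:  ", dot_mixed_precision(a, b))
```

With a pure FP16 accumulator, small products eventually stop changing the sum once it grows large enough, so the result drifts far from the true value. Keeping the accumulator wider preserves accuracy while still using compact FP16 inputs, which is the trade-off mixed-precision training relies on.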

There are two DGX Station A100 models, one with 160 GB of total GPU memory and the other with 320 GB. The following table summarizes the hardware features.

GPU: 4x NVIDIA A100
GPU Memory: 160 GB or 320 GB total, depending on model
Power: 1,500 W, 100-120 Vac
CPU: 1x AMD Rome 7742, 64 cores, 2.25 GHz (boosts up to 3.4 GHz)
System Memory: 512 GB
Operating System Storage: 1x 1.92 TB M.2 NVMe drive
Internal Storage: U.2 NVMe drive
Operating System: Ubuntu Linux

NVIDIA DGX SuperPOD—Multi-Node DGX System

NVIDIA DGX SuperPOD is a full-stack computing platform including compute, storage, networking, and tools to support data science pipelines. It comes with an implementation service from NVIDIA, to assist with deployment and ongoing maintenance.

SuperPOD can support up to 140 DGX A100 systems, combined into one AI infrastructure cluster. The cluster’s capabilities are as follows:

Computing Power: 100-700 PetaFLOPS
Number of DGX A100 Nodes: 20-140
Number of GPUs: 160-1,120
Storage: 1-10 petabytes
NVIDIA Mellanox Bandwidth: 200 Gbps
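The GPU and compute ranges in the table follow directly from the per-node figures given earlier in this article: 8 GPUs and 5 PetaFLOPS per DGX A100. The sketch below reproduces that arithmetic; the function name is illustrative.

```python
GPUS_PER_NODE = 8     # GPUs in one DGX A100
PFLOPS_PER_NODE = 5   # PetaFLOPS quoted for one DGX A100


def cluster_totals(nodes):
    """Return total GPUs and PetaFLOPS for a SuperPOD with the given node count."""
    if nodes < 0:
        raise ValueError("nodes must not be negative")
    return {"gpus": nodes * GPUS_PER_NODE, "petaflops": nodes * PFLOPS_PER_NODE}


if __name__ == "__main__":
    for nodes in (20, 140):  # smallest and largest configurations in the table
        print(nodes, "nodes:", cluster_totals(nodes))
```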

Software Architecture

The NVIDIA DGX SuperPOD solution includes the following software tools that support the AI cluster, in addition to the AI tooling provided on each DGX A100 system:

  • DGX operating system based on Ubuntu Linux, optimized and tuned specifically for DGX hardware. The DGX OS includes certified GPU drivers, network software, NFS caching, NVIDIA Data Center GPU Manager (DCGM), a GPU-enabled container runtime, the CUDA SDK, and support for NVIDIA GPUDirect.
  • Cluster management and orchestration tools—DGX POD management software for workload scheduling, and third-party certified cluster management and orchestration tools, such as Run:AI, that are tested to work on DGX POD racks.
  • Kubernetes container orchestration—DGX POD management software runs on Kubernetes for fault tolerance and high availability. It provides automated network configuration (DHCP) and automated provisioning and updates of DGX OS and software over the network (PXE).
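As a hedged illustration of how a workload might ask a Kubernetes cluster for GPUs, the sketch below builds a minimal Pod manifest. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs under the `nvidia.com/gpu` resource name; this is a general Kubernetes convention and is not specific to DGX POD management software. The pod name, image, and helper function are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpu_count):
    """Return a minimal Kubernetes Pod manifest requesting gpu_count GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["nvidia-smi"],
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image; "<tag>" is a placeholder, not a real tag.
    manifest = gpu_pod_manifest("gpu-check", "nvcr.io/nvidia/pytorch:<tag>", 2)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.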

NVIDIA DGX with Run:AI

Run:AI automates resource management and orchestration for machine learning infrastructure, including on DGX servers and workstations. With Run:AI, you can automatically run as many compute-intensive experiments as needed.

Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:

  • Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs,
  • Distributed training on multiple GPU nodes to accelerate model training times,
  • Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type,
  • Visibility into workloads and resource utilization to improve user productivity.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. 

Learn more about the Run:AI GPU virtualization platform.