The NVIDIA A100 is a data-center-grade graphics processing unit (GPU), part of a larger NVIDIA solution that allows organizations to build large-scale machine learning infrastructure. The PCIe version is a dual-slot, 10.5-inch PCI Express Gen4 card based on the Ampere GA100 GPU. The A100 is designed and optimized for deep learning workloads, and NVIDIA presents it as the world’s fastest deep learning GPU.
The A100 comes with either 40GB or 80GB of memory and has two major editions: one based on NVIDIA’s high-performance NVLink network infrastructure, and one based on traditional PCIe.
Its main features include:
There are two versions of the A100 GPU:
Both versions provide the following performance capabilities:
(*) For workloads that exploit sparsity, the A100 delivers up to double the listed performance.
Memory and GPU specifications are different for each version:
NVIDIA’s third-generation high-speed NVLink interconnect improves GPU scalability, performance, and reliability. It is included in A100 GPUs and NVIDIA’s latest NVSwitch. With more links per GPU and per NVSwitch, NVLink now offers higher GPU-to-GPU communication bandwidth, along with improved error detection and recovery.
Third-generation NVLink runs at a data rate of 50 Gbit/sec per signal pair, nearly double the rate in V100. A single A100 NVLink provides 25 GB/sec of bandwidth in each direction, the same as a V100 link, but uses only half as many signal pairs. The number of links has doubled, from 6 in V100 to 12 in A100, providing a total bandwidth of 600 GB/sec compared to 300 GB/sec in V100.
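The link counts and totals above can be checked with a little arithmetic. The following is a minimal sketch, assuming the 25 GB/sec figure applies per link in each direction and that the quoted totals count both directions; the function and constant names are illustrative only.

```python
# Assumption: each NVLink carries 25 GB/sec in each direction,
# and the quoted totals are bidirectional.
GB_PER_SEC_PER_LINK_PER_DIRECTION = 25
DIRECTIONS = 2


def total_nvlink_bandwidth(num_links: int) -> int:
    """Aggregate bidirectional NVLink bandwidth in GB/sec."""
    return num_links * GB_PER_SEC_PER_LINK_PER_DIRECTION * DIRECTIONS


v100_total = total_nvlink_bandwidth(6)    # V100 has 6 links
a100_total = total_nvlink_bandwidth(12)   # A100 has 12 links
print(f"V100: {v100_total} GB/sec, A100: {a100_total} GB/sec")
```

Under these assumptions the doubling in total bandwidth comes entirely from the doubled link count, not from a faster individual link.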
NVIDIA’s A100 Tensor Core GPU is compatible with the company’s Magnum IO and Mellanox InfiniBand and Ethernet interconnect solutions for accelerating multi-node connectivity.
The API for Magnum IO brings together file systems and storage, computing, and networking, maximizing the performance of multi-node acceleration systems. It can accelerate I/O for a wide spectrum of workloads, including AI, analytics and graphics processing, by interfacing with CUDA-X.
The A100 introduces a new asynchronous copy instruction. The copy runs in the background, allowing the streaming multiprocessor (SM) to perform other computations in the meantime.
The A100 makes it possible to load data directly from global memory into SM shared memory, without passing through the intermediate register file (RF). Async-copy reduces power consumption, uses memory bandwidth more efficiently, and reduces register file bandwidth.
Barriers are important to manage competing and overlapping tasks in a multi-threaded environment. The A100 GPU provides special hardware acceleration for barriers, and C++-conformant barrier objects, accessible from CUDA 11. This makes it possible to:
The A100 adds new hardware features that make the paths between grids in a task graph significantly faster.
CUDA task graphs offer a more efficient model for submitting work to the GPU, providing define-once, run-repeatedly execution flows. A task graph consists of a series of operations, such as kernel launches or memory copies, connected by dependencies.
Predefined task graphs enable launching numerous kernels as one operation. This improves application performance and efficiency.
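To illustrate the define-once, run-repeatedly idea, here is a conceptual sketch in plain Python. It does not use the CUDA graph API: the operation names (`copy_in`, `scale`, `offset`, `copy_out`) and the `build_graph` and `launch` helpers are hypothetical stand-ins for memory copies and kernel launches.

```python
from graphlib import TopologicalSorter


def build_graph(dependencies):
    """Define once: resolve the dependency order a single time."""
    return tuple(TopologicalSorter(dependencies).static_order())


def launch(order, operations, state):
    """Run repeatedly: replay the precomputed order as one submission."""
    for name in order:
        operations[name](state)
    return state


# Hypothetical operations standing in for memory copies and kernel launches.
operations = {
    "copy_in": lambda s: s.update(device=list(s["host"])),
    "scale": lambda s: s.update(device=[x * 2 for x in s["device"]]),
    "offset": lambda s: s.update(device=[x + 1 for x in s["device"]]),
    "copy_out": lambda s: s.update(result=list(s["device"])),
}

# Each key maps to the set of operations it depends on.
dependencies = {
    "copy_in": set(),
    "scale": {"copy_in"},
    "offset": {"scale"},
    "copy_out": {"offset"},
}

order = build_graph(dependencies)        # dependency resolution happens once
for batch in ([1, 2, 3], [10, 20]):      # the same graph is replayed per batch
    state = launch(order, operations, {"host": batch})
    print(state["result"])
```

The point of the sketch is that the cost of working out the execution order is paid once, while each later launch simply replays it.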
Tensors are mathematical objects that describe the relationship between other mathematical objects. They are usually represented as a numeric array with multiple dimensions.
When processing graphics, large amounts of data must be moved and processed in vector form. Because GPUs have strong parallel processing capabilities, they are well suited for tensor processing. Today, GPUs support General Matrix Multiply (GEMM), which is the typical mathematical operation performed on tensors.
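For reference, GEMM computes C = alpha * A * B + beta * C. The sketch below is a plain Python reference version meant only to show the arithmetic; it is not how a GPU library implements the operation, and the function name `gemm` is chosen here for illustration.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("c must have shape rows(a) x cols(b)")
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(gemm(2, a, b, 1, c))
```

Every output element is an independent dot product, which is why the operation maps so well onto massively parallel hardware.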
The NVIDIA Ampere architecture provides improved performance for GEMM, including:
Programmers have easy access to Tensor Cores on Volta, Turing or Ampere chips. In order to use Tensor Cores:
Multi-Instance GPU (MIG) is a technique for splitting one A100 GPU into multiple GPU instances; currently, up to seven instances are supported. Each GPU instance runs with its own memory, cache, and streaming multiprocessors. According to NVIDIA, in many scenarios MIG can improve GPU server utilization by up to 7X.
An A100 GPU running in MIG mode can run up to seven different AI or HPC workloads in parallel. This is especially useful for AI inference jobs that don't require all the power offered by modern GPUs.
For example, you can create:
MIG isolates GPU instances from each other, ensuring that faults in one instance don't affect the others. Each instance offers guaranteed quality of service (QoS) to ensure you get the expected latency and throughput from your workload.
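As a rough way to reason about which combinations of instances fit on one GPU, the sketch below checks a requested mix against a slice budget. The profile names and slice counts are assumptions based on NVIDIA's published profiles for the 40GB A100 (7 compute slices, 8 memory slices). The check is a necessary condition only, since real MIG placement rules are stricter, and the helper name `fits_on_one_gpu` is hypothetical.

```python
# Assumed (compute slices, memory slices) per profile on a 40GB A100.
PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_on_one_gpu(requested):
    """Check the slice budget only; actual placement rules are stricter."""
    compute = sum(PROFILES[name][0] for name in requested)
    memory = sum(PROFILES[name][1] for name in requested)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


print(fits_on_one_gpu(["1g.5gb"] * 7))            # seven small instances
print(fits_on_one_gpu(["3g.20gb", "3g.20gb"]))    # two mid-size instances
print(fits_on_one_gpu(["4g.20gb", "4g.20gb"]))    # exceeds the compute budget
```

On a real system, the driver tooling is the authority on which profiles and placements are available.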
At the training stage, MIG allows up to seven users to each access a dedicated GPU instance simultaneously, so multiple deep learning models can be tuned in parallel. As long as each job does not require the full computing capacity of the A100 GPU, this provides excellent performance for each user.
At the inference stage, MIG lets you process up to seven inference jobs at a time on a single GPU. This is very useful for inference workloads based on small, low-latency models, which do not require full GPU capabilities.
Users can use MIG for AI and HPC without changing their existing CUDA programming model. MIG runs on existing Linux operating systems as well as Kubernetes.
NVIDIA provides MIG support as part of its software stack for the A100, which includes GPU drivers, the CUDA 11 framework, an updated NVIDIA container runtime, and new Kubernetes resource types available through device plugins.
MIG can be used together with NVIDIA Virtual Compute Server (vCS) on hypervisors like Red Hat Virtualization and VMware vSphere. This enables functionality such as live migration and multi-tenancy.
The DGX A100 is an AI infrastructure server providing 5 petaFLOPS of AI computing performance in one system. It lets you run AI workloads, from development to deployment, at large scale on a unified platform.
DGX A100 systems can be connected into large clusters with up to thousands of units. You can scale by adding more DGX units, and splitting each A100 GPU into seven independent GPUs using MIG. DGX A100 provides eight NVIDIA A100 GPUs, and because each can be split into seven, this provides up to 56 independent GPUs in a single DGX, each with its own direct high bandwidth connection, memory, cache and compute.
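The instance count above scales linearly with the number of DGX units. A minimal sketch of that arithmetic, using the eight-GPU and seven-instance figures from the text (the function name is illustrative):

```python
GPUS_PER_DGX = 8                 # A100 GPUs in one DGX A100
MAX_MIG_INSTANCES_PER_GPU = 7    # upper bound when every GPU is fully split


def max_gpu_instances(dgx_units: int) -> int:
    """Upper bound on independent GPU instances across a DGX A100 cluster."""
    return dgx_units * GPUS_PER_DGX * MAX_MIG_INSTANCES_PER_GPU


for units in (1, 4, 20):
    print(f"{units} DGX unit(s): up to {max_gpu_instances(units)} GPU instances")
```

This is an upper bound: it assumes every GPU is split into the smallest instances, which suits light inference jobs more than large training runs.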
NVIDIA also provides the NGC Private Registry, with containers optimized for GPU infrastructure, that let you run deep learning, machine learning, and HPC applications. Each container comes with SDKs, trained models and running scripts, and Helm charts for rapid deployment on Kubernetes.
Run:ai automates resource management and workload orchestration for machine learning infrastructure. With Run:ai, you can automatically run as many compute-intensive experiments as needed on NVIDIA A100 and other data-center-grade GPUs.
Here are some of the capabilities you gain when using Run:ai:
Run:ai simplifies machine learning infrastructure pipelines, helping data scientists improve their productivity and the quality of their models.