What is NVIDIA A100?
The NVIDIA A100 is a data-center-grade graphics processing unit (GPU), part of a larger NVIDIA solution that allows organizations to build large-scale machine learning infrastructure. The PCIe edition is a dual-slot, 10.5-inch PCI Express Gen4 card based on the Ampere GA100 GPU. NVIDIA designed and optimized the A100 for deep learning workloads and, at launch, positioned it as the world’s fastest deep learning GPU.
The A100 comes with either 40GB or 80GB of memory, and has two major editions—one based on NVIDIA’s high performance NVLink network infrastructure, and one based on traditional PCIe.
Its main features include:
- 3rd generation Tensor Cores—new TF32 format, up to 2.5x FP64 performance for HPC workloads and up to 20x INT8 performance for AI inference over the previous generation, and support for the BF16 data format.
- HBM2e GPU memory—doubles memory capacity compared to the previous generation, with memory bandwidth of over 2TB per second.
- MIG technology—each A100 can be partitioned into up to 7 isolated Multi-Instance GPU (MIG) instances, each with up to 10 GB of memory on the 80 GB model.
- Special support for sparse models—for sparse matrix calculations (tensors with many zeros), provides up to 2x the performance of the equivalent dense calculation.
- 3rd Generation NVLink and NVSwitch—upgraded network interconnect enabling GPU-to-GPU bandwidth of 600 GB/s.
In this article, you will learn:
- NVIDIA A100 System Specifications
- NVIDIA A100 Key Features
- What is a Tensor Core?
- What are Multi Instance GPUs (MIG)?
- What is NVIDIA DGX A100?
- NVIDIA A100 with Run:AI
There are two versions of the A100 GPU:
- NVIDIA A100 for NVLink—based on optimized NVIDIA networking infrastructure for highest performance—4/8 SXM on NVIDIA HGX™ A100.
- NVIDIA A100 for PCIe—based on traditional PCIe slots, letting you deploy the GPU on a larger variety of servers.
Both versions provide the following peak performance (TF denotes teraFLOPS, TOPS denotes tera-operations per second):
- Peak performance for FP64—9.7 TF, 19.5 TF for Tensor Cores
- Peak performance for FP32—19.5 TF
- Peak performance for FP16, BFLOAT16—312 TF for Tensor Cores*
- Peak performance for Tensor Float 32—156 TF*
- Peak performance for INT8—624 TOPS on Tensor Cores*
- Peak performance for INT4—1,248 TOPS on Tensor Cores*
(*) The A100 delivers up to double these figures for workloads that exploit sparsity.
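The sparsity footnote can be made concrete with a little arithmetic. The sketch below copies the starred Tensor Core figures from the list above and doubles them. The dictionary and function names are illustrative only, and the results are theoretical peaks, not measured throughput.

```python
# Dense Tensor Core peaks taken from the list above
# (TFLOPS for floating-point formats, TOPS for integer formats).
# Only these starred entries are described as benefiting from sparsity.
DENSE_TENSOR_CORE_PEAK = {
    "TF32": 156,
    "FP16/BF16": 312,
    "INT8": 624,
    "INT4": 1248,
}

SPARSITY_SPEEDUP = 2  # "double performance for workloads with sparsity"


def sparse_peak(fmt: str) -> int:
    """Return the theoretical peak for a format when sparsity is exploited."""
    return DENSE_TENSOR_CORE_PEAK[fmt] * SPARSITY_SPEEDUP


for fmt, dense in DENSE_TENSOR_CORE_PEAK.items():
    print(f"{fmt}: dense {dense}, sparse {sparse_peak(fmt)}")
```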
Memory and GPU specifications are different for each version:
- NVLink version—40 or 80 GB GPU memory, 1,555 or 2,039 GB/s memory bandwidth, up to 7 MIGs with 5 GB each (for A100 with 40 GB memory) or 10 GB each (for A100 with 80 GB memory), max power 400 W.
- PCIe version—40 GB GPU memory, 1,555 GB/s memory bandwidth, up to 7 MIGs with 5 GB each, max power 250 W.
NVIDIA’s third-generation high-speed NVLink interconnect improves GPU scalability, performance, and reliability. It is included in A100 GPUs and NVIDIA’s latest NVSwitch. Thanks to additional links per GPU and NVSwitch, NVLink now offers wider GPU-GPU communication bandwidth and enhanced error detection, with recovery capabilities.
The latest NVLink runs at a data rate of 50 Gbit/sec per signal pair, roughly double the rate used in the V100. A single A100 NVLink still provides 25 GB/sec of bandwidth in each direction, as on the V100, but uses only half as many signal pairs per link. The number of links has doubled, 12 compared to V100’s 6, providing a total bandwidth of 600 GB/sec, compared to 300 GB/sec in the V100.
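The total bandwidth figures follow from the link counts. The sketch below is a rough check, assuming each link carries 25 GB/sec in each direction on both generations and that the quoted totals count both directions (assumptions based on NVIDIA's published figures, not stated explicitly above). The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NVLinkConfig:
    links: int                   # NVLink links per GPU
    gb_per_s_per_direction: int  # bandwidth of one link in one direction

    def total_bandwidth(self) -> int:
        """Aggregate bandwidth over all links, counting both directions (GB/s)."""
        return self.links * self.gb_per_s_per_direction * 2


# Assumed per-link figure: 25 GB/s each way on both generations.
V100 = NVLinkConfig(links=6, gb_per_s_per_direction=25)
A100 = NVLinkConfig(links=12, gb_per_s_per_direction=25)

print("V100:", V100.total_bandwidth(), "GB/s")
print("A100:", A100.total_bandwidth(), "GB/s")
```

Under these assumptions, the doubling from 300 to 600 GB/sec comes entirely from the doubled link count, not from faster individual links.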
NVIDIA’s A100 Tensor Core GPU is compatible with the company’s Magnum IO and Mellanox InfiniBand and Ethernet interconnect solutions for accelerating multi-node connectivity.
The API for Magnum IO brings together file systems and storage, computing, and networking, maximizing the performance of multi-node acceleration systems. It can accelerate I/O for a wide spectrum of workloads, including AI, analytics and graphics processing, by interfacing with CUDA-X.
The A100 introduces a new asynchronous copy instruction. The copy is performed in the background, so the streaming multiprocessor (SM) can carry out other computations in the meantime.
The instruction loads data from global memory straight into SM shared memory, without passing through the register file (RF). Async-copy reduces register file bandwidth, uses memory bandwidth more efficiently, and reduces power consumption.
Barriers are important to manage competing and overlapping tasks in a multi-threaded environment. The A100 GPU provides special hardware acceleration for barriers, and C++-conformant barrier objects, accessible from CUDA 11. This makes it possible to:
- Split arrive and wait operations for asynchronous barriers
- Implement producer-consumer patterns directly in CUDA
- Synchronize CUDA threads at any level of granularity (beyond warp/block)
The A100 adds hardware features that make the paths between grids in a task graph significantly faster.
CUDA task graphs offer a more efficient model for GPU work submission, based on define-once, run-repeatedly execution flows. A task graph consists of a series of operations, such as kernel launches or memory copies, connected by dependencies.
Predefined task graphs enable launching numerous kernels as one operation. This improves application performance and efficiency.
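The define-once, run-repeatedly idea can be illustrated without a GPU. The following is a plain-Python analogy, not the CUDA Graphs API: the `TaskGraph` class and its `add`, `instantiate`, and `launch` methods are hypothetical names chosen for this sketch. Dependencies are resolved once, and the resulting execution order is then replayed for each new input.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class TaskGraph:
    """Toy stand-in for a task graph: nodes are callables, edges are dependencies."""

    def __init__(self):
        self._tasks = {}
        self._deps = {}
        self._order = None

    def add(self, name, fn, depends_on=()):
        self._tasks[name] = fn
        self._deps[name] = set(depends_on)
        self._order = None  # the graph changed, so any cached order is stale

    def instantiate(self):
        # Resolve dependencies once, up front ("define once").
        self._order = list(TopologicalSorter(self._deps).static_order())

    def launch(self, state):
        # Replay the precomputed order ("run repeatedly").
        if self._order is None:
            self.instantiate()
        for name in self._order:
            self._tasks[name](state)
        return state


graph = TaskGraph()
graph.add("copy_in", lambda s: s.update(x=list(s["input"])))
graph.add("scale", lambda s: s.update(x=[2 * v for v in s["x"]]), depends_on=["copy_in"])
graph.add("reduce", lambda s: s.update(total=sum(s["x"])), depends_on=["scale"])
graph.instantiate()

for batch in ([1, 2, 3], [10, 20]):
    print(graph.launch({"input": batch})["total"])
```

In this analogy the saving comes from skipping dependency resolution on every run; on a GPU, the corresponding saving is reduced launch overhead when many kernels are submitted as one operation.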
Tensors are mathematical objects that describe the relationship between other mathematical objects. They are usually represented as a numeric array with multiple dimensions.
When processing graphics, large amounts of data must be moved and processed in vector form. Because GPUs have strong parallel processing capabilities, they are well suited for tensor processing. Today, all GPUs support General Matrix Multiplication (GEMM), which is the typical mathematical operation performed on tensors.
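For readers unfamiliar with GEMM, the operation is conventionally written as C = alpha * (A x B) + beta * C. The sketch below is a deliberately naive pure-Python reference, intended only to show what is being computed. The function name and the list-of-lists representation are choices made for this example, and a real workload would call an optimized library instead.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b must match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (m, n)")
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(gemm(1, a, b, 1, c))
```

Each output element is an independent sum of multiply-add steps, which is why the operation maps so well onto massively parallel hardware.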
The NVIDIA Ampere architecture provides improved performance for GEMM, including:
- Improving throughput from 64 to 256 FP16 fused multiply-add (FMA) operations per clock per Tensor Core
- Support for new data formats
- Fast processing for sparse tensors (tensors with a large number of zero values)
Programmers have easy access to Tensor Cores on Volta, Turing or Ampere chips. In order to use Tensor Cores:
- Add a flag to the code to indicate that Tensor Cores should be used (see some examples in the CUDA 9 documentation)
- Ensure you are using a supported data format
- Ensure the matrix dimensions are multiples of 8
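One common way to meet the multiple-of-8 requirement is to zero-pad matrices up to the next multiple. The sketch below shows the arithmetic in plain Python. `round_up` and `pad_matrix` are hypothetical helpers written for this example, and whether padding is needed at all depends on the library and data format in use.

```python
def round_up(n: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return -(-n // multiple) * multiple


def pad_matrix(matrix, multiple=8, fill=0.0):
    """Zero-pad a list-of-rows matrix so both dimensions are multiples of `multiple`."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    padded_rows, padded_cols = round_up(rows, multiple), round_up(cols, multiple)
    # Extend each existing row, then append rows made entirely of fill values.
    padded = [list(row) + [fill] * (padded_cols - cols) for row in matrix]
    padded += [[fill] * padded_cols for _ in range(padded_rows - rows)]
    return padded


m = pad_matrix([[1.0] * 10 for _ in range(5)])  # 5 x 10 input
print(len(m), len(m[0]))
```

Because the padding values are zero, the top-left block of any product computed from the padded matrices matches the product of the originals.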
Multi-Instance GPU (MIG) is a technique for splitting one A100 GPU into multiple GPU instances. Currently, up to 7 instances are supported. Each GPU instance runs with its own memory, cache, and streaming multiprocessors. According to NVIDIA, in many scenarios, MIG can improve GPU server utilization by up to 7X.
An A100 GPU running in MIG mode can run up to seven different AI or HPC workloads in parallel. This is especially useful for AI inference jobs that don’t require all the power offered by modern GPUs.
For example, on an A100 with 40 GB of memory, you can create:
- 2 MIG instances with 20 GB of memory each
- 3 MIG instances with 10 GB each
- 7 MIG instances with 5 GB each
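The example layouts above all fit within a 40 GB card and the seven-instance limit. The sketch below checks a proposed layout against just those two constraints. It is a simplification: real MIG configurations are built from fixed instance profiles with placement rules that this check ignores, and the constant and function names are illustrative.

```python
A100_40GB_MEMORY_GB = 40  # assumes the 40 GB A100 implied by the examples above
MAX_MIG_INSTANCES = 7


def fits_on_a100_40gb(instance_memory_gb):
    """Simplified check: instance count and total memory only, no placement rules."""
    return (
        len(instance_memory_gb) <= MAX_MIG_INSTANCES
        and sum(instance_memory_gb) <= A100_40GB_MEMORY_GB
    )


for layout in ([20, 20], [10, 10, 10], [5] * 7, [20, 20, 5], [5] * 8):
    print(layout, fits_on_a100_40gb(layout))
```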
MIG isolates GPU instances from each other, ensuring that faults in one instance don’t affect the others. Each instance offers guaranteed quality of service (QoS) to ensure you get the expected latency and throughput from your workload.
At the training stage, MIG allows up to seven operators to access a dedicated GPU simultaneously, so multiple deep learning models can be tuned in parallel. As long as each job does not require the full computing capacity of the A100 GPU, this provides excellent performance for each operator.
At the inference stage, MIG lets you process up to seven inference jobs at a time on a single GPU. This is very useful for inference workloads based on small, low-latency models, which do not require full GPU capabilities.
Users can use MIG for AI and HPC without changing their existing CUDA programming model. MIG runs on existing Linux operating systems as well as Kubernetes.
NVIDIA provides MIG support for the A100 as part of its software stack, which includes GPU drivers, the CUDA 11 framework, an updated NVIDIA container runtime, and new Kubernetes resource types available through device plugins.
MIG can also be used with NVIDIA Virtual Compute Server (vCS) on hypervisors such as Red Hat Virtualization and VMware vSphere. This enables functionality such as live migration and multi-tenancy.
The DGX A100 is an AI infrastructure server providing 5 petaFLOPS of computing power in one system. It enables operating AI workloads, from development to deployment, at large scale on a unified platform.
DGX A100 systems can be connected into large clusters with up to thousands of units. You can scale by adding more DGX units, and splitting each A100 GPU into seven independent GPUs using MIG. DGX A100 provides eight NVIDIA A100 GPUs, and because each can be split into seven, this provides up to 56 independent GPUs in a single DGX, each with its own direct high bandwidth connection, memory, cache and compute.
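The DGX A100 headline numbers can be cross-checked against the per-GPU figures quoted earlier. The sketch below assumes the 5 petaFLOPS figure refers to FP16 Tensor Core throughput with sparsity; this is an assumption, since the text does not say which precision it is based on. The variable names are illustrative.

```python
GPUS_PER_DGX = 8
MAX_MIG_PER_GPU = 7

# Per-GPU FP16 Tensor Core peak (312 TF), doubled for sparsity.
# Assumption: this is the basis of the "5 petaFLOPS" system figure.
FP16_TENSOR_TFLOPS_SPARSE = 312 * 2

mig_instances = GPUS_PER_DGX * MAX_MIG_PER_GPU
peak_pflops = GPUS_PER_DGX * FP16_TENSOR_TFLOPS_SPARSE / 1000

print(f"Independent GPU instances per DGX A100: {mig_instances}")
print(f"Approximate peak: {peak_pflops} petaFLOPS")
```

Under that assumption, the result rounds to the quoted 5 petaFLOPS, and the instance count matches the 56 independent GPUs described above.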
NVIDIA also provides the NGC Private Registry, with containers optimized for GPU infrastructure that let you run deep learning, machine learning, and HPC applications. The containers come with SDKs, trained models, run scripts, and Helm charts for rapid deployment on Kubernetes.
Learn more in our detailed guide to NVIDIA DGX
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed on NVIDIA A100 and other data-center-grade GPUs.
Here are some of the capabilities you gain when using Run:AI:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.