Deep Learning with GPUs

Making the Most of GPUs for Your Deep Learning Project

Graphics processing units (GPUs), originally developed for accelerating graphics processing, can dramatically speed up computational processes for deep learning. They are an essential part of a modern artificial intelligence infrastructure, and new GPUs have been developed and optimized specifically for deep learning.

Read on to understand the benefits of GPUs for deep learning projects, the difference between consumer-grade GPUs, data center GPUs and GPU servers, and several ways you can evaluate your GPU performance.

This is part of an extensive series of guides about AI Technology.

The Principles of GPU Computing

Graphics processing units (GPUs) are specialized processors with many cores that you can use to speed up computation. These cores were initially designed to process images and visual data. However, GPUs are now being adopted to accelerate other computational workloads, such as deep learning. This is because GPUs can run many computations in parallel, which suits massively distributed computational processes.

Modes of parallelism

The primary benefit of GPUs is parallelism, the simultaneous processing of parts of a whole. Flynn's taxonomy classifies computer architectures into four types, according to how many instruction streams and data streams they handle at once:

  • Single instruction, single data (SISD)
  • Single instruction, multiple data (SIMD)
  • Multiple instructions, single data (MISD)
  • Multiple instructions, multiple data (MIMD)

Most CPUs are multi-core processors that operate with an MIMD architecture. In contrast, GPUs use a SIMD-style architecture (NVIDIA calls its variant single instruction, multiple threads, or SIMT). This difference makes GPUs well-suited to deep learning workloads, which require the same operation to be performed on numerous data items.
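To make the SIMD idea concrete, here is a minimal pure-Python sketch. It only simulates the behavior and does not run on a GPU. The names `simd_map` and `instructions_issued` and the lane width of 4 are assumptions made for this illustration.

```python
from typing import Callable, List, Sequence


def simd_map(op: Callable[[float], float], data: Sequence[float], lanes: int = 4) -> List[float]:
    """Apply one operation to data in lock-step groups of `lanes` elements.

    Each outer iteration models a single instruction that is issued once
    and applied to up to `lanes` data items at the same time.
    """
    out: List[float] = []
    for start in range(0, len(data), lanes):
        group = data[start:start + lanes]   # items covered by one "instruction"
        out.extend(op(x) for x in group)    # same operation, different data
    return out


def instructions_issued(n_items: int, lanes: int) -> int:
    """Number of instruction issues needed to cover n_items (ceiling division)."""
    return -(-n_items // lanes)


def relu(x: float) -> float:
    """A typical element-wise deep learning operation."""
    return max(0.0, x)


activations = simd_map(relu, [-2.0, -0.5, 0.0, 1.5, 3.0], lanes=4)
print(activations)                  # [0.0, 0.0, 0.0, 1.5, 3.0]
print(instructions_issued(5, 1))    # 5 issues when each instruction handles one item
print(instructions_issued(5, 4))    # 2 issues when each instruction handles four items
```

The wider the lanes, the fewer instruction issues are needed for the same data. That is why element-wise operations over large tensors map well onto GPU hardware.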

General purpose GPU programming

Because GPUs were originally built for graphics, using them for general computation once required expressing the work through graphics APIs, such as OpenGL, and their shading languages. These tools were designed for rendering, which made them impractical to learn for other purposes and created a barrier to use.

In 2007, NVIDIA launched the CUDA framework, which removed this barrier and provided wider access to GPU resources. CUDA is based on C and provides an API that developers can use to apply GPU processing to general-purpose tasks, including machine learning.

How modern deep learning frameworks use GPUs

After NVIDIA introduced CUDA, several deep learning frameworks were developed on top of it, such as PyTorch and TensorFlow. These frameworks abstract the complexities of programming directly with CUDA and have made GPU processing accessible to modern deep learning implementations.
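As a rough illustration of that abstraction, the sketch below uses PyTorch to choose a device and run a matrix multiplication without any hand-written CUDA code. It assumes PyTorch may or may not be installed and falls back to reporting "cpu" if it is missing. The helper name `pick_device` is hypothetical.

```python
try:
    import torch  # optional: the sketch degrades gracefully if PyTorch is absent
except ImportError:
    torch = None


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"running on: {device}")

if torch is not None:
    # The same high-level code runs on either device. The framework
    # dispatches to GPU kernels when the tensors live on the GPU.
    a = torch.randn(512, 512, device=device)
    b = torch.randn(512, 512, device=device)
    c = a @ b
    print(c.shape)
```

The only GPU-specific part is the `device` argument. Everything else is ordinary tensor code.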

Learn to use GPUs in popular deep learning frameworks, in our guides about PyTorch GPU and TensorFlow GPU.

Why Use GPUs for Deep Learning?

GPUs can perform many computations simultaneously. This lets you distribute training processes and can significantly speed up machine learning operations. A GPU packs a large number of cores, each of which uses fewer resources than a CPU core, without sacrificing efficiency or power.

When designing your deep learning architecture, your decision to include GPUs relies on several factors:

  • Memory bandwidth: GPUs can provide the bandwidth needed to accommodate large datasets. This is because GPUs include dedicated video RAM (VRAM), which leaves CPU memory free for other tasks.
  • Dataset size: GPUs working in parallel scale more easily than CPUs, enabling you to process massive datasets faster. The larger your datasets are, the greater the benefit you can gain from GPUs.
  • Optimization: a downside of GPUs is that optimizing long-running individual tasks is sometimes more difficult than with CPUs.
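One common way to spread a large dataset across several GPUs is to give each one its own slice of the samples. The following is a minimal sketch of that idea in plain Python. The function name `shard_indices` and the sample counts are illustrative assumptions, not part of any particular framework.

```python
from typing import List


def shard_indices(n_samples: int, n_workers: int) -> List[range]:
    """Split sample indices 0..n_samples-1 into near-equal contiguous shards."""
    base, extra = divmod(n_samples, n_workers)
    shards, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)  # first `extra` workers get one more
        shards.append(range(start, start + size))
        start += size
    return shards


shards = shard_indices(n_samples=10, n_workers=4)
print([list(s) for s in shards])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

In practice, deep learning frameworks provide their own utilities for this kind of data-parallel split. The sketch only shows the underlying arithmetic.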

GPU Technology Options for Deep Learning

When incorporating GPUs into your deep learning implementations, you have a variety of options, although NVIDIA dominates the market. Within these options, you can choose from consumer-grade GPUs, data center GPUs, and GPU servers such as NVIDIA DGX.

Consumer-Grade GPUs

Consumer GPUs are not appropriate for large-scale deep learning projects, but can offer an entry point for implementations. These GPUs enable you to supplement existing systems cheaply and can be useful for model building or low-level testing.

  • NVIDIA Titan V: depending on the edition, this GPU provides 12GB or 32GB of memory and 110 or 125 teraflops of Tensor Core performance. It is based on NVIDIA’s Volta architecture.
  • NVIDIA Titan RTX: provides 24GB memory and 130 teraflops of performance. It includes Tensor and RT Core technologies and is based on NVIDIA’s Turing GPU architecture.
  • NVIDIA GeForce RTX 2080 Ti: provides 11GB memory and 120 teraflops of performance. It is designed for gaming enthusiasts rather than professional use and is also based on NVIDIA’s Turing GPU architecture.

Data Center GPUs

Data center GPUs are the standard for production deep learning implementations. These GPUs are designed for large-scale projects and can provide enterprise-grade performance.

  • NVIDIA A100: provides 40GB memory and 624 teraflops of performance. It is designed for HPC, data analytics, and machine learning and includes multi-instance GPU (MIG) technology, which partitions a single GPU into isolated instances so that several workloads can share it.
  • NVIDIA V100: provides up to 32GB memory and roughly 125 teraflops of Tensor Core performance. It is based on NVIDIA Volta technology and was designed for high performance computing (HPC), machine learning, and deep learning.
  • NVIDIA Tesla P100: provides 16GB memory and 21 teraflops of performance. It is designed for HPC and machine learning and is based on the Pascal architecture.
  • NVIDIA Tesla K80: provides up to 24GB memory and 8.73 teraflops of performance. It is designed for data analytics and scientific computing and is based on the Kepler architecture.
  • Google tensor processing units (TPUs): although TPUs are not GPUs, they are an alternative to the NVIDIA GPUs commonly used for deep learning workloads. TPUs are application-specific integrated circuits (ASICs), available as chips or as a cloud service, that are designed for deep learning. They were developed specifically for the Google Cloud Platform and for use with TensorFlow. Each Cloud TPU v3 device provides 128GB memory and 420 teraflops of performance.

DGX Servers

NVIDIA DGX servers are enterprise-grade, full-stack solutions. These systems are designed specifically for machine learning and deep learning operations. Systems are plug-n-play, and you can deploy on bare metal servers or in containers.

  • DGX-1: provides two Intel Xeon CPUs and up to eight V100 Tensor Core GPUs, each with 32GB memory. It runs the Ubuntu Linux host OS. DGX-1 includes the CUDA toolkit, NVIDIA’s Deep Learning SDK, the Docker Engine Utility, and the DIGITS deep learning training application.
  • DGX-2: provides two Xeon Platinum CPUs and 16 V100 Tensor Core GPUs, each with 32GB memory. It provides significant scalability and parallelism and is based on the NVSwitch networking fabric, for a reported 195x faster training than the DGX-1.
  • DGX A100: provides two 64-core AMD CPUs and eight A100 GPUs with 320GB of total GPU memory (40GB each), for five petaflops of performance. It is designed for machine learning training, inference, and analytics and is fully optimized for CUDA-X. You can combine multiple DGX A100 units to create a super cluster.

Learn more in our guide to NVIDIA deep learning GPU, which explains how to choose the right GPU for your deep learning projects.

Top Metrics for Evaluating Your Deep Learning GPU Performance

GPUs are expensive resources that you need to optimize for a sustainable ROI. However, many deep learning projects utilize only 10-30% of their GPU resources, often due to inefficient allocation. To ensure that you are using your GPU investments efficiently, you should monitor and apply the following metrics.

GPU utilization

GPU utilization measures the percentage of time during which one or more kernels are running on your GPU. You can use this metric to determine your GPU capacity requirements and identify bottlenecks in your pipelines. You can access it with the NVIDIA System Management Interface (nvidia-smi).

If you find that you are underusing resources, you may be able to distribute processes more effectively. In contrast, maximum utilization means you may benefit from adding GPUs to your operations.
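The sketch below shows one way to collect these numbers from Python by calling the `nvidia-smi` command-line tool with its `--query-gpu` option. The helper names, the 30% "underused" threshold, and the sample output are assumptions made for illustration. The query fields used here are standard `nvidia-smi` properties.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total",
          "power.draw", "temperature.gpu"]
LOW_UTILIZATION = 30.0  # hypothetical threshold, in percent


def parse_gpu_csv(text: str) -> List[Dict[str, Optional[float]]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        row: Dict[str, Optional[float]] = {}
        for name, value in zip(FIELDS, line.split(",")):
            try:
                row[name] = float(value.strip())
            except ValueError:      # e.g. "[N/A]" on boards without a sensor
                row[name] = None
        rows.append(row)
    return rows


def query_gpus() -> List[Dict[str, Optional[float]]]:
    """Run nvidia-smi if it is installed; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
           "--format=csv,noheader,nounits"]
    return parse_gpu_csv(subprocess.check_output(cmd, text=True))


# Made-up sample output for two GPUs, used here instead of a live query.
sample = "0, 27, 3021, 16384, 61.50, 48\n1, 96, 15870, 16384, 243.10, 77\n"
for gpu in parse_gpu_csv(sample):
    util = gpu["utilization.gpu"]
    status = "underused" if util is not None and util < LOW_UTILIZATION else "busy"
    print(f"GPU {int(gpu['index'])}: {util}% utilization ({status})")
```

On a machine with an NVIDIA driver installed, calling `query_gpus()` in a loop, for example once per second during training, gives a simple utilization trace.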

GPU memory access and usage

GPU memory access and usage metrics measure the percentage of time that a GPU’s memory controller is in use, covering both read and write operations, as well as how much GPU memory is allocated. You can use these metrics to optimize the batch size for your training and gauge the efficiency of your deep learning program. You can access a comprehensive list of memory metrics through nvidia-smi.
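As a worked example of using memory usage to choose a batch size, the sketch below extrapolates from two measurements. It assumes memory use grows roughly linearly with batch size, which is a simplification. The function name, the 90% headroom factor, and all of the numbers are hypothetical.

```python
def max_batch_size(mem1_mib: float, batch1: int,
                   mem2_mib: float, batch2: int,
                   capacity_mib: float, headroom: float = 0.9) -> int:
    """Estimate the largest batch that fits, assuming memory = fixed + per_sample * batch."""
    per_sample = (mem2_mib - mem1_mib) / (batch2 - batch1)
    if per_sample <= 0:
        raise ValueError("memory use must grow with batch size for this estimate")
    fixed = mem1_mib - per_sample * batch1      # weights, framework overhead, etc.
    usable = capacity_mib * headroom            # leave a safety margin
    return int((usable - fixed) // per_sample)


# Hypothetical readings: 4000 MiB at batch 16 and 6400 MiB at batch 32 on a 16384 MiB GPU.
estimate = max_batch_size(4000, 16, 6400, 32, capacity_mib=16384)
print(estimate)  # 87
```

Treat the result as a starting point for experiments rather than a guarantee, because real memory use also depends on the framework's allocator and on the model.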

Power usage and temperatures

Power usage and temperature metrics enable you to measure how hard your system is working and can help you predict and control power consumption. These metrics are typically measured at the power supply unit and include the resources used by compute units, memory units, and cooling elements. They are important because excessive temperatures can cause thermal throttling, which slows compute processes, and can even damage hardware.

Time to solution

Time to solution is a holistic metric: you define a desired accuracy level and measure how long it takes to train your model to reach it. That training time differs between GPUs, depending on the model, distribution strategy, and dataset you are running. Once you choose a GPU setup, you can use time to solution measurements to tune batch sizes or apply mixed-precision optimization to improve performance.
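A minimal sketch of this measurement follows. It assumes you log elapsed training time together with validation accuracy. The function name `time_to_solution`, the setup labels, and the logged values are all made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def time_to_solution(log: Iterable[Tuple[float, float]], target: float) -> Optional[float]:
    """Return the elapsed seconds at which accuracy first reaches `target`.

    `log` holds (elapsed_seconds, accuracy) pairs in chronological order.
    Returns None if the target is never reached.
    """
    for elapsed_s, accuracy in log:
        if accuracy >= target:
            return elapsed_s
    return None


# Hypothetical training logs for two setups.
runs = {
    "1 GPU, batch 64": [(600, 0.71), (1200, 0.84), (1800, 0.90), (2400, 0.92)],
    "4 GPUs, batch 256": [(200, 0.65), (400, 0.80), (600, 0.89), (800, 0.91)],
}
for name, log in runs.items():
    print(name, time_to_solution(log, target=0.90))
# 1 GPU, batch 64 1800
# 4 GPUs, batch 256 800
```

Comparing setups at a fixed accuracy target, rather than at a fixed number of epochs, accounts for changes such as a larger batch size that alter how quickly the model converges.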

Efficient Deep Learning GPU Management With Run:AI

Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:AI:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI accelerates deep learning on GPUs by helping data scientists optimize expensive compute resources and improve the quality of their models.

Learn more about the Run:AI GPU virtualization platform.