Basics and Best Practices

NVIDIA CUDA in Deep Learning

Deep learning implementations require significant computing power, like that offered by GPUs. Without GPU systems, many deep learning models would take significantly longer to train, making them more costly and slowing innovation.

For example, when training the models for Google Translate, Google implemented a system with 2,000 server-grade NVIDIA GPUs and used it to run hundreds of week-long TensorFlow training jobs. On a traditional CPU-based system, each of these jobs would have taken months.

Although TensorFlow is one of the most popular frameworks for deep learning, many other frameworks also rely on CUDA for GPU support. These include Torch, PyTorch, Keras, MXNet, and Caffe2.  

Most of these frameworks use the cuDNN library, which provides GPU-accelerated primitives for deep neural networks. Because of this shared reliance, the frameworks tend to perform similarly for equivalent workloads, and updates to CUDA or cuDNN generally affect all of them.

Besides cuDNN, there are three other main GPU-accelerated libraries for deep learning: TensorRT, NCCL, and DeepStream. TensorRT is NVIDIA’s library for high-performance deep learning inference optimization and runtime. DeepStream is a library for video inference. NCCL is a library of multi-node and multi-GPU communication primitives.

In addition to its components for deep learning, the CUDA Toolkit includes various other libraries and tools. These provide support for debugging and optimization, compiling, documentation, runtimes, signal processing, and parallel algorithms. CUDA Toolkit libraries work across NVIDIA’s GPU product lines, although each Toolkit release supports a specific range of GPU architectures.

What is CUDA from NVIDIA?

CUDA is a parallel computing platform and programming model created by NVIDIA for general computing on its graphics processing units (GPUs). CUDA enables developers to reduce the time it takes to perform compute-intensive tasks by running workloads on a GPU and, where needed, distributing them across multiple GPUs.

When you perform compute operations using GPUs, both central processing units (CPUs) and GPUs are involved. The CPU performs the sequential parts of the workload, since it is optimized for single-thread performance. The parallel, multi-threaded parts run on the GPU, which then returns its results to the CPU.
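The division of labor described above can be sketched as a small program. This is a minimal illustration rather than a production pattern: the kernel name `squareElements`, the array size, and the launch configuration are assumptions chosen for the example, and error handling is reduced to a single check.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Device code (hypothetical example kernel): each thread squares one element.
__global__ void squareElements(float *data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * data[i];
    }
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    // Sequential part on the CPU: prepare the input.
    float *host = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) {
        host[i] = (float)i;
    }

    // Parallel part on the GPU: copy the input over and launch the kernel.
    float *device = nullptr;
    cudaMalloc(&device, bytes);
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
    squareElements<<<(n + 255) / 256, 256>>>(device, n);

    // Return the result to the CPU. This copy waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host, device, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Sequential part on the CPU again: check the result.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (host[i] != (float)i * (float)i) ok = false;
    }
    printf("%s\n", ok ? "all elements squared" : "mismatch");

    cudaFree(device);
    free(host);
    return ok ? 0 : 1;
}
```

A file like this would typically be compiled with `nvcc`, NVIDIA's CUDA compiler.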

This is part of an extensive series of guides about AI Technology.

CUDA Programming

CUDA provides support for several popular languages, including C, C++, Fortran, Python, and MATLAB. The interface is based on C/C++, and the compiler applies abstractions to incorporate parallelism and simplify programming.

The CUDA programming model exposes three main abstractions through a small set of language extensions:

  • CUDA blocks: groups of threads that execute the same kernel.
  • Shared memory: fast on-chip memory that is shared by all threads in a block.
  • Synchronization barriers: points at which all threads in a block wait for each other before continuing.
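To make these three pieces concrete, here is a minimal sketch that uses all of them to sum an array. The kernel name `blockSum`, the block size of 256, and the use of unified memory (`cudaMallocManaged`) are assumptions made for brevity; the reduction loop also assumes the block size is a power of two.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed block size; must be a power of two here

// Hypothetical example kernel: each block sums its slice of the input.
__global__ void blockSum(const float *in, float *blockTotals, int n) {
    // Shared memory: one slot per thread, visible to the whole block.
    __shared__ float cache[THREADS_PER_BLOCK];

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    // Barrier: wait until every thread in the block has written its slot.
    __syncthreads();

    // Tree reduction within the block, halving the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Thread 0 publishes the total for this block.
    if (threadIdx.x == 0) {
        blockTotals[blockIdx.x] = cache[0];
    }
}

int main() {
    const int n = 1000;
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float *in = nullptr, *totals = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&totals, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) {
        in[i] = 1.0f;
    }

    blockSum<<<blocks, THREADS_PER_BLOCK>>>(in, totals, n);
    cudaDeviceSynchronize();

    // Combine the per-block totals on the CPU.
    float sum = 0.0f;
    for (int b = 0; b < blocks; ++b) {
        sum += totals[b];
    }
    printf("sum = %.1f (expected %.1f)\n", sum, (float)n);

    cudaFree(in);
    cudaFree(totals);
    return 0;
}
```

Note that both `__syncthreads()` calls sit outside the `if` statements, so every thread in the block reaches them.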

One of the main benefits of the CUDA model is that it lets you write scalar programs. In the example below, a CUDA kernel adds two vectors (A and B) and writes the result to a third vector (C). The kernel body is written as though it added a single pair of scalar numbers. When launched on the GPU, it runs in parallel, with each vector element handled by a different thread.

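__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
  // Global index of this thread across all blocks.
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  // Skip threads in the last block that fall past the end of the vectors.
  if (i < numElements) {
    C[i] = A[i] + B[i];
  }
}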

The CUDA model enables you to scale your programs transparently. You can separate applications and computations into independent functions or problems and perform them with CUDA blocks. Each block is assigned to a sub-problem or function and further breaks down the tasks to fit the available threads. Blocks are automatically scheduled on your GPU multiprocessors by the CUDA runtime.
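As a rough illustration of this scaling, the sketch below launches the `vectorAdd` kernel from a host program. The vector length, the block size of 256, and the use of unified memory are assumptions for the example. The number of blocks is derived from the problem size only; the program queries the number of streaming multiprocessors purely to print it, and the launch itself does not depend on it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    const int numElements = 50000;  // assumed problem size
    const size_t bytes = numElements * sizeof(float);

    float *A = nullptr, *B = nullptr, *C = nullptr;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < numElements; ++i) {
        A[i] = (float)i;
        B[i] = 2.0f * (float)i;
    }

    // The grid size depends on the data, not on the GPU model.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    // Informational only: how many multiprocessors the blocks will be spread over.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d blocks, %d multiprocessors\n", blocksPerGrid, prop.multiProcessorCount);

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, numElements);
    cudaDeviceSynchronize();

    bool ok = true;
    for (int i = 0; i < numElements; ++i) {
        if (C[i] != 3.0f * (float)i) ok = false;
    }
    printf("%s\n", ok ? "result verified" : "mismatch");

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return ok ? 0 : 1;
}
```

The same binary should run unchanged on GPUs with different multiprocessor counts; only the time needed to work through the blocks differs.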

The diagram below shows how this can work with a CUDA program divided into eight blocks. The runtime allocates the blocks to the streaming multiprocessors (SMs) available on the GPU. Note that the diagram shows two separate GPUs, one with four SMs and one with eight, to highlight how the same blocks can be distributed across different hardware without code changes.

[Figure: a CUDA program of eight blocks distributed across a GPU with four multiprocessors and a GPU with eight. Image source: NVIDIA]

CUDA Recommendations and Best Practices

Maximizing Parallel Execution

Maximizing parallel execution of your algorithm helps increase the overall speed of your execution. It also allows for more complex computations.

To achieve this, structure your algorithm to expose as much parallelism as possible and map that parallelism efficiently to your hardware. Choose the execution configuration of each kernel launch carefully, and explicitly expose concurrent execution on the device and between the device and host, for example by using streams.
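One common way to expose concurrency between host and device is to split the work into chunks and issue each chunk in its own CUDA stream. The sketch below assumes four streams, a hypothetical `addOne` kernel, and pinned host memory, which asynchronous copies need in order to overlap with kernel execution. Whether the copies and kernels actually overlap depends on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: adds 1 to every element of a chunk.
__global__ void addOne(float *data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main() {
    const int numStreams = 4;        // assumed number of chunks
    const int chunk = 1 << 18;       // elements per chunk
    const int n = numStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *host = nullptr, *device = nullptr;
    cudaMallocHost(&host, n * sizeof(float));  // pinned host memory
    cudaMalloc(&device, n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        host[i] = 0.0f;
    }

    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
    }

    // Each stream copies its chunk in, processes it, and copies it back.
    // Work in different streams may overlap on the device.
    for (int s = 0; s < numStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(device + offset, host + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(device + offset, chunk);
        cudaMemcpyAsync(host + offset, device + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 1.0f) ok = false;
    }
    printf("%s\n", ok ? "all chunks processed" : "mismatch");

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamDestroy(streams[s]);
    }
    cudaFree(device);
    cudaFreeHost(host);
    return ok ? 0 : 1;
}
```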

Optimizing Memory Use

Optimizing memory use helps you get the most out of the available memory bandwidth. Start by minimizing data transfers between your host and device, because they have much lower bandwidth than transfers within the device. Then, make as much use as possible of on-chip shared memory to reduce kernel accesses to global memory, which is slower to access.

You should also organize your memory accesses around the most efficient access pattern for each memory type, because effective bandwidth can vary widely with the access pattern. For global memory, for example, accesses are most efficient when neighboring threads read or write neighboring addresses.
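A matrix transpose is a frequently used illustration of both points. A naive transpose has to read or write global memory with a large stride. The sketch below instead stages a tile in shared memory so that both the global read and the global write touch consecutive addresses. The kernel name `transposeTiled`, the 32x32 tile size, and the matrix dimensions are assumptions for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32  // assumed tile size; launched with TILE x TILE threads per block

// Hypothetical example kernel: out (width x height) = transpose of in (height x width).
__global__ void transposeTiled(const float *in, float *out, int width, int height) {
    // One extra column of padding to avoid shared memory bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    // Read phase: consecutive threads read consecutive addresses of a row of in.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    }
    __syncthreads();

    // Write phase: swap the block indices so that consecutive threads
    // write consecutive addresses of a row of out.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}

int main() {
    const int width = 100, height = 70;  // deliberately not multiples of TILE
    const size_t bytes = (size_t)width * height * sizeof(float);

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, bytes);
    cudaMallocManaged(&out, bytes);
    for (int i = 0; i < width * height; ++i) {
        in[i] = (float)i;
    }

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(in, out, width, height);
    cudaDeviceSynchronize();

    bool ok = true;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (out[x * height + y] != in[y * width + x]) ok = false;
        }
    }
    printf("%s\n", ok ? "transpose verified" : "mismatch");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```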

Optimizing Instruction Use

Optimizing your instruction use can help you increase your instruction throughput. To achieve this, try to avoid using arithmetic instructions with low throughput, such as those that prioritize precision over speed.

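For example, you should prefer single-precision over double-precision arithmetic and intrinsic functions over regular functions, provided it doesn’t affect your end result. Intrinsics are faster but less accurate than their regular counterparts. You should also be mindful of your control flow instructions and make sure they are optimized for single instruction multiple thread (SIMT) devices, for example by avoiding branches that send threads of the same warp down different paths.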
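The trade-off between intrinsics and regular functions can be observed directly. The sketch below computes a sigmoid twice, once with the regular `expf` function and once with the `__expf` intrinsic, and prints the largest difference between the two results. The kernel names and the input range are assumptions for the example; whether the loss of accuracy is acceptable depends on your application. Both kernels use `float` literals such as `1.0f` so that the arithmetic stays in single precision.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: sigmoid using the regular single-precision function.
__global__ void sigmoidPrecise(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = 1.0f / (1.0f + expf(-in[i]));
    }
}

// Hypothetical example kernel: same computation with the faster, less accurate intrinsic.
__global__ void sigmoidFast(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = 1.0f / (1.0f + __expf(-in[i]));
    }
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *in = nullptr, *precise = nullptr, *fast = nullptr;
    cudaMallocManaged(&in, bytes);
    cudaMallocManaged(&precise, bytes);
    cudaMallocManaged(&fast, bytes);

    // Inputs spread over the assumed range [-4, 4).
    for (int i = 0; i < n; ++i) {
        in[i] = -4.0f + 8.0f * (float)i / (float)n;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    sigmoidPrecise<<<blocks, threads>>>(in, precise, n);
    sigmoidFast<<<blocks, threads>>>(in, fast, n);
    cudaDeviceSynchronize();

    // Compare the two results on the CPU.
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i) {
        maxDiff = fmaxf(maxDiff, fabsf(precise[i] - fast[i]));
    }
    printf("largest difference between expf and __expf results: %g\n", maxDiff);

    cudaFree(in);
    cudaFree(precise);
    cudaFree(fast);
    return 0;
}
```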


Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed, incorporating CUDA.

Here are some of the capabilities you gain when using Run:AI:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:AI GPU virtualization platform.

Learn More About NVIDIA CUDA

CUDA Programming: An In-Depth Look

Compute unified device architecture (CUDA) programming enables you to leverage parallel computing technologies developed by NVIDIA. The CUDA platform and application programming interface (API) are particularly helpful for implementing general purpose computing on graphics processing units (GPGPU). The interface is based on C/C++, but allows you to use other programming languages and frameworks.

Learn what CUDA programming is and how to leverage this programming model for implementing general purpose computing on graphics processing units (GPGPU). 

Read more: CUDA Programming: An In-Depth Look

CUDA vs OpenCL: Which One to Use in Your Project?

CUDA serves as a platform for parallel computing, as well as a programming model. Open Computing Language (OpenCL) serves as an independent, open standard for cross-platform parallel programming. 

Learn the difference between CUDA and OpenCL in terms of hardware, OS support, community, and programming model, and understand which is right for your project.

Read more: CUDA vs OpenCL: Which One to Use in Your Project?

NVIDIA cuDNN: Fine-Tuning GPU Performance for Neural Networks

NVIDIA CUDA Deep Neural Network (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of standard routines, including forward and backward convolution, pooling, normalization, and activation layers.

Learn how NVIDIA CUDA Deep Neural Network (cuDNN) works, what its key features are, and get a quick tutorial on installing it on your local machine.

Read more: NVIDIA cuDNN: Fine-Tuning GPU Performance for Neural Networks

See Our Additional Guides on Key AI Technology Topics

Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of AI Technology.

Deep Learning for Computer Vision

Deep Learning GPU

Machine Learning Engineer


Multi GPU


HPC Clusters