NVIDIA CUDA: Basics and Best Practices

What is CUDA from NVIDIA?

CUDA is a parallel computing platform and programming model created by NVIDIA for general-purpose computing on its graphics processing units (GPUs). It lets developers shorten compute-intensive tasks by offloading the parallelizable parts of a workload to one or more GPUs.

GPU-accelerated applications use both the central processing unit (CPU) and the GPU. The CPU, which is optimized for single-thread performance, runs the sequential parts of the workload. The parallel, multi-threaded parts run on the GPU, which then returns its results to the CPU.

In this article, you will learn:

  • CUDA in Deep Learning
  • CUDA Programming
  • CUDA Recommendations and Best Practices

CUDA in Deep Learning

Deep learning implementations require significant computing power, like that offered by GPUs. Without GPU systems, many deep learning models would take significantly longer to train, making them more costly and slowing innovation. 

For example, when training the models for Google Translate, Google used a system with about 2,000 server-grade NVIDIA GPUs to run hundreds of week-long TensorFlow training runs. On a traditional CPU-based system, each of these runs would have taken months.

Although TensorFlow is one of the most popular frameworks for deep learning, many other frameworks also rely on CUDA for GPU support. These include Torch, PyTorch, Keras, MXNet, and Caffe2.  

Most of these frameworks rely on cuDNN, NVIDIA's GPU-accelerated library of primitives for deep neural networks. Because they share this foundation, the frameworks tend to deliver broadly similar performance for equivalent workloads, and improvements to CUDA or cuDNN generally benefit all of them.

Besides cuDNN, three other GPU-accelerated libraries are widely used for deep learning: TensorRT, NCCL, and DeepStream. TensorRT is NVIDIA's high-performance optimizer and runtime for deep learning inference. NCCL provides multi-GPU and multi-node communication primitives. DeepStream is a library for video inference.

Beyond these deep learning libraries, the CUDA Toolkit includes a compiler, a runtime, debugging and optimization tools, documentation, and libraries for areas such as signal processing and parallel algorithms. The toolkit libraries are designed to work across NVIDIA's GPU product lines, although each toolkit release supports a specific range of GPU architectures.

CUDA Programming

CUDA can be used from several popular languages, including C, C++, Fortran, Python, and MATLAB. The core interface is based on C/C++: a small set of language extensions and runtime APIs lets you express parallelism, and the compiler handles much of the mapping onto the GPU.

The CUDA programming model provides three key language extensions:

  • CUDA blocks: a group or collection of threads.
  • Shared memory: memory that is shared among all the threads in a block.
  • Synchronization barriers: points at which the threads in a block wait until every thread has arrived before any of them continues.
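
To make the three extensions concrete, here is a minimal sketch of a block-level sum. It is an illustration rather than code from the original article: the names `blockSum`, `sumOnDevice`, and `kThreadsPerBlock` are hypothetical, the block size of 256 is an assumption, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreadsPerBlock = 256;  // assumed block size (must be a power of two here)

// Each block sums its slice of `in` and writes one partial sum to `out`.
// Must be launched with blockDim.x == kThreadsPerBlock.
__global__ void blockSum(const float *in, float *out, int n) {
  // Shared memory: one array per block, visible to all threads of that block.
  __shared__ float tile[kThreadsPerBlock];

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();  // barrier: every thread has stored its element

  // Tree reduction inside the block.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      tile[threadIdx.x] += tile[threadIdx.x + stride];
    }
    __syncthreads();  // barrier: finish this level before starting the next
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical host helper: returns the sum of `host`, computed mostly on the GPU.
float sumOnDevice(const std::vector<float> &host) {
  int n = static_cast<int>(host.size());
  if (n == 0) return 0.0f;
  int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

  float *dIn = nullptr, *dOut = nullptr;
  cudaMalloc(&dIn, n * sizeof(float));
  cudaMalloc(&dOut, blocks * sizeof(float));
  cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  blockSum<<<blocks, kThreadsPerBlock>>>(dIn, dOut, n);

  std::vector<float> partial(blocks);
  cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);

  float total = 0.0f;  // combine the per-block partial sums on the host
  for (float p : partial) total += p;
  return total;
}
```

Calling `sumOnDevice` from a `main` function and compiling with `nvcc` should be enough to try it. The two `__syncthreads()` calls are the barriers: without them, a thread could read a slot of `tile` before its neighbor has written it.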

One of the main benefits of the CUDA model is that it lets you write scalar programs. In the example below, a CUDA kernel adds two vectors, A and B, and writes the result to a third vector, C. The kernel code handles a single pair of elements as though they were scalar numbers. When it runs on the GPU, each thread in each CUDA block executes that code for a different element, so the additions proceed in parallel.

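// Each thread adds one pair of elements.
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;  // global index of this thread
  if (i < numElements) {  // guard threads that fall past the end of the vectors
    C[i] = A[i] + B[i];
  }
}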
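
The kernel on its own does not show how it is launched, so here is one possible host-side sketch. The wrapper name `addOnDevice` and the block size of 256 are assumptions made for illustration, the kernel is repeated so the block is self-contained, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < numElements) {
    C[i] = A[i] + B[i];
  }
}

// Hypothetical host wrapper: copies inputs to the GPU, launches the kernel,
// and copies the result back. Assumes a.size() == b.size().
std::vector<float> addOnDevice(const std::vector<float> &a, const std::vector<float> &b) {
  int n = static_cast<int>(a.size());
  std::vector<float> c(n);
  if (n == 0) return c;
  size_t bytes = n * sizeof(float);

  float *dA = nullptr, *dB = nullptr, *dC = nullptr;
  cudaMalloc(&dA, bytes);
  cudaMalloc(&dB, bytes);
  cudaMalloc(&dC, bytes);
  cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

  // Round up so that every element is covered by some thread.
  int threadsPerBlock = 256;  // assumed value; tune for your GPU
  int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
  vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);

  // This blocking copy waits for the kernel to finish first.
  cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return c;
}
```

Because the grid size is rounded up, the last block usually contains threads with no element to process, which is why the kernel checks `i < numElements`.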

The CUDA model enables you to scale your programs transparently. You divide an application into independent sub-problems, and each sub-problem is solved by one CUDA block, whose threads cooperate on the finer-grained pieces of the work. Because the blocks are independent of each other, the CUDA runtime can schedule them automatically on whichever GPU multiprocessors are available.

The diagram below shows how this works for a CUDA program defined in eight blocks. The runtime assigns the blocks to the streaming multiprocessors (SMs) of the available GPU. The diagram shows two separate cases, a GPU with four SMs and a GPU with eight, to highlight that the same blocks can be distributed across different hardware without code changes.

[Figure: a CUDA program of eight blocks scheduled on a GPU with four SMs and on a GPU with eight SMs. Image source: NVIDIA]
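
The following sketch illustrates the same idea in code, under a few assumptions: the names `scaleKernel`, `smCount`, and `scaleOnDevice` are hypothetical, 128 threads per block is an arbitrary choice, and error checking is omitted. The kernel uses a grid-stride loop, a common pattern that gives the same result for any number of blocks, and the launch code never refers to the number of SMs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2 * stride, ...
// so the result is correct however many blocks are launched.
__global__ void scaleKernel(float *data, float factor, int n) {
  int stride = gridDim.x * blockDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    data[i] *= factor;
  }
}

// Number of SMs on device 0, or 0 if the query fails. For information only:
// the launch below does not depend on it.
int smCount() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 0;
  return prop.multiProcessorCount;
}

// Hypothetical helper: multiplies every value by `factor` using `numBlocks` blocks.
std::vector<float> scaleOnDevice(std::vector<float> values, float factor, int numBlocks) {
  int n = static_cast<int>(values.size());
  if (n == 0 || numBlocks <= 0) return values;
  size_t bytes = n * sizeof(float);

  float *dData = nullptr;
  cudaMalloc(&dData, bytes);
  cudaMemcpy(dData, values.data(), bytes, cudaMemcpyHostToDevice);

  // The runtime decides which SM runs each block; the code is the same
  // on a GPU with four SMs and on a GPU with eight.
  scaleKernel<<<numBlocks, 128>>>(dData, factor, n);

  cudaMemcpy(values.data(), dData, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dData);
  return values;
}
```

For example, `scaleOnDevice(values, 2.0f, 8)` launches eight blocks, as in the diagram, whatever value `smCount()` reports for the installed GPU.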

CUDA Recommendations and Best Practices

There are a few general recommendations and best practices that can help improve your CUDA implementations. These practices focus on three main strategies—increasing parallelism, optimizing memory use, and optimizing instruction use. 

Maximizing Parallel Execution

Maximizing the parallel execution of your algorithm increases overall throughput, which in turn makes larger and more complex computations practical.

To achieve this, make sure your algorithm exposes as much parallelism as possible and that this parallelism maps efficiently onto the hardware. Choose the execution configuration of each kernel launch carefully, and explicitly expose concurrent execution both on the device (for example, through streams) and between the device and the host.
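
As a rough sketch of both points, the example below asks the runtime to suggest a block size with `cudaOccupancyMaxPotentialBlockSize` and splits the work across two streams so that copies and kernels from different halves can overlap. The kernel `addOne` and the helper `addOneInTwoStreams` are hypothetical names, the split into exactly two streams is an arbitrary choice, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void addOne(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: adds 1 to every element, processing each half of the
// buffer in its own stream.
std::vector<float> addOneInTwoStreams(const std::vector<float> &input) {
  int n = static_cast<int>(input.size());
  std::vector<float> result(n);
  if (n == 0) return result;

  // Execution configuration: let the runtime suggest a block size for this kernel.
  int minGridSize = 0, blockSize = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, addOne, 0, 0);
  if (blockSize <= 0) blockSize = 256;  // fallback if the query fails

  // Asynchronous copies only overlap with kernels when host memory is pinned.
  float *hPinned = nullptr, *dData = nullptr;
  cudaMallocHost(&hPinned, n * sizeof(float));
  cudaMalloc(&dData, n * sizeof(float));
  std::copy(input.begin(), input.end(), hPinned);

  int half = n / 2;
  int offsets[2] = {0, half};
  int counts[2] = {half, n - half};
  cudaStream_t streams[2];
  for (int s = 0; s < 2; ++s) {
    cudaStreamCreate(&streams[s]);
    if (counts[s] == 0) continue;
    size_t bytes = counts[s] * sizeof(float);
    float *h = hPinned + offsets[s];
    float *d = dData + offsets[s];
    // Work queued in one stream runs in order; the two streams may overlap.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]);
    int grid = (counts[s] + blockSize - 1) / blockSize;
    addOne<<<grid, blockSize, 0, streams[s]>>>(d, counts[s]);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]);
  }
  cudaDeviceSynchronize();  // wait for both streams to finish

  std::copy(hPinned, hPinned + n, result.begin());
  for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
  cudaFreeHost(hPinned);
  cudaFree(dData);
  return result;
}
```

Whether the two streams actually overlap depends on the GPU and the size of the data, so treat the suggested block size and the stream layout as starting points to measure rather than guaranteed wins.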

Optimizing Memory Use 

Optimizing memory use helps you get the most out of the available memory bandwidth. Start by minimizing data transfers between the host and the device, because these transfers have much lower bandwidth than transfers within the device. Then minimize kernel accesses to global memory by making as much use of on-device shared memory as possible, since global memory is considerably slower than shared memory.
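
One simple way to cut host/device traffic is to keep intermediate results on the GPU between kernels. The sketch below, with hypothetical names `squareKernel`, `addScalarKernel`, and `squarePlusOne`, chains two kernels with a single copy in and a single copy out. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void squareKernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] * x[i];
}

__global__ void addScalarKernel(float *x, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += s;
}

// Hypothetical helper: computes x * x + 1 for every element.
std::vector<float> squarePlusOne(const std::vector<float> &input) {
  int n = static_cast<int>(input.size());
  std::vector<float> result(n);
  if (n == 0) return result;
  size_t bytes = n * sizeof(float);
  int threads = 256;  // assumed block size
  int blocks = (n + threads - 1) / threads;

  float *dX = nullptr;
  cudaMalloc(&dX, bytes);
  cudaMemcpy(dX, input.data(), bytes, cudaMemcpyHostToDevice);  // one copy in

  // The intermediate result stays in device memory between the two kernels;
  // kernels in the same (default) stream run in launch order.
  squareKernel<<<blocks, threads>>>(dX, n);
  addScalarKernel<<<blocks, threads>>>(dX, 1.0f, n);

  cudaMemcpy(result.data(), dX, bytes, cudaMemcpyDeviceToHost);  // one copy out
  cudaFree(dX);
  return result;
}
```

Copying the squared values back to the host and up again between the two kernels would produce the same answer but add two transfers over the slower host/device link.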

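You should also organize your memory accesses around the most efficient access pattern for each type of memory, because the effective bandwidth you achieve can vary widely with the access pattern. For global memory, for example, accesses are fastest when consecutive threads read or write consecutive addresses (coalesced access).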
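
A matrix transpose is a common illustration of access patterns, because a naive version has to make strided accesses either when reading or when writing. The sketch below stages each tile in shared memory so that both the global reads and the global writes touch consecutive addresses. The names `transposeTiled` and `transposeOnDevice` and the tile size of 32 are assumptions, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 32;  // assumed tile width; blocks are kTile x kTile threads

// `in` is rows x cols (row-major); `out` is cols x rows.
__global__ void transposeTiled(const float *in, float *out, int rows, int cols) {
  // The extra column of padding avoids shared-memory bank conflicts.
  __shared__ float tile[kTile][kTile + 1];

  int x = blockIdx.x * kTile + threadIdx.x;  // column in `in`
  int y = blockIdx.y * kTile + threadIdx.y;  // row in `in`
  if (x < cols && y < rows) {
    tile[threadIdx.y][threadIdx.x] = in[y * cols + x];  // consecutive threads, consecutive reads
  }
  __syncthreads();  // the whole tile must be loaded before anyone reads it

  x = blockIdx.y * kTile + threadIdx.x;  // column in `out`
  y = blockIdx.x * kTile + threadIdx.y;  // row in `out`
  if (x < rows && y < cols) {
    out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // consecutive threads, consecutive writes
  }
}

// Hypothetical host helper: returns the transpose of a row-major matrix.
std::vector<float> transposeOnDevice(const std::vector<float> &in, int rows, int cols) {
  std::vector<float> out(in.size());
  if (in.empty()) return out;
  size_t bytes = in.size() * sizeof(float);

  float *dIn = nullptr, *dOut = nullptr;
  cudaMalloc(&dIn, bytes);
  cudaMalloc(&dOut, bytes);
  cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

  dim3 block(kTile, kTile);
  dim3 grid((cols + kTile - 1) / kTile, (rows + kTile - 1) / kTile);
  transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);

  cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);
  return out;
}
```

The transposition itself happens in shared memory, where a strided access costs far less than it does in global memory.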

Optimizing Instruction Use 

Optimizing your instruction use can help you increase your instruction throughput. To achieve this, try to avoid using arithmetic instructions with low throughput, such as those that prioritize precision over speed. 

For example, prefer single precision over double precision and intrinsic functions over regular functions, provided the reduced precision doesn't affect your end result. You should also pay attention to control flow instructions, because single instruction, multiple thread (SIMT) devices lose throughput when threads in the same warp take different branches.
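
To see what this trade-off looks like in practice, the sketch below runs the same computation twice, once with the standard `expf` function and once with the faster, less precise `__expf` intrinsic, and reports how far apart the results are. The kernel and helper names are hypothetical, the inputs are assumed to give nonzero results, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Regular single-precision function. Note the `0.5f` literal: writing `0.5`
// would promote the multiplication to double precision.
__global__ void halfExpPrecise(const float *x, float *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = 0.5f * expf(x[i]);
}

// Intrinsic version: __expf trades some accuracy for higher throughput.
__global__ void halfExpFast(const float *x, float *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = 0.5f * __expf(x[i]);
}

// Hypothetical helper: largest relative difference between the two kernels.
float maxRelativeDifference(const std::vector<float> &x) {
  int n = static_cast<int>(x.size());
  if (n == 0) return 0.0f;
  size_t bytes = n * sizeof(float);
  int threads = 256;  // assumed block size
  int blocks = (n + threads - 1) / threads;

  float *dX = nullptr, *dPrecise = nullptr, *dFast = nullptr;
  cudaMalloc(&dX, bytes);
  cudaMalloc(&dPrecise, bytes);
  cudaMalloc(&dFast, bytes);
  cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);

  halfExpPrecise<<<blocks, threads>>>(dX, dPrecise, n);
  halfExpFast<<<blocks, threads>>>(dX, dFast, n);

  std::vector<float> precise(n), fast(n);
  cudaMemcpy(precise.data(), dPrecise, bytes, cudaMemcpyDeviceToHost);
  cudaMemcpy(fast.data(), dFast, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dX);
  cudaFree(dPrecise);
  cudaFree(dFast);

  float worst = 0.0f;
  for (int i = 0; i < n; ++i) {
    float diff = std::fabs(precise[i] - fast[i]) / std::fabs(precise[i]);
    if (diff > worst) worst = diff;
  }
  return worst;
}
```

If the difference reported for your own input range is well below the tolerance your application needs, the intrinsic is likely a reasonable choice. Both kernels also keep their only branch uniform for almost all warps: the `i < n` check diverges only in the final, partially filled warp.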


Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed, including those that use CUDA.

Here are some of the capabilities you gain when using Run:AI: 

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. 

Learn more about the Run:AI GPU virtualization platform.