CUDA Programming: An In-Depth Look

What is CUDA Programming?

Compute Unified Device Architecture (CUDA) programming enables you to leverage parallel computing technology developed by NVIDIA. The CUDA platform and application programming interface (API) are particularly helpful for implementing general-purpose computing on graphics processing units (GPUs). The interface is based on C/C++, but allows you to use other programming languages and frameworks. 

CUDA Programming Model

The CUDA platform provides direct access to the GPU's instruction set and parallel computational elements. CUDA's interface is based on C/C++, but wrappers and bindings make it usable from other programming languages, and related frameworks such as OpenCL and HIP offer similar programming models. 

The CUDA programming model lets you write a scalar program, as if for a single thread, while the platform handles the parallel allocation of GPU resources. To leverage the hardware's built-in parallelism, the CUDA compiler relies on programming abstractions. 

There are three key language extensions CUDA programmers can use—CUDA blocks, shared memory, and synchronization barriers. CUDA blocks contain a collection of threads. A block of threads can share memory, and a synchronization barrier lets threads pause until all threads in the block reach a specified point of execution.
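These three extensions often appear together, for example in a block-level sum. The sketch below is illustrative (the kernel name and block size are assumptions, not from the article): each thread loads one element into shared memory, a barrier waits for all loads, and the block cooperatively reduces the values.

```cuda
// Illustrative sketch: block-wide sum using shared memory and barriers.
// BLOCK_SIZE is an assumed compile-time block size (a power of two).
#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *out, int n) {
  __shared__ float tile[BLOCK_SIZE];           // memory shared by the block's threads
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // each thread loads one element
  __syncthreads();                             // barrier: wait until all loads finish

  // Tree reduction within the block; a barrier separates each step.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      tile[threadIdx.x] += tile[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0)
    out[blockIdx.x] = tile[0];                 // one partial sum per block
}
```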

Related content: read our guide to CUDA NVIDIA.

CUDA Programming Model: A Code Example

The code below shows a CUDA kernel that adds vectors A and B and writes their sum into vector C. Although the kernel operates on entire vectors, each thread processes a single scalar element, which makes massive parallelism simple to express. When run on a GPU, each vector element is handled by one thread, and all threads in a CUDA block run independently and in parallel. 

/**
 * CUDA kernel device code - CUDA Sample Codes
 * Computes the vector addition of A and B into C. The three vectors
 * have the same number of elements, numElements.
 */
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < numElements) {
    C[i] = A[i] + B[i];
  }
}
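A kernel does nothing until the host launches it. Here is a minimal sketch of the host-side launch, assuming d_A, d_B, and d_C already point to device memory and numElements holds the vector length; the block size of 256 is a common but arbitrary choice.

```cuda
// Illustrative host-side launch for the vectorAdd kernel above.
int threadsPerBlock = 256;  // assumed choice; tune for your hardware
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

// <<<grid, block>>> is CUDA's kernel-launch syntax. One thread is
// created per vector element (plus padding in the last block).
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
cudaDeviceSynchronize();    // wait for the kernel to finish
```

Rounding the grid size up is why the kernel's `if (i < numElements)` guard matters: threads in the final block whose index falls past the end of the vectors simply do nothing.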

How a CUDA Program Works

The CUDA programming model enables your software to scale as the number of GPU processor cores grows. Using CUDA's language abstractions, you can divide a program into small, independent problems. 

You can further break down these problems into smaller pieces of code without interrupting execution, because the parallel threads inside each CUDA block continue executing and cooperating. The CUDA runtime determines the schedule and order in which CUDA blocks run on the multiprocessors, which lets a CUDA program run on any number of multiprocessors. 

This process is visualized in figure 1 below, which shows a CUDA program compiled into eight CUDA blocks. The figure shows how the CUDA runtime chooses to allocate blocks to streaming multiprocessors (SMs): a small GPU with four SMs runs two CUDA blocks per SM, while a larger GPU with eight SMs runs one CUDA block per SM. This type of allocation delivers performance scalability without any modification to the code. 

Figure 1: Allocation of CUDA blocks to streaming multiprocessors (Image Source: NVIDIA)

CUDA Program Structure

Typically, a CUDA program contains code instructions for both the GPU and the CPU: device code is embedded alongside standard C host code in a single source file. In this structure, the CPU is referred to as the host and the GPU as the device. 

You need to use different compilers for each. Here’s what you can do: 

  • Host code can be compiled with a traditional C compiler such as GCC.
  • Device code requires a special compiler that understands the CUDA API functions. For NVIDIA GPUs, this is the NVIDIA CUDA Compiler (NVCC).

The NVCC compiler separates the host code from the device code by looking for specific CUDA keywords. Device code is marked with keywords that label data-parallel functions, called ‘kernels’. Once NVCC identifies these keywords, it compiles the device code for execution on the GPU.
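The keywords NVCC looks for are function-type qualifiers. A brief sketch of the three main ones (the function names here are illustrative):

```cuda
// Runs on the device; launched from the host with <<<...>>> syntax.
__global__ void kernel(float *data) { data[threadIdx.x] *= 2.0f; }

// Callable only from device code (e.g. from inside a kernel).
__device__ float helper(float x) { return x * 2.0f; }

// Compiled twice: once for the host, once for the device.
__host__ __device__ float both(float x) { return x + 1.0f; }
```

Everything without a qualifier defaults to `__host__` and is handled by the ordinary C/C++ compiler.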

Figure 2: CUDA program structure (Image Source: tutorialspoint)

Execution of a CUDA C Program

When you write a CUDA program, you define the number of threads you want to launch; there is no hard limit, but you should choose this number wisely. Threads are grouped into blocks, and blocks are grouped into grids of up to three dimensions. Each thread is assigned a unique identifier, which determines which data it processes.
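The grid and block shapes are expressed with the `dim3` type, and each thread recovers its identifier from the built-in `blockIdx`, `blockDim`, and `threadIdx` variables. A hedged sketch for a 2D problem such as an image (the kernel name, `width`, and `height` are assumptions):

```cuda
// Illustrative 2D launch configuration: 16x16 = 256 threads per block.
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
processImage<<<grid, block>>>(d_pixels, width, height);

// Inside the kernel, each thread computes its unique (x, y) coordinate:
// int x = blockIdx.x * blockDim.x + threadIdx.x;
// int y = blockIdx.y * blockDim.y + threadIdx.y;
```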

Each GPU typically contains built-in global memory, called dynamic random access memory (DRAM), or device memory. To execute a kernel on a GPU, you need to write code that allocates separate memory on the GPU. This is achieved by using specific functions provided by the CUDA API. Here is how this sequence works:

  • Allocate memory on the device 
  • Transfer data from host memory to device memory
  • Execute the kernel on the device
  • Transfer the result back from the device memory to the host memory
  • Free the allocated memory on the device 
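The five steps above map directly onto CUDA runtime API calls. A sketch of the sequence for the vector-addition example, assuming h_A, h_B, and h_C are host arrays of numElements floats (error checking omitted for brevity):

```cuda
size_t size = numElements * sizeof(float);
float *d_A, *d_B, *d_C;

// 1. Allocate memory on the device
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

// 2. Transfer data from host memory to device memory
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// 3. Execute the kernel on the device
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

// 4. Transfer the result back from device memory to host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// 5. Free the allocated memory on the device
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```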

During this process, the host can access the device memory and transfer data to and from the device. However, the device can’t transfer data to and from the host. 

CUDA Memory Management

The CUDA program structure requires storage on two machines—the host computer running the program, and the device GPU executing the CUDA code. Each has its own memory implementing the C memory model, with a separate stack and heap. This means you need to explicitly transfer data from host to device.

In some cases, transfer means manually writing code that copies memory from one location to another. However, on supported NVIDIA hardware you can use unified memory to eliminate this manual copying and save time. This model enables you to allocate memory accessible from both CPUs and GPUs, as well as prefetch the memory to a device before use.
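With unified memory, the explicit copy steps disappear. A hedged sketch using `cudaMallocManaged` and an optional prefetch (the kernel and launch configuration are illustrative):

```cuda
float *data;
cudaMallocManaged(&data, numElements * sizeof(float)); // visible to CPU and GPU

// The host can initialize the buffer directly; no cudaMemcpy needed.
for (int i = 0; i < numElements; ++i) data[i] = 1.0f;

// Optional: prefetch to the GPU before the kernel runs to avoid page faults.
int device;
cudaGetDevice(&device);
cudaMemPrefetchAsync(data, numElements * sizeof(float), device);

someKernel<<<blocksPerGrid, threadsPerBlock>>>(data, numElements);
cudaDeviceSynchronize();  // required before the CPU reads the results

cudaFree(data);
```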

CUDA Programming with Run:AI

Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed. 

Here are some of the capabilities you gain when using Run:AI: 

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. 

Learn more about the GPU virtualization platform.