NVIDIA DALI: The Basics and a Quick Tutorial

What Is the NVIDIA Data Loading Library (DALI)?

NVIDIA DALI, short for NVIDIA Data Loading Library, is an open-source library developed by NVIDIA that aims to expedite and optimize the process of data preparation for deep learning models that process images, video, or audio. It reduces the time spent in preparing data, including tasks like loading rich media, cropping and resizing it, and augmenting it by generating multiple variations of the same training data.

DALI offloads data augmentation to the GPU, enabling a faster and more efficient pipeline that can handle large amounts of data with high performance. Equipped with a flexible API, this library enables developers to build highly customized data loading pipelines that can cater to a wide range of deep learning applications.

You can get DALI from the official GitHub repository.


This is part of a series of articles about AI open source projects.

In this article:

Key Features of NVIDIA DALI
NVIDIA DALI Operators
Quick Tutorial: Getting Started with NVIDIA DALI
Managing AI Infrastructure with Run:ai

Key Features of NVIDIA DALI

Let's look into DALI’s key features and see why it's a handy tool for deep learning.

Rapid Prototyping

With its Python-based API, NVIDIA DALI allows for quick and easy prototyping, enabling developers to experiment with different data loading and augmentation techniques without having to write lengthy, complex code. This means that you can iterate faster, validate your ideas quicker, and ultimately speed up the development process.

NVIDIA DALI's API allows you to easily adapt the data pipeline to suit the specific needs of your deep learning models. Whether you're working on image classification, object detection, or another computer vision task, NVIDIA DALI enables you to quickly develop and test your data loading pipeline.

GPU Acceleration

By offloading data augmentation tasks to the GPU, NVIDIA DALI relieves the CPU, which is often the bottleneck in data preprocessing, and helps keep the GPU that trains the model supplied with data. This results in a more efficient pipeline that can handle larger datasets and more complex models without slowing down.

NVIDIA DALI's GPU acceleration feature is fully compatible with NVIDIA's other deep learning tools, such as CUDA and cuDNN, enabling you to build a fully GPU-accelerated deep learning pipeline.
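As a rough illustration of what GPU offloading looks like in code, the sketch below assumes a directory of JPEG images laid out the way fn.readers.file expects (one subdirectory per class). The function name make_gpu_pipeline, the 224x224 target size, and the default arguments are illustrative choices rather than anything prescribed by DALI. DALI is imported inside the function so the snippet can be loaded on a machine where DALI is not installed.

```python
def make_gpu_pipeline(image_dir, batch_size=8, num_threads=2, device_id=0):
    """Return an unbuilt DALI pipeline that decodes and resizes images on the GPU."""
    # Imported lazily so this module can be loaded without DALI installed.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn

    @pipeline_def
    def gpu_pipeline():
        # File reading happens on the CPU.
        jpegs, labels = fn.readers.file(file_root=image_dir)
        # "mixed" takes CPU input and produces GPU output, so decoding is
        # accelerated and the decoded images stay in GPU memory.
        images = fn.decoders.image(jpegs, device="mixed")
        # The operator runs on the GPU because its input is already there.
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels

    return gpu_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```

After building and running such a pipeline, the image output is a TensorListGPU rather than a TensorListCPU; calling as_cpu() on it copies the batch back to host memory for inspection.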

Custom Pipelines

Rather than imposing a fixed pipeline structure, NVIDIA DALI lets developers compose their own pipelines from individual operators. This gives you control over each stage of the data loading process, so you can tailor the pipeline to the specific requirements of your deep learning models.

Whether you need to perform complex data transformations, handle heterogeneous data types, or deal with large datasets, NVIDIA DALI gives you the flexibility to design a pipeline to handle these tasks. DALI's custom pipelines are also built to be efficient, leveraging GPU power to carry out data augmentation tasks quickly.

Data Support

NVIDIA DALI supports a wide range of data formats, from standard formats such as JPEG and PNG, to specialized formats like TFRecord and RecordIO. This means that regardless of the type of data you're using, you can easily incorporate it into your deep learning pipeline.

DALI also supports a variety of data sources, including local storage, network storage, and cloud storage. This means that you can easily access and process your data, even when stored in different locations.
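To make the format support more concrete, here is a minimal sketch of reading image and label pairs from a TFRecord file. It assumes the file stores JPEG bytes under "image/encoded" and an integer label under "image/class/label"; these feature keys are assumptions and must match however your file was written. The TFRecord reader also needs an index file, which DALI's tfrecord2idx utility can generate. The function name and default arguments are hypothetical.

```python
def make_tfrecord_pipeline(tfrecord_path, index_path, batch_size=8,
                           num_threads=2, device_id=0):
    """Return an unbuilt pipeline that reads JPEG/label pairs from a TFRecord file."""
    # Imported lazily so this module can be loaded without DALI installed.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.tfrecord as tfrec

    @pipeline_def
    def tfrecord_pipeline():
        # The feature keys are assumptions; adjust them to your dataset.
        inputs = fn.readers.tfrecord(
            path=tfrecord_path,
            index_path=index_path,
            features={
                "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
                "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
            },
        )
        # The reader returns a dictionary keyed by feature name.
        images = fn.decoders.image(inputs["image/encoded"], device="cpu")
        return images, inputs["image/class/label"]

    return tfrecord_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```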

NVIDIA DALI Operators

DALI offers a number of operators that are useful for specific data preparation tasks.

Image Processing

NVIDIA DALI's image processing operators provide a range of functionalities, from basic operations like cropping, resizing and rotating images, to more advanced operations like color space conversion, brightness and contrast adjustment, and image normalization.

When working with images, the quality and speed of processing can greatly impact the effectiveness of a deep learning model. DALI's image processing operators use GPUs to perform complex image transformations at high speeds, without sacrificing the quality of the output. This makes them suited for tasks like image classification, object detection, and semantic segmentation.

DALI's image processing operators are not limited to 2D images. They also support 3D images, making them suitable for tasks in fields like medical imaging and volumetric data analysis.
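The sketch below chains several of the image operators mentioned above: resize, rotate, brightness and contrast adjustment, and a combined crop, mirror, and normalize step. The specific sizes, the rotation range, and the mean and standard deviation values (the commonly used ImageNet statistics scaled to the 0-255 range) are illustrative assumptions, as is the function name.

```python
def make_augmentation_pipeline(image_dir, batch_size=8, num_threads=2, device_id=0):
    """Return an unbuilt pipeline that decodes, augments, and normalizes images."""
    # Imported lazily so this module can be loaded without DALI installed.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.types as types

    @pipeline_def
    def augmentation_pipeline():
        jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True)
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        images = fn.resize(images, resize_x=256, resize_y=256)
        # Rotation angle in degrees, drawn independently for every sample.
        angle = fn.random.uniform(range=(-10.0, 10.0))
        images = fn.rotate(images, angle=angle, fill_value=0, keep_size=True)
        images = fn.brightness_contrast(images, brightness=1.2, contrast=0.9)
        # Center-crop to 224x224, flip about half of the samples horizontally,
        # normalize per channel, and convert HWC uint8 to CHW float.
        images = fn.crop_mirror_normalize(
            images,
            dtype=types.FLOAT,
            output_layout="CHW",
            crop=(224, 224),
            mirror=fn.random.coin_flip(probability=0.5),
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )
        return images, labels

    return augmentation_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```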

Audio Processing

NVIDIA DALI's audio processing operators provide a set of tools for manipulating and transforming audio data. These include operations like audio decoding, resampling, and extraction of audio features such as Mel-frequency cepstral coefficients (MFCCs) and spectrograms.

Unlike images, which are typically represented as 2D or 3D arrays of pixel values, audio data is represented as 1D arrays of sample values. This difference in representation, combined with the temporal nature of audio data, makes audio processing a complex task. DALI's audio processing operators are designed to handle these complexities.

These operators can be used in conjunction with deep learning frameworks such as TensorFlow and PyTorch, providing a seamless pipeline for audio-based deep learning tasks. This makes them useful for tasks like speech recognition and music classification.
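As a hedged sketch of an audio feature pipeline, the code below decodes audio files, computes a spectrogram, maps it to the mel scale, converts it to decibels, and derives MFCCs. The 16 kHz sample rate, window sizes, and filter counts are typical speech-processing values chosen for illustration, not values taken from this article. The function name and defaults are hypothetical.

```python
def make_audio_pipeline(audio_dir, batch_size=8, num_threads=2, device_id=0):
    """Return an unbuilt pipeline that turns audio files into MFCC features."""
    # Imported lazily so this module can be loaded without DALI installed.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.types as types

    @pipeline_def
    def audio_pipeline():
        encoded, labels = fn.readers.file(file_root=audio_dir)
        # Decode to mono float samples and resample to 16 kHz.
        audio, _sample_rate = fn.decoders.audio(
            encoded, dtype=types.FLOAT, downmix=True, sample_rate=16000
        )
        # 25 ms windows with a 10 ms step at 16 kHz.
        spectrogram = fn.spectrogram(
            audio, nfft=512, window_length=400, window_step=160
        )
        mel = fn.mel_filter_bank(spectrogram, sample_rate=16000, nfilter=40)
        log_mel = fn.to_decibels(
            mel, multiplier=10.0, reference=1.0, cutoff_db=-80.0
        )
        mfcc = fn.mfcc(log_mel, n_mfcc=13)
        return mfcc, labels

    return audio_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```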

Video Processing

Video data is arguably the most complex of the three types of data discussed here, as it combines the spatial dimensions of image data with the temporal dimension of audio data. DALI's video processing operators provide an efficient solution for video data preparation.

With these operators, you can perform a variety of operations on video data, including video decoding, frame extraction, and temporal and spatial transformations. These operations are executed on the GPU, resulting in high-speed processing without compromising the quality of the output.
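A minimal sketch of video loading might look like the following. It assumes video_files is a list of paths to video files that the GPU decoder can handle; the sequence length, target resolution, and function name are illustrative.

```python
def make_video_pipeline(video_files, sequence_length=16, batch_size=2,
                        num_threads=2, device_id=0):
    """Return an unbuilt pipeline that yields fixed-length frame sequences."""
    # Imported lazily so this module can be loaded without DALI installed.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn

    @pipeline_def
    def video_pipeline():
        # Decoding runs on the GPU; each sample is a sequence of
        # `sequence_length` consecutive frames.
        frames = fn.readers.video(
            device="gpu",
            filenames=video_files,
            sequence_length=sequence_length,
            name="VideoReader",
        )
        # Spatial transformation applied to every frame of each sequence.
        frames = fn.resize(frames, resize_x=224, resize_y=224)
        return frames

    return video_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```

Because no labels are passed to the reader in this sketch, the pipeline has a single output containing the frame sequences.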

Like NVIDIA DALI's image and audio processing operators, the video processing operators also integrate with deep learning frameworks such as PyTorch and TensorFlow. This means you can use them to prepare video data for tasks like action recognition and scene understanding, directly feeding the processed data into your deep learning model.
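For PyTorch, one way to feed pipeline output into a training loop is the iterator in DALI's PyTorch plugin. The sketch below assumes a pipeline with two outputs (data and labels) whose reader operator was created with name="Reader", and whose samples all have the same shape within a batch (for example, after a resize). The helper name make_torch_loader and the output names "data" and "label" are illustrative.

```python
def make_torch_loader(pipeline, reader_name="Reader"):
    """Wrap a two-output DALI pipeline in an iterator that yields PyTorch tensors."""
    # Imported lazily so this module can be loaded without DALI or PyTorch.
    from nvidia.dali.plugin.base_iterator import LastBatchPolicy
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    return DALIGenericIterator(
        [pipeline],
        output_map=["data", "label"],  # names for the two pipeline outputs
        reader_name=reader_name,       # lets the iterator query the epoch size
        last_batch_policy=LastBatchPolicy.PARTIAL,  # do not pad the final batch
    )
```

Iterating over the returned object yields a list with one dictionary per pipeline, so a training loop would read batch[0]["data"] and batch[0]["label"].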

Quick Tutorial: Getting Started with NVIDIA DALI

Let’s look at how to start using NVIDIA DALI. The code below was shared in the NVIDIA DALI documentation.

Setting Up

Before you start:

You need a Linux x64 system with an NVIDIA GPU.
Your NVIDIA driver must support CUDA 11.0 or newer, which means driver release 450.80.02 or later.
For DALI builds based on CUDA 12, install the CUDA Toolkit; it is required because these builds link to it dynamically. For builds based on CUDA 11, the toolkit is optional.
If needed, you can install one of the following deep learning frameworks, which DALI integrates with: MXNet, PyTorch, TensorFlow, PaddlePaddle, or JAX.

Installing DALI

Use pip, the package installer for Python, to install the latest DALI that matches your CUDA version. Please consult NVIDIA’s support matrix to ensure that your platform is compatible.

For CUDA 11.0:

Run this command:

pip install nvidia-dali-cuda110

For CUDA 12.0:

Run the following command:


pip install nvidia-dali-cuda120

Installing the DALI TensorFlow plugin

Out of the box, DALI doesn’t include prebuilt versions of the DALI TensorFlow plugin. You must install it as a separate package, which is built against the version of TensorFlow you have installed:

For CUDA 11.0:

Run this command:


pip install nvidia-dali-tf-plugin-cuda110

For CUDA 12.0:

Run the following command:


pip install nvidia-dali-tf-plugin-cuda120

Installing this package also installs any prerequisites (such as nvidia-dali-cudaXXX) that aren't already present. However, make sure you install the tensorflow-gpu package before you try to install nvidia-dali-tf-plugin-cudaXXX.
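To confirm that the installation worked, a small check like the one below can help. It is a sketch using only the standard library plus DALI's own __version__ attribute; the helper name dali_status is hypothetical.

```python
import importlib.util


def dali_status():
    """Return the installed DALI version, or 'not installed' if it is missing."""
    # Check the parent package first: find_spec("nvidia.dali") raises
    # ModuleNotFoundError when the "nvidia" package itself is absent.
    if (importlib.util.find_spec("nvidia") is None
            or importlib.util.find_spec("nvidia.dali") is None):
        return "not installed"
    import nvidia.dali
    return nvidia.dali.__version__


print("DALI:", dali_status())
```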

Defining a Pipeline

The idea of a data processing pipeline is central to handling data with DALI. A pipeline is a directed graph of operations, held in an instance of the nvidia.dali.Pipeline class. This class provides the functions needed to construct and execute data processing pipelines.

Start by importing the Pipeline class in your code:


from nvidia.dali.pipeline import Pipeline

Next, let’s define a straightforward pipeline for a classification task. Consider a task that identifies whether an image contains a dog or a cat. See the example data folder in the official DALI repo.

The basic pipeline will retrieve and decode these images from the directory and return (image, label) pairs.

You can set up a pipeline by using the pipeline_def decorator. This decorator allows you to establish the computations and their sequence within the simple_pipeline function.

Use fn.readers.file to read the encoded images (JPEGs) and their labels from disk, and fn.decoders.image to decode the images from JPEG to RGB.

Next, define which interim variables should be the output of the pipeline:


from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

image_dir = "data/images"
max_batch_size = 8

@pipeline_def
def simple_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir)
    images = fn.decoders.image(jpegs, device="cpu")

    return images, labels

Building the Pipeline

To use the pipeline we specified above, called simple_pipeline, we’ll have to build it. This involves invoking simple_pipeline, which results in a pipeline instance:


pipe = simple_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0)
pipe.build()

Note that when a function has the pipeline_def decorator, it adds new named arguments. These arguments can fine-tune elements of the pipeline, such as the maximum batch size, the thread count for CPU-based computation, GPU device selection, and a seed for random number generation.
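These pipeline-level arguments can also be given to the decorator itself, with the remaining ones supplied when the pipeline is instantiated. The sketch below illustrates this with a fixed seed so that shuffling is reproducible between runs; the seed value, the function names, and the choice of arguments are illustrative.

```python
def make_seeded_pipeline(image_dir, device_id=0):
    """Return an unbuilt, reproducibly shuffled image pipeline."""
    # Imported lazily so this module can be loaded without DALI installed.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn

    # Pipeline arguments passed to the decorator act as defaults
    # for every instance created from the decorated function.
    @pipeline_def(batch_size=8, num_threads=2, seed=42)
    def shuffled_pipeline():
        jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True)
        images = fn.decoders.image(jpegs, device="cpu")
        return images, labels

    # Arguments not set in the decorator are supplied at call time.
    return shuffled_pipeline(device_id=device_id)
```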

Running the Pipeline

After the pipeline is built, we can run this command to get a batch of results:


pipe_out = pipe.run()
print(pipe_out)

The output looks like this (abbreviated):



(TensorListCPU(
    [[[[255 255 255]
      [255 255 255]
      ...
      [ 86  46  55]
      [ 86  46  55]]

     [[255 255 255]
      [255 255 255]
      ...
      [ 86  46  55]
      [ 86  46  55]]
      ...
      [113 123  88]
      [104 116  80]]]],
    dtype=DALIDataType.UINT8,
    layout="HWC",
    num_samples=8,
    shape=[(427, 640, 3),
           (427, 640, 3),
           (425, 640, 3),
           (480, 640, 3),
           (485, 640, 3),
           (427, 640, 3),
           (409, 640, 3),
           (427, 640, 3)]), TensorListCPU(
    [[0]
     [0]
     [0]
     [0]
     [0]
     [0]
     [0]
     [0]],
    dtype=DALIDataType.INT32,
    num_samples=8,
    shape=[(1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,)]))

The output of the pipeline is a tuple of two elements, matching the two values returned by simple_pipeline. Both elements are TensorListCPU objects, each containing a list of CPU tensors.

Let’s unpack the tuple and check the shape and contents of the returned labels:


images, labels = pipe_out
print(labels)

The output:

TensorListCPU(
    [[0]
     [0]
     [0]
     [0]
     [0]
     [0]
     [0]
     [0]],
    dtype=DALIDataType.INT32,
    num_samples=8,
    shape=[(1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,)])
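For a more compact view than printing whole tensor lists, a small helper can summarize each sample. The sketch below relies only on len() and the at() accessor of the tensor lists returned by pipe.run(); the helper name summarize_batch is hypothetical.

```python
def summarize_batch(pipe_out):
    """Return one (image_shape, label) tuple per sample in an (images, labels) batch."""
    images, labels = pipe_out
    summary = []
    for i in range(len(images)):
        image = images.at(i)          # array for sample i, in HWC layout
        label = int(labels.at(i)[0])  # each label tensor has shape (1,)
        summary.append((tuple(image.shape), label))
    return summary
```

Calling summarize_batch(pipe_out) on the batch above would return a list of eight entries, one per image, each pairing the image shape with its integer label.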

In order to see the images, we can loop over all tensors contained in TensorList. This can be done with matplotlib as follows:


import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt

# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline

def show_images(image_batch):
    columns = 4
    rows = (max_batch_size + 1) // (columns)
    fig = plt.figure(figsize=(24, (24 // columns) * rows))
    gs = gridspec.GridSpec(rows, columns)
    for j in range(rows * columns):
        plt.subplot(gs[j])
        plt.axis("off")
        plt.imshow(image_batch.at(j))

show_images(images)
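The row calculation in show_images works for a batch of 8 images in 4 columns, but for other batch sizes it can show too few images or index past the end of the batch. As a hedged generalization, the helpers below compute the grid from the actual number of images using ceiling division; the names grid_shape and grid_cells are hypothetical.

```python
import math


def grid_shape(num_images, columns=4):
    """Return (rows, columns) for a grid large enough to hold num_images."""
    if num_images < 1 or columns < 1:
        raise ValueError("num_images and columns must be positive")
    rows = math.ceil(num_images / columns)
    return rows, columns


def grid_cells(num_images, columns=4):
    """Return the (row, column) position of each image, in batch order."""
    return [divmod(i, columns) for i in range(num_images)]


print(grid_shape(8))  # (2, 4)
print(grid_shape(6))  # (2, 4): the second row is only half full
```

Inside show_images, this would mean sizing the grid with grid_shape(len(image_batch)) and looping over range(len(image_batch)) rather than over every cell of the grid.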

The result is a grid showing the eight decoded images from the batch, arranged in two rows of four.

Managing AI Infrastructure with Run:ai

As an AI developer, you will need to manage large-scale computing architecture to train and deploy AI models. Run:ai automates resource management and orchestration for AI infrastructure. With Run:ai, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.