The Basics and a Quick Tutorial

What Is FasterTransformer?

A transformer is a deep learning model introduced in the paper "Attention Is All You Need". It uses self-attention mechanisms and has been adopted for a variety of tasks in natural language processing (NLP).

FasterTransformer is an open source library developed by NVIDIA that makes transformer models faster and more efficient. It is a highly optimized model library that supports transformer architectures including BERT, GPT-2, GPT-J, and T5. It aims to accelerate transformer inference, which is crucial in NLP tasks like translation, text generation, and summarization.

FasterTransformer is built on CUDA, a parallel computing platform and API model created by NVIDIA, which allows it to fully utilize the processing capabilities of NVIDIA GPUs. It also integrates with NVIDIA's NeMo (Neural Modules) framework, a toolkit for creating AI applications with a focus on conversational AI, which can use FasterTransformer as an inference backend.

Important: NVIDIA has stopped development of the FasterTransformer library. It will continue to be available, but will not be maintained. NVIDIA recommends transitioning to the TensorRT-LLM library.

For those continuing to use FasterTransformer, we’ll describe the use of the library in the remainder of this article.

This is part of a series of articles about AI open source projects.


Key Features of FasterTransformer

Optimization for NVIDIA GPUs

By leveraging the computational power of NVIDIA GPUs, FasterTransformer can significantly speed up transformer inference tasks. This optimization makes it a useful tool for researchers and developers working on NLP tasks, as it can substantially reduce the time spent on model inference.

FasterTransformer uses custom GPU kernels to perform computations that would traditionally be done on the CPU, reducing computational overhead and increasing speed. It is compatible with a range of NVIDIA GPUs, including the Tesla V100, A100, and others. This wide compatibility means developers can leverage FasterTransformer regardless of which NVIDIA GPU model they have at their disposal.

Learn more in our detailed guide to NVIDIA Modulus

Support for Various Transformer Architectures

FasterTransformer supports multiple transformer architectures, allowing developers and researchers to leverage the transformer for a range of NLP tasks.

Among the supported architectures are BERT (Bidirectional Encoder Representations from Transformers), OpenAI's GPT-2 (Generative Pretrained Transformer 2), GPT-J, BLOOM, and T5. If you need support for newer transformer models, use TensorRT-LLM.

Efficient Kernel Implementations

Kernel implementations are crucial in determining the efficiency and speed of computations in deep learning models. FasterTransformer's kernel implementation uses custom CUDA kernels: functions that run in parallel across thousands of GPU threads, each operating on a different part of the data. This massive parallelism allows for a significant speedup in computations.

FasterTransformer's kernel implementation also includes optimized implementations of various operations involved in transformer models. These include softmax, attention, and feed-forward operations. These optimizations contribute to the overall speed and efficiency of the FasterTransformer.
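As an illustration of one such operation, here is a numerically stable softmax in plain Python. This is a teaching sketch of the math that an optimized GPU kernel computes, not the library's actual CUDA implementation; subtracting the row maximum before exponentiating is the standard trick used to avoid overflow.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before
    exponentiating, as optimized GPU kernels typically do."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The same stabilization appears inside fused attention kernels, where softmax is computed without ever writing the intermediate exponentials back to global memory.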

Dynamic Sequence Lengths

In NLP tasks, the input data often consists of sequences of words or sentences that vary in length, and handling this variability efficiently is a significant challenge. FasterTransformer addresses it by supporting dynamic sequence lengths: rather than padding every sequence in a batch to the longest one and wasting computation on padding tokens, it can process inputs of varying lengths directly. This simplifies preprocessing and increases the overall efficiency of the model.
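The core idea can be sketched in a few lines of Python: concatenate variable-length sequences into one flat buffer and keep per-sequence offsets, so no compute is spent on padding tokens. The function names here are hypothetical illustrations of the technique, not FasterTransformer's API.

```python
def pack_sequences(batch):
    """Pack variable-length sequences into one flat buffer plus offsets,
    avoiding padding entirely. (Illustrative sketch, not the FT API.)"""
    flat, offsets = [], [0]
    for seq in batch:
        flat.extend(seq)
        offsets.append(len(flat))
    return flat, offsets

def unpack_sequences(flat, offsets):
    """Recover the original batch of sequences from the packed buffer."""
    return [flat[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
```

On a GPU, the packed layout lets every thread do useful work; the offsets tell attention kernels where each sequence begins and ends.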

Optimization Techniques Used in FasterTransformer

There are several techniques used to optimize NLP models in FasterTransformer.

Layer Fusion

Layer fusion is a method used to reduce the time and computational resources needed to run deep learning models. In the context of FasterTransformer, it involves merging multiple layers into a single, more efficient layer.

In the conventional transformer model, there is a significant amount of time spent on moving data between the GPU's global memory and its shared memory. This memory transfer process is both time-consuming and inefficient. However, with layer fusion, FasterTransformer can merge several layers, thus reducing the number of times data needs to be moved.

This technique, in addition to accelerating computation, also saves memory. By fusing layers, FasterTransformer can reduce the need for intermediate data storage, leading to a significant reduction in memory usage. This approach allows it to handle larger models and datasets.
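A minimal Python sketch makes the memory-traffic argument concrete. The unfused version writes an intermediate buffer (one extra round trip through memory) while the fused version computes bias-add and GELU activation in a single pass; in FasterTransformer the equivalent fusion happens inside one CUDA kernel rather than two.

```python
import math

def gelu(x):
    # tanh approximation of GELU, common in transformer implementations
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def add_bias_gelu_unfused(xs, bias):
    tmp = [x + bias for x in xs]   # pass 1: intermediate written to memory
    return [gelu(t) for t in tmp]  # pass 2: intermediate read back

def add_bias_gelu_fused(xs, bias):
    # single pass: no intermediate buffer, half the memory traffic
    return [gelu(x + bias) for x in xs]
```

Both versions return identical results; the fused one simply skips the round trip through the intermediate buffer, which is where the savings come from on a GPU.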

Multi-Head Attention Acceleration

The attention mechanism is a critical component of transformer models. It allows the model to focus on different parts of the input sequence when generating output, thus improving the accuracy of predictions.

However, conventional multi-head attention mechanisms can be computationally expensive. To overcome this, FasterTransformer uses multi-head attention acceleration. This involves optimizing the processing of the attention mechanism, speeding up calculations without sacrificing the model's predictive accuracy.
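For reference, here is single-head scaled dot-product attention written out in plain Python. It shows the computation being accelerated (scores, softmax weights, weighted sum of values); FasterTransformer's fused kernels compute the same result across many heads in parallel without materializing the intermediate score matrix in global memory.

```python
import math

def attention(Q, K, V):
    """Single-head scaled dot-product attention over Python lists.
    Illustrative of the math only, not an optimized implementation."""
    d = len(Q[0])
    out = []
    for q in Q:
        # attention scores, scaled by sqrt(head dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # numerically stable softmax over the scores
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Multi-head attention runs this computation once per head on different learned projections; that independence is exactly what makes it amenable to GPU parallelization.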

GEMM Kernel Autotuning

General Matrix Multiply (GEMM) is a fundamental operation in many deep learning models. It involves multiplying two matrices and adding the result to a third matrix. However, the speed and efficiency of GEMM operations can vary depending on the size and shape of the matrices involved.

FasterTransformer addresses this issue with GEMM kernel autotuning. This involves automatically tuning the GEMM kernel's parameters to optimize its performance for any given matrix size and shape. By doing so, FasterTransformer can ensure that every GEMM operation is as fast and efficient as possible.
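The autotuning pattern itself is simple to sketch: benchmark each candidate kernel on the actual problem size and keep the fastest. In FasterTransformer the candidates are cuBLAS GEMM algorithms profiled ahead of time; the two loop-order variants below are stand-ins for real kernel candidates, used only to illustrate the selection loop.

```python
import time

def matmul_ijk(A, B):
    """Naive triple loop (i, j, k order)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = s
    return C

def matmul_ikj(A, B):
    """Reordered loop (i, k, j order) with row-wise accumulation."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a, row, Ci = A[i][p], B[p], C[i]
            for j in range(m):
                Ci[j] += a * row[j]
    return C

def autotune(candidates, A, B, repeats=3):
    """Time each candidate on the given shapes and return the fastest."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn(A, B)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

Because the winner depends on matrix shapes, the tuning result is cached per shape so the selection cost is paid only once.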

Lower Precision

In traditional deep learning models, computations are typically performed using 32-bit floating-point numbers. However, this level of precision is often unnecessary and can lead to increased computation time and memory usage.

In contrast, FasterTransformer can use 16-bit floating-point numbers (FP16) for most of its computations, and for some models even 8-bit integer (INT8) quantization. This approach, known as lower precision computation, can significantly speed up inference. It also reduces memory usage, allowing FasterTransformer to handle larger models and batch sizes.
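The trade-off is easy to see using Python's standard library, which can round a value to IEEE 754 half precision via the `struct` module's `'e'` format. FP16 stores each value in 2 bytes instead of 4, at the cost of roughly three decimal digits of precision:

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision (FP16) and back,
    showing the precision lost by 16-bit storage."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(3.14159265))  # 3.140625 — nearest representable FP16 value
print(struct.calcsize('e'), "bytes vs", struct.calcsize('f'))  # 2 bytes vs 4
```

For inference, this small rounding error rarely affects model outputs, which is why FP16 is the default choice for accelerated transformer serving.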

To get started with FasterTransformer, see the official quick start guide in the project's GitHub repository.

Optimizing Your AI Infrastructure with Run:ai

Run:ai automates resource management and orchestration and reduces cost for the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists increase their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.