What Is TensorRT-LLM?
NVIDIA TensorRT-LLM is an open-source library for optimizing the inference performance of large language models (LLMs) on the NVIDIA AI platform. It lets developers work with the latest LLMs, providing high performance and easy customization without requiring in-depth knowledge of C++ or CUDA programming.
TensorRT-LLM is built on the TensorRT deep learning compiler and includes optimized kernels sourced from the NVIDIA FasterTransformer project. The library also provides pre- and post-processing functionality and primitives for communication across multiple GPUs and nodes, allowing it to deliver significant performance improvements for LLM inference tasks.
This is part of a series of articles about Generative AI.
NVIDIA TensorRT Benefits
Speed Up LLM Inference
NVIDIA TensorRT-based applications run inference up to 8X faster than CPU-only platforms, according to NVIDIA tests. TensorRT can optimize neural network models trained in all major frameworks.
By calibrating for lower precision while maintaining high output accuracy, applications can be deployed across a variety of platforms, from hyperscale data centers to embedded and automotive product platforms. This speed increase not only boosts efficiency but also opens up possibilities for real-time applications and services.
Optimize Inference Performance
Built on the NVIDIA CUDA® parallel programming model, TensorRT is specifically designed to optimize inference performance. It achieves this through techniques such as quantization, layer and tensor fusion, and kernel tuning, tailored for NVIDIA GPUs. These optimizations allow for streamlined processing and enhanced performance for LLM inference.
Accelerate AI Workloads
TensorRT is engineered to accelerate workloads by incorporating INT8 and 16-bit floating point (FP16) optimizations. These optimizations are crucial for deploying deep learning inference applications across a range of sectors, including video streaming, recommendation systems, fraud detection, and natural language processing.
By enabling reduced-precision inference, TensorRT significantly minimizes latency, meeting the critical requirements of real-time services, autonomous systems, and embedded applications.
Deploy, Run, and Scale with Triton
TensorRT-optimized models can be deployed, operated, and scaled with NVIDIA Triton™, an open-source inference-serving software. Triton supports TensorRT as one of its backends and offers numerous advantages, such as high throughput achieved through dynamic batching and concurrent model execution.
Additionally, Triton introduces advanced features like model ensembles and streaming audio/video inputs, facilitating a flexible and efficient environment for deploying and managing AI models at scale.
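As an illustration, a minimal Triton model configuration (config.pbtxt) for serving a TensorRT engine with dynamic batching might look like the following; the model name, batch sizes, and queue delay are placeholder values, not recommendations:

```protobuf
name: "my_trt_llm_model"
platform: "tensorrt_plan"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 16, 32 ]
  max_queue_delay_microseconds: 100
}
```

With dynamic batching enabled, Triton groups individual incoming requests into larger batches on the server side, trading a small queuing delay for higher GPU throughput.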
What Can You Do With TensorRT-LLM?
TensorRT-LLM provides a substantial performance boost for large language model (LLM) inference on NVIDIA GPUs. Here are its key capabilities for LLM developers:
Open Source Python API
One of the key advantages of TensorRT-LLM is its open-source modular Python API, which simplifies defining, optimizing, and executing new architectures and enhancements as LLMs evolve. This modularity ensures ease of use and extensibility, allowing developers to customize the library to fit their specific needs.
In-Flight Batching and Paged Attention
TensorRT-LLM introduces In-Flight Batching, which optimizes the text generation process by breaking it into multiple execution iterations. The runtime evicts finished sequences from the batch and immediately begins processing new requests, even while others are still in flight. This feature, managed by the Batch Manager, reduces wait times in queues, eliminates the need for padding requests, and enhances GPU utilization.
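The scheduling idea can be sketched in plain Python. This is a toy simulation of the policy described above, not the actual Batch Manager API: finished sequences free their batch slot immediately, and queued requests are admitted mid-generation.

```python
from collections import deque

def serve(requests, max_batch=4):
    """Toy in-flight batching: each request is (request_id, tokens_to_generate).
    Returns (decoding iterations used, completion order)."""
    queue = deque(requests)
    active = {}        # request_id -> tokens still to generate
    completed = []
    steps = 0
    while queue or active:
        # Admit queued requests into free slots, even mid-generation.
        while queue and len(active) < max_batch:
            rid, length = queue.popleft()
            active[rid] = length
        # One decoding iteration: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]       # evict finished sequence at once
                completed.append(rid)
        steps += 1
    return steps, completed

# Short and long requests share the batch; short ones finish and free
# their slots without waiting for the longest sequence to complete.
steps, done = serve([("a", 2), ("b", 8), ("c", 2), ("d", 2), ("e", 2)])
```

With static batching, request "e" could not start until the whole first batch (including the 8-token request "b") finished; here it slips into a freed slot after two iterations.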
Multi-GPU and Multi-Node Inference
TensorRT-LLM supports multi-GPU and multi-node inference, leveraging pre- and post-processing steps along with communication primitives in its Python API. This support facilitates groundbreaking LLM inference performance on NVIDIA GPUs. For more details, users should refer to the Multi-GPU and Multi-Node Support section of the documentation.
FP8 Support
With NVIDIA H100 GPUs, TensorRT-LLM allows for easy conversion of model weights into the FP8 format, automatically compiling models to utilize optimized FP8 kernels. This feature, powered by NVIDIA Hopper, does not require changes to the model code.
Latest GPU Support
TensorRT-LLM supports a wide range of NVIDIA GPUs, including those based on the Hopper, Ada Lovelace, Ampere, Turing, and Volta architectures. Users should consult the Support Matrix for specific limitations and detailed support information.
Native Windows Support
Developers and AI enthusiasts can benefit from accelerated LLMs on PCs and workstations powered by NVIDIA RTX and NVIDIA GeForce RTX GPUs. For installation details, users can refer to the Installing on Windows section of the documentation.
TensorRT-LLM Architecture and Components
Let’s look at the structure of TensorRT-LLM.
Model Definition
TensorRT-LLM provides a Python API that enables the definition of Large Language Models (LLMs) through a straightforward and powerful interface. This API lets you create graph representations of deep neural networks within TensorRT.
The API's design provides utilities for defining network structures and integrating specialized activation functions directly into the model's graph. This approach simplifies the model definition process, making it accessible to developers without requiring deep expertise in underlying technologies like C++ or CUDA.
Weight Bindings
An essential step in preparing LLMs for inference with TensorRT-LLM involves binding model parameters, such as weights and biases, to the network before compilation. This process ensures that the network's weights are embedded within the TensorRT engine, enabling efficient execution.
TensorRT-LLM extends the capability to update these weights post-compilation, offering flexibility for refining and optimizing models in response to new data or objectives.
Pattern Matching and Fusion
TensorRT-LLM supports ‘operation fusion’, a technique that combines multiple operations into a single, more efficient kernel. This process, facilitated by TensorRT's pattern-matching algorithms, enhances execution efficiency by minimizing memory transfers and kernel launch overhead.
By identifying and fusing compatible operations, such as combining activation functions directly with matrix multiplications, TensorRT-LLM optimizes the data flow within the network, leading to faster inference times and reduced computational overhead.
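A toy illustration of the idea in pure Python (not TensorRT's actual kernels): fusing a matrix-vector product with its activation produces each output element in one pass, so the intermediate result never has to be written out and read back.

```python
def matvec(w, x):
    """Plain matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, vi) for vi in v]

def unfused(w, x):
    # Two passes: the intermediate vector is fully materialized.
    return relu(matvec(w, x))

def fused(w, x):
    # "Fused": the activation is applied inside the same loop that
    # computes each element, so no intermediate buffer exists.
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]

w = [[1.0, -2.0], [0.5, 0.5]]
x = [3.0, 1.0]
assert fused(w, x) == unfused(w, x)  # same result, one pass instead of two
```

On a GPU the saving is not the arithmetic but the round trip to memory and the extra kernel launch, which fusion eliminates.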
Plugins
To extend the range of possible optimizations, TensorRT-LLM incorporates plugins, which are user-defined kernels that integrate into the network graph. These plugins allow for the implementation of advanced graph modifications and optimizations that might not be automatically recognized by TensorRT's pattern-matching algorithms.
One example is the Flash-Attention technique for optimizing multihead attention blocks, which demonstrates how plugins enable customization and enhancement of LLM performance beyond standard optimizations.
Runtime
The runtime component of TensorRT-LLM is designed to manage the execution of TensorRT engines, supporting both Python and C++ environments. This includes loading the engines and orchestrating their execution, catering to complex models like GPT that require specific handling of input sequences and generation loops.
The runtime API makes it possible to deploy LLMs efficiently, ensuring seamless operation across single and multi-GPU systems. It does this by leveraging communication plugins for optimized data exchange between GPUs.
TensorRT-LLM Benchmarks
The benchmarks for TensorRT-LLM demonstrate its significant performance improvements for large language model (LLM) inference on various NVIDIA GPUs. These benchmarks were conducted using a local inference client fed requests at an infinite rate to measure maximum throughput. The performance data was collected with version 0.10.0 and is reported in tokens per second.
TensorRT-LLM Installation and Build
Let’s see how to install and build a large language model using TensorRT-LLM.
Create the Container
TensorRT-LLM provides a flexible setup for developers by offering a way to create and run a development container. This container facilitates the building of TensorRT-LLM within a controlled environment.
On systems with GNU Make:
To create a Docker image for development, use the command:
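Per the TensorRT-LLM repository's Makefile targets (check docker/Makefile for the current form), the command is roughly:

```shell
# Run from the root of the TensorRT-LLM source tree
make -C docker build
```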
This command tags the image locally as tensorrt_llm/devel:latest. To run the container, execute:
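A sketch of the corresponding Make target, as documented in the TensorRT-LLM repository:

```shell
# Starts the tensorrt_llm/devel:latest container with the source tree mounted
make -C docker run
```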
For users who prefer operating under their user account instead of root, include LOCAL_USER=1:
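For example:

```shell
# Maps your host user into the container instead of running as root
make -C docker run LOCAL_USER=1
```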
On systems without GNU Make:
For systems that don’t support GNU Make, build the Docker image with:
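A direct docker invocation along the lines of the repository's documentation (the Dockerfile path and tag mirror what the Make target produces; verify against the current repository):

```shell
docker build --pull \
  --file docker/Dockerfile.multi \
  --target devel \
  --tag tensorrt_llm/devel:latest .
```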
Then, run the container using:
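A typical invocation, mounting the source tree into the container (flags follow the repository's documented example; adjust paths for your setup):

```shell
docker run --rm -it \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all \
  --volume ${PWD}:/code/tensorrt_llm \
  --workdir /code/tensorrt_llm \
  tensorrt_llm/devel:latest
```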
Build TensorRT-LLM
Inside the container, TensorRT-LLM can be compiled from source with:
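Roughly (the --trt_root path assumes the TensorRT installation location inside the development container; newer versions may not require it):

```shell
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
```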
Then, deploy it by installing the wheel file:
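For example:

```shell
# The wheel is produced under build/ by build_wheel.py
pip install ./build/tensorrt_llm*.whl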
For a clean build, use --clean with the build command. To target specific CUDA architectures, specify them with --cuda_architectures.
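Combining both options might look like this; the architecture list is an illustrative example, not a recommendation:

```shell
# Clean rebuild targeting only Ampere (SM80) and Ada (SM89) GPUs
python3 ./scripts/build_wheel.py --clean --cuda_architectures "80-real;89-real"
```

Restricting --cuda_architectures to the GPUs you actually deploy on shortens compilation time and shrinks the resulting wheel.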
Link with the TensorRT-LLM C++ Runtime
The build_wheel.py script compiles both the Python and C++ runtime of TensorRT-LLM. For projects only requiring the C++ runtime, use --cpp_only:
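For instance (the architecture value is an example):

```shell
python3 ./scripts/build_wheel.py --cuda_architectures "80-real" --cpp_only --clean
```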
This approach is beneficial for avoiding linking issues related to torch and GCC's dual ABI support. Libraries for linking against TensorRT-LLM can be found in cpp/build/tensorrt_llm.
Use Supported C++ Header Files
When integrating TensorRT-LLM, include the cpp and cpp/include directories in your project's include paths. Only headers in cpp/include are considered part of the API and should be directly included. Headers under cpp are subject to change and should not be directly included in projects to ensure compatibility with future versions of TensorRT-LLM.
Optimizing Your AI Infrastructure with Run:ai
Run:ai automates resource management and orchestration and reduces cost for the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute-intensive experiments as needed.
Here are some of the capabilities you gain when using Run:ai:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:ai GPU virtualization platform.