Accelerating Inference for LLMs

What Is TensorRT-LLM?

NVIDIA TensorRT-LLM is an open-source library designed to optimize the inference performance of large language models (LLMs) on NVIDIA's AI platform. It allows developers to work with the latest LLMs, providing high performance and easy customization without requiring in-depth knowledge of C++ or CUDA.

TensorRT-LLM is built on the TensorRT deep learning compiler and includes optimized kernels sourced from the NVIDIA FasterTransformer project. The library also provides functionality for pre- and post-processing, as well as for communication across multiple GPUs and nodes. Together, these components allow TensorRT-LLM to deliver significant performance improvements for LLM inference.

This is part of a series of articles about Generative AI.


NVIDIA TensorRT Benefits

Speed Up LLM Inference

NVIDIA TensorRT-based applications perform inference up to 8X faster than CPU-only platforms, according to NVIDIA tests. TensorRT can optimize neural network models trained in all major frameworks.

By calibrating for lower precision while maintaining high output accuracy, applications can be deployed across a variety of platforms, from hyperscale data centers to embedded and automotive product platforms. This speed increase not only boosts efficiency but also opens up possibilities for real-time applications and services.

Optimize Inference Performance

Built on the NVIDIA CUDA® parallel programming model, TensorRT is specifically designed to optimize inference performance. It achieves this through techniques such as quantization, layer and tensor fusion, and kernel tuning, tailored for NVIDIA GPUs. These optimizations allow for streamlined processing and enhanced performance for LLM inference.
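To make the quantization idea concrete, the sketch below shows the arithmetic behind symmetric per-tensor INT8 quantization: a calibration step picks a scale so the largest observed magnitude maps to 127, then values are rounded into the INT8 range and recovered by multiplying back. This is a simplified illustration of the concept, not TensorRT's actual calibration algorithm.

```python
def calibrate_scale(values):
    """Derive a per-tensor scale so the largest magnitude maps to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Map FP32 values into the symmetric INT8 range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """Recover approximate FP32 values from the INT8 representation."""
    return [q * scale for q in qvalues]

activations = [0.02, -1.5, 0.73, 3.8, -2.9]
scale = calibrate_scale(activations)   # calibration pass over sample data
q = quantize(activations, scale)       # 8-bit representation
recovered = dequantize(q, scale)       # close to the original values
```

The recovered values differ from the originals by at most half a scale step, which is why calibration on representative data is what keeps output accuracy high at lower precision.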

Accelerate AI Workloads

TensorRT is engineered to accelerate workloads by incorporating 8-bit integer (INT8) and 16-bit floating point (FP16) optimizations. These optimizations are crucial for deploying deep learning inference applications across a range of sectors, including video streaming, recommendation systems, fraud detection, and natural language processing.

By enabling reduced-precision inference, TensorRT significantly minimizes latency, meeting the critical requirements of real-time services, autonomous systems, and embedded applications.

Deploy, Run, and Scale with Triton

TensorRT-optimized models can be deployed, operated, and scaled with NVIDIA Triton™, an open-source inference-serving software. Triton supports TensorRT as one of its backends and offers numerous advantages, such as high throughput achieved through dynamic batching and concurrent model execution.

Additionally, Triton introduces advanced features like model ensembles and streaming audio/video inputs, facilitating a flexible and efficient environment for deploying and managing AI models at scale.
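The throughput benefit of dynamic batching comes from grouping individual requests so the model executes once per batch instead of once per request. The toy scheduler below illustrates that idea only; the names are hypothetical, and Triton's real scheduler additionally applies queueing delays and other policies.

```python
from collections import deque

MAX_BATCH = 4

def run_model(batch):
    """Stand-in for one batched model execution (here it just doubles inputs)."""
    return [x * 2 for x in batch]

def serve(request_queue):
    """Drain the queue, running the model on groups of up to MAX_BATCH requests."""
    results = []
    while request_queue:
        batch = [request_queue.popleft()
                 for _ in range(min(MAX_BATCH, len(request_queue)))]
        results.extend(run_model(batch))  # one execution per batch, not per request
    return results

queue = deque([1, 2, 3, 4, 5, 6])
outputs = serve(queue)  # two model executions instead of six
```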

TensorRT-LLM Architecture and Components

Let’s look at the structure of TensorRT-LLM.

Model Definition

TensorRT-LLM provides a Python API that enables the definition of Large Language Models (LLMs) through a straightforward and powerful interface. This API lets you create graph representations of deep neural networks within TensorRT.

The API's design provides utilities for defining network structures and integrating specialized activation functions directly into the model's graph. This approach simplifies the model definition process, making it accessible to developers without requiring deep expertise in underlying technologies like C++ or CUDA.
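The core idea of such an API is that layer calls record nodes in a graph rather than executing eagerly, and the graph is later compiled into an engine. The sketch below illustrates that pattern with a toy graph builder; all names are hypothetical, and the real interface is documented in the TensorRT-LLM Python API reference.

```python
class Graph:
    """Toy graph container: each layer call records a node instead of computing."""
    def __init__(self):
        self.nodes = []

    def add(self, op, *inputs):
        node_id = len(self.nodes)
        self.nodes.append({"id": node_id, "op": op, "inputs": list(inputs)})
        return node_id

def define_tiny_mlp(graph):
    """Record a two-layer network as graph nodes."""
    x = graph.add("input")
    h = graph.add("matmul", x)
    h = graph.add("gelu", h)       # specialized activation integrated into the graph
    return graph.add("matmul", h)

g = Graph()
out = define_tiny_mlp(g)  # nothing has executed; the network is just a graph
```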

Weight Bindings

An essential step in preparing LLMs for inference with TensorRT-LLM involves binding model parameters, such as weights and biases, to the network before compilation. This process ensures that the network's weights are embedded within the TensorRT engine, enabling efficient execution.

TensorRT-LLM extends the capability to update these weights post-compilation, offering flexibility for refining and optimizing models in response to new data or objectives.
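Conceptually, weight binding is a lookup-and-copy step: each named parameter in the network graph is matched against a checkpoint and filled in before the engine is built. The sketch below illustrates that step with hypothetical structures, not TensorRT-LLM's actual classes.

```python
class Parameter:
    """A named network parameter that starts unbound."""
    def __init__(self, name):
        self.name = name
        self.value = None  # unbound until weights are attached

def bind_weights(params, checkpoint):
    """Copy checkpoint tensors into the network's named parameters."""
    for p in params:
        if p.name not in checkpoint:
            raise KeyError(f"missing weight: {p.name}")
        p.value = checkpoint[p.name]

params = [Parameter("layer0.weight"), Parameter("layer0.bias")]
checkpoint = {
    "layer0.weight": [[1.0, 0.0], [0.0, 1.0]],
    "layer0.bias": [0.0, 0.0],
}
bind_weights(params, checkpoint)  # after this, compilation can embed the weights
```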

Pattern Matching and Fusion

TensorRT-LLM supports ‘operation fusion’, a technique that combines multiple operations into a single, more efficient kernel operation. This process, facilitated by TensorRT's advanced pattern-matching algorithms, enhances execution efficiency by minimizing memory transfers and kernel launch overhead.

By identifying and fusing compatible operations, such as combining activation functions directly with matrix multiplications, TensorRT-LLM optimizes the data flow within the network, leading to faster inference times and reduced computational overhead.
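A fusion pass of this kind can be pictured as a rewrite over the sequence of operations: wherever a matrix multiplication is immediately followed by an activation, the pair is replaced by a single fused op. The sketch below shows that rewrite in miniature; real compilers match far richer patterns than this.

```python
def fuse_matmul_activation(ops):
    """Collapse each (matmul, relu) pair into a single fused op."""
    fused = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "matmul" and ops[i + 1] == "relu":
            fused.append("matmul_relu")  # one kernel launch instead of two
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

pipeline = ["matmul", "relu", "matmul", "softmax"]
optimized = fuse_matmul_activation(pipeline)  # 4 ops reduced to 3
```

Fewer kernels means fewer launches and fewer round trips of intermediate results through GPU memory, which is where the speedup comes from.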

Plugins

To extend the range of possible optimizations, TensorRT-LLM incorporates plugins, which are user-defined kernels that integrate into the network graph. These plugins allow for the implementation of advanced graph modifications and optimizations that might not be automatically recognized by TensorRT's pattern-matching algorithms.

One example is the Flash-Attention technique for optimizing multihead attention blocks, which demonstrates how plugins enable customization and enhancement of LLM performance beyond standard optimizations.
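The plugin mechanism can be thought of as a registry: user-supplied kernels are registered under a name and dispatched when the graph reaches an op the built-in optimizer does not handle itself. The sketch below illustrates that dispatch pattern with hypothetical names; the plugin body is a placeholder, not a real attention kernel.

```python
PLUGINS = {}

def register_plugin(name):
    """Decorator registering a custom kernel implementation under a name."""
    def wrap(fn):
        PLUGINS[name] = fn
        return fn
    return wrap

@register_plugin("flash_attention")
def flash_attention(q, k, v):
    # Placeholder body; a real plugin would implement the fused attention kernel.
    return [qi + ki + vi for qi, ki, vi in zip(q, k, v)]

def run_op(name, *args):
    """Dispatch an op to its registered plugin implementation."""
    return PLUGINS[name](*args)

out = run_op("flash_attention", [1], [2], [3])
```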

Runtime

The runtime component of TensorRT-LLM is designed to manage the execution of TensorRT engines, supporting both Python and C++ environments. This includes loading the engines and orchestrating their execution, catering to complex models like GPT that require specific handling of input sequences and generation loops.

The runtime API makes it possible to deploy LLMs efficiently, ensuring seamless operation across single and multi-GPU systems. It does this by leveraging communication plugins for optimized data exchange between GPUs.
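The generation loop the runtime manages for autoregressive models can be sketched in a few lines: feed the current sequence to the engine, append the predicted next token, and stop at an end token or a length limit. The model below is a stub standing in for an engine execution.

```python
END = 0

def toy_model(tokens):
    """Stub next-token predictor: counts down from the last token."""
    return max(tokens[-1] - 1, END)

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding loop, as orchestrated by a runtime."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_model(tokens)  # one engine execution per generated token
        tokens.append(nxt)
        if nxt == END:           # stop condition handled by the runtime
            break
    return tokens

sequence = generate([3])
```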

TensorRT-LLM Installation and Build  

Let’s see how to install and build a large language model using TensorRT-LLM.

Create the Container

TensorRT-LLM provides a flexible setup for developers by offering a way to create and run a development container. This container facilitates the building of TensorRT-LLM within a controlled environment.

On systems with GNU Make:

To create a Docker image for development, use the command:

make -C docker build

This command tags the image locally as tensorrt_llm/devel:latest. To run the container, execute:

make -C docker run

For users who prefer operating under their user account instead of root, include LOCAL_USER=1:

make -C docker run LOCAL_USER=1

On systems without GNU Make:

For systems that don’t support GNU Make, build the Docker image with:

docker build --pull \
             --target devel \
             --file docker/Dockerfile.multi \
             --tag tensorrt_llm/devel:latest \
             .
Then, run the container using:

docker run --rm -it \
           --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
           --volume ${PWD}:/code/tensorrt_llm \
           --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest

Build TensorRT-LLM

Inside the container, TensorRT-LLM can be compiled from source with:

python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

Then, deploy it by installing the wheel file:

pip install ./build/tensorrt_llm*.whl

For a clean build, use --clean with the build command. To target specific CUDA architectures, specify them with --cuda_architectures.

Link with the TensorRT-LLM C++ Runtime

The build_wheel.py script compiles both the Python and C++ runtime of TensorRT-LLM. For projects only requiring the C++ runtime, use --cpp_only:

python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --cpp_only --clean

This approach is beneficial for avoiding linking issues related to torch and GCC's dual ABI support. Libraries for linking against TensorRT-LLM can be found in cpp/build/tensorrt_llm.

Use Supported C++ Header Files

When integrating TensorRT-LLM, include the cpp and cpp/include directories in your project's include paths. Only headers in cpp/include are considered part of the API and should be directly included. Headers under cpp are subject to change and should not be directly included in projects to ensure compatibility with future versions of TensorRT-LLM.

Optimizing Your AI Infrastructure with Run:ai

Run:ai automates resource management and orchestration, reducing the cost of the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute-intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.