What Is the NVIDIA Triton Inference Server?
NVIDIA’s open-source Triton Inference Server provides backends for most machine learning (ML) frameworks, as well as support for custom C++ and Python backends. This reduces the need to run a separate inference server for each framework and simplifies your machine learning infrastructure.
While Triton was initially designed for advanced GPU features, it can also perform well on CPU. Triton offers flexible processing hardware and ML framework support, reducing the complexity of the model serving infrastructure.
This is part of our series of articles about cloud deep learning.
In this article:
- Triton Inference Server Features
- Triton Model Repository
- What Are Model Versions?
- Types Of Models Supported By Triton
- ONNX Models
- TensorFlow Models
- TensorRT Models
- TorchScript Models
- Triton Client Libraries
- Tutorial: Install and Run Triton
- 1. Install Triton Docker Image
- 2. Create Your Model Repository
- 3. Run Triton
- Accelerating AI Inference with Run:AI
Triton Inference Server Features
The Triton Inference Server offers the following features:
- Support for various deep-learning (DL) frameworks—Triton can manage various combinations of DL models and is only limited by memory and disk resources. Triton supports multiple formats, including TensorFlow 1.x and 2.x, TensorFlow SavedModel, TensorFlow GraphDef, TensorRT, ONNX, OpenVINO and PyTorch TorchScript.
- Simultaneous execution—Triton can run multiple instances of a model, or multiple models, concurrently, either on multiple GPUs or on a single GPU.
- Dynamic scheduling and batching—Triton uses a variety of scheduling and batching algorithms to aggregate inference requests and increase inference throughput for models that support batching. Scheduling and batching decisions are transparent to the client sending inference requests, so clients do not need to be aware of them.
- Backend extensibility—Triton has a backend API, which can be used to extend it with any model execution logic you implement in C++ or Python. Custom backends still benefit from Triton features such as GPU and CPU support, concurrent execution, and dynamic batching.
- Model ensembles—a Triton ensemble provides a representation of a model pipeline. This includes the connection of output and input tensors between models. You can trigger the execution of an entire pipeline with one inference request.
- Various metrics—Triton provides a variety of metrics in the Prometheus data format. These include metrics for server throughput, server latency, and GPU utilization.
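To make the metrics feature more concrete, here is a minimal Python sketch that parses Prometheus text-format output into (name, labels, value) tuples. It assumes the metrics come from Triton's metrics endpoint (by default served at /metrics on port 8002, the third port mapped in the tutorial later in this article). The sample text, the model name example_model, and the GPU identifier are made up for illustration; the parser is a simplified reader, not a complete implementation of the Prometheus format.

```python
import re

# Matches lines such as: metric_name{label="value",...} 12.5 [timestamp]
_SAMPLE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore anything this simplified parser cannot read.
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative sample only: the values, model name, and GPU id are invented.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="example_model",version="1"} 42
nv_gpu_utilization{gpu_uuid="GPU-hypothetical"} 0.5
"""

metrics = parse_metrics(SAMPLE)
for name, labels, value in metrics:
    print(name, labels, value)
```

Against a running server, the text could be fetched with urllib.request.urlopen("http://localhost:8002/metrics") and passed to parse_metrics; in production a Prometheus server would normally scrape the endpoint directly.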
Triton Model Repository
Triton uses the concept of a “model,” representing a packaged machine learning algorithm used to perform inference. Triton can access models from a local file path, Google Cloud Storage, Amazon S3, or Azure Storage.
You specify a model repository when starting Triton with the --model-repository option, as in the following command. You can repeat the --model-repository flag to connect Triton to multiple repositories.
$ tritonserver --model-repository=<model-repository-path>
What Are Model Versions?
There can be multiple versions of each model, with each version stored in a numerically named subdirectory. The subdirectory’s name must be the model’s version number. Triton ignores subdirectories that are not numerically named or whose names start with 0. In the model configuration, you can specify a version policy to determine which versions Triton uses for inference.
Each model version subdirectory contains the files required for that version, which may differ according to the type of the model and the requirements of its backend.
Every repository you pass to Triton with the --model-repository option must conform to the required layout of files and directories.
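The version rules above can be sketched in a few lines of Python. This is an illustration of the naming rule and of version selection, not Triton's own code. The function names are hypothetical, and the three policies shown (latest, all, specific) and the default of serving only the latest version reflect our understanding of Triton's model configuration options.

```python
def servable_versions(subdir_names):
    """Return the version numbers found among subdirectory names, ascending."""
    versions = []
    for name in subdir_names:
        # Only purely numeric names that do not start with 0 count as versions.
        if name.isascii() and name.isdigit() and not name.startswith("0"):
            versions.append(int(name))
    return sorted(versions)


def apply_version_policy(versions, policy="latest", num_versions=1, specific=()):
    """Pick the versions to serve under a simplified version policy."""
    ordered = sorted(versions)
    if policy == "all":
        return ordered
    if policy == "specific":
        wanted = set(specific)
        return [v for v in ordered if v in wanted]
    if policy == "latest":
        # Keep the highest-numbered versions.
        return ordered[-num_versions:] if num_versions > 0 else []
    raise ValueError(f"unknown policy: {policy!r}")


found = servable_versions(["1", "2", "03", "0", "v4", "10", "notes"])
print(found)                                        # versions that follow the naming rule
print(apply_version_policy(found))                  # latest version only
print(apply_version_policy(found, num_versions=2))  # two latest versions
```

In this sketch the directories 03, 0, v4, and notes are skipped because they either start with 0 or are not numeric.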
Types Of Models Supported By Triton
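ONNX Models
An ONNX model can be a single file or a directory containing multiple files. By default, the file or directory must be named model.onnx. A minimal ONNX model definition is a model directory with one numbered version subdirectory (for example, 1) that holds model.onnx, optionally with a config.pbtxt model configuration file in the model directory.
Note: Triton can only load ONNX models that are supported by the version of ONNX Runtime it is using.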
TensorFlow Models
TensorFlow uses two main formats to represent models: GraphDef and SavedModel.
A GraphDef model is a single file that, by default, must be named model.graphdef. In a minimal model definition, this file sits inside the version subdirectory.
A SavedModel is a directory containing multiple files. By default, the directory must be named model.savedmodel. In a minimal model definition, this directory sits inside the version subdirectory and holds the SavedModel files.
TensorRT Models
A TensorRT model, called a plan, is defined in a single file that, by default, must be named model.plan.
A TensorRT plan is specific to a GPU’s CUDA Compute Capability. To serve the model on GPUs with different Compute Capabilities, set the cc_model_filenames property in the model configuration to map each Compute Capability to the corresponding plan file.
In a minimal TensorRT model definition, model.plan sits inside the version subdirectory.
TorchScript Models
A TorchScript model is a single file that, by default, must be named model.pt. In a minimal model definition, this file sits inside the version subdirectory.
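The default file names described in this section can be summarized with a small Python sketch that builds an empty skeleton repository in a temporary directory. The dictionary keys (such as "tensorflow_savedmodel") and the model names are hypothetical labels chosen for this example, not Triton identifiers, and the files created are empty placeholders rather than loadable models. A real repository would typically also carry a config.pbtxt file in each model directory.

```python
import tempfile
from pathlib import Path

# Default model file (or directory) name per model type, as described above.
# The keys are labels invented for this sketch.
DEFAULT_MODEL_FILENAMES = {
    "onnx": "model.onnx",
    "tensorflow_graphdef": "model.graphdef",
    "tensorflow_savedmodel": "model.savedmodel",  # a directory, not a file
    "tensorrt": "model.plan",
    "torchscript": "model.pt",
}


def create_skeleton(repo_root, model_name, model_type, version=1):
    """Create <repo_root>/<model_name>/<version>/<default name> as a placeholder."""
    filename = DEFAULT_MODEL_FILENAMES[model_type]
    version_dir = Path(repo_root) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / filename
    if model_type == "tensorflow_savedmodel":
        target.mkdir(exist_ok=True)  # SavedModel is a directory of files
    else:
        target.touch()  # empty placeholder standing in for the real model
    return target


with tempfile.TemporaryDirectory() as root:
    for kind in DEFAULT_MODEL_FILENAMES:
        create_skeleton(root, f"example_{kind}", kind)
    # Collect every created path relative to the repository root.
    layout = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))

for entry in layout:
    print(entry)
```

Each model ends up as three entries: the model directory, the version subdirectory 1, and the model file or directory inside it.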
Triton Client Libraries
There are a number of client libraries available to facilitate communication with Triton. The Triton project also provides examples for using these libraries.
Triton client libraries include:
- Python API—helps you communicate with Triton from a Python application. You can access all capabilities via GRPC or HTTP requests. This includes managing model repositories, health and status checks and inferencing. The library supports the use of CUDA and system memory to send inputs to Triton and receive outputs.
- C++ API—provides the same capabilities as the Python API but for C++ applications.
- Java API—provided by Alibaba Cloud PAI Team, it simplifies communication with Triton from Java applications with HTTP requests. Currently, there is only support for a limited set of features.
- GRPC API—you can use the protoc compiler to create a GRPC API in a wide range of programming languages.
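To show roughly what a client library does on your behalf, the following Python sketch builds the HTTP request for a single inference call. It is not the Triton client library itself: it assumes Triton's HTTP endpoint follows the KServe v2 inference protocol (POST to v2/models/<model name>/infer with a JSON body), and the model name example_model and the tensor names INPUT0 and OUTPUT0 are hypothetical. Only the request is constructed here; nothing is sent.

```python
import json


def build_infer_request(base_url, model_name, input_name, shape, datatype,
                        data, output_names=()):
    """Return (url, body_bytes) for one v2 inference request with one input."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),  # flattened tensor contents
            }
        ]
    }
    if output_names:
        # Optionally name the outputs the server should return.
        body["outputs"] = [{"name": name} for name in output_names]
    return url, json.dumps(body).encode("utf-8")


url, payload = build_infer_request(
    "http://localhost:8000",  # HTTP port mapped in the tutorial below
    "example_model",          # hypothetical model name
    "INPUT0",                 # hypothetical input tensor name
    [1, 4],
    "FP32",
    [1.0, 2.0, 3.0, 4.0],
    output_names=["OUTPUT0"],
)
print(url)
print(payload.decode("utf-8"))
```

With a server running, the request could be sent using urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"}). In practice, the Python client library (distributed as the tritonclient package) wraps these details and adds features such as gRPC transport and shared-memory inputs.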
A number of example applications demonstrate how to use these libraries. You can find most of them in the GitHub repository. Here are some of the examples available:
- Basic C++ example applications demonstrating how to communicate with Triton using the C++ library for tasks such as inferencing. C++ examples for the HTTP client are prefixed with simple_http_, while GRPC client examples are prefixed with simple_grpc_.
- Basic Python examples demonstrating how to communicate with Triton using the Python library for tasks such as inferencing. Python examples for the HTTP client are prefixed with simple_http_, while GRPC client examples are prefixed with simple_grpc_.
- Python and C++ versions of image_client, an example application that runs image classification models on Triton using the Python or C++ client library.
- Basic Java examples demonstrating how to communicate with Triton using the Java API for tasks such as inferencing.
- Examples showing how to use a protoc compiler-generated Python GRPC API to communicate with Triton. One example is grpc_client.py, which demonstrates simple ways to use the API. Another example is grpc_image_client.py, which is functionally similar to image_client but communicates with Triton using a GRPC client stub.
Tutorial: Install and Run Triton
1. Install Triton Docker Image
First, install Docker. To use a GPU for inference, you also need to install the NVIDIA Container Toolkit; without it, Docker cannot recognize GPUs. If you intend to use NVIDIA DGX, follow the NVIDIA guidelines on preparing containers.
Use this command to pull a Triton Docker image:
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3
<xx.yy> represents the version of Triton you intend to pull.
2. Create Your Model Repository
A model repository is a directory containing the models that Triton serves. The docs/examples/model_repository directory provides an example of a model repository. You need to retrieve any missing model definition files before you can use the repository. You can use the following script to fetch these files from public model zoos:
$ cd docs/examples
$ ./fetch_models.sh
3. Run Triton
While Triton performs best for GPU inferencing, you can also use it on CPU-only systems. You can use the same Triton Docker image for either GPU or CPU. To run Triton on a GPU-based system, run the following command:
$ docker run --gpus=3 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
In this example, --gpus=3 makes three GPUs available to Triton for inferencing, and Triton serves the models in the example model repository. <xx.yy> represents the version of Triton you have chosen to use. Once you start Triton, the console output shows the server starting up and loading the models.
Models that load correctly are listed with a “ready” status in the table of models shown in the console output. Models that fail to load are reported along with the cause of the failure. If a model does not appear in the table, check your CUDA drivers and the path to your model repository.
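Besides reading the console output, you can check readiness programmatically. The Python sketch below polls the server's readiness endpoint until it answers with HTTP 200. It assumes the HTTP port 8000 mapped in the docker run command above and the v2/health/ready path from the KServe v2 protocol that Triton's HTTP endpoint implements. The function names, retry count, and delay are illustrative choices, not part of Triton.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_ready(base_url="http://localhost:8000", attempts=30, delay=1.0,
                     probe=http_ok, sleep=time.sleep):
    """Poll the readiness endpoint; return the attempt number that succeeded.

    Returns None if the server never reported ready. The probe and sleep
    parameters exist so the loop can be exercised without a real server.
    """
    url = f"{base_url.rstrip('/')}/v2/health/ready"
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling wait_until_ready() after starting the container returns the number of the first successful attempt, or None if the server did not become ready within the allotted attempts. Note that server readiness does not tell you why an individual model failed to load; the console output remains the place to look for that.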
Accelerating AI Inference with Run:AI
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run compute-intensive inference on as many machines as needed.
Here are some of the capabilities you gain when using Run:AI:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.