NVIDIA’s open-source Triton Inference Server offers backend support for most machine learning (ML) frameworks, as well as custom C++ and Python backends. This reduces the need for multiple inference servers for different frameworks and allows you to simplify your machine learning infrastructure.
While Triton was initially designed for advanced GPU features, it can also perform well on CPUs. Because it supports a range of processing hardware and ML frameworks, Triton reduces the complexity of the model serving infrastructure.
This is part of our series of articles about machine learning operations.
The Triton Inference Server offers the following features:
Triton uses the concept of a “model,” representing a packaged machine learning algorithm used to perform inference. Triton can access models from a local file path, Google Cloud Storage, Amazon S3, or Azure Storage.
You specify a model repository when starting Triton with the --model-repository option, as in the following command. You can add more than one --model-repository flag to connect Triton to multiple repositories.
$ tritonserver --model-repository=<model-repository-path>
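As a rough illustration of passing several repositories, the following Python sketch assembles a tritonserver command line with one --model-repository flag per location. The function name, the local path, and the bucket name are hypothetical; the s3:// prefix is the form Triton's documentation describes for Amazon S3 repositories.

```python
import shlex


def tritonserver_command(repositories):
    """Return the argv list for starting Triton with one flag per repository."""
    if not repositories:
        raise ValueError("at least one model repository is required")
    args = ["tritonserver"]
    for repo in repositories:
        args.append(f"--model-repository={repo}")
    return args


# Hypothetical locations: a local directory and an S3 bucket.
command = tritonserver_command(["/models", "s3://example-bucket/models"])
print(shlex.join(command))
```

Keeping the command as a list of arguments, rather than one string, avoids quoting problems if it is later handed to subprocess.run.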
There can be multiple versions of each model, with each version stored in a numerically named subdirectory. The subdirectory’s name must be the model’s version number. Triton ignores subdirectories whose names are not numeric or that start with 0. In the model configuration, you can specify a version policy to determine which versions Triton uses for inference.
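The naming rule for version subdirectories can be sketched in a few lines of Python. This is only an illustration, not Triton's own code: it assumes a name counts as a version when it consists solely of digits and does not start with 0, and the helper names and sample directory names are made up for the example.

```python
def is_version_dir(name):
    """True if a subdirectory name would be treated as a model version."""
    # Assumed rule: digits only, and no leading zero.
    return name.isascii() and name.isdigit() and not name.startswith("0")


def available_versions(subdir_names):
    """Return the version numbers found among subdirectory names, ascending."""
    return sorted(int(n) for n in subdir_names if is_version_dir(n))


# Hypothetical contents of one model directory.
names = ["1", "2", "10", "0", "01", "v3", "backup"]
versions = available_versions(names)
print(versions)       # [1, 2, 10]
print(max(versions))  # 10
```

Which of these versions is actually served still depends on the version policy in the model configuration.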
Within a model version subdirectory, Triton stores the required files, which may differ according to the type of the model and backend requirements.
When you start Triton, you pass these repository paths with the --model-repository option, which you can repeat to include models from different repositories. The files and directories in a model repository must conform to the required layout.
An ONNX model can be a single file or a directory containing multiple files. By default, the file or directory must be named model.onnx. In a minimal ONNX model definition, it sits inside a version subdirectory, for example <model-name>/1/model.onnx.
Note: An ONNX model must be supported by the version of ONNX Runtime that Triton is currently using.
TensorFlow uses two main types of formats to represent models: GraphDef and SavedModel.
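A GraphDef model is represented by a single file, which by default is called model.graphdef. In a minimal model definition, this file sits inside a version subdirectory, for example <model-name>/1/model.graphdef.
A SavedModel is a directory containing multiple files, which by default is called model.savedmodel. In a minimal definition for a TensorFlow SavedModel model, this directory sits inside a version subdirectory, for example <model-name>/1/model.savedmodel/<list of files>.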
A TensorRT model definition is called a plan. A plan is a single file, which by default is called model.plan.
A TensorRT plan is specific to a GPU’s CUDA Compute Capability. The model configuration therefore needs to set the cc_model_filenames property to associate each plan file with the corresponding Compute Capability.
In a minimal definition for a TensorRT model, the plan file sits inside a version subdirectory, for example <model-name>/1/model.plan.
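Conceptually, cc_model_filenames acts as a lookup from Compute Capability to plan file. The Python sketch below mimics that lookup with a plain dictionary; it is not Triton's implementation or its configuration syntax. The function name and file names are hypothetical, and the sketch assumes that a capability missing from the map falls back to the default file name.

```python
DEFAULT_PLAN = "model.plan"


def plan_for_gpu(compute_capability, cc_model_filenames, default=DEFAULT_PLAN):
    """Pick the plan file for a GPU's Compute Capability, else the default."""
    return cc_model_filenames.get(compute_capability, default)


# Hypothetical mapping and file names, for illustration only.
mapping = {"7.5": "model_cc75.plan", "8.0": "model_cc80.plan"}
print(plan_for_gpu("8.0", mapping))  # model_cc80.plan
print(plan_for_gpu("6.1", mapping))  # model.plan
```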
A TorchScript model uses a single file, which by default is called model.pt. In a minimal model definition, this file sits inside a version subdirectory, for example <model-name>/1/model.pt.
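To make the default file names in this section concrete, the following Python sketch builds a throwaway repository skeleton in a temporary directory and prints the resulting paths. The dictionary keys are informal labels chosen for this sketch, not Triton platform names, and the model names are hypothetical. The files are empty placeholders that only show where real artifacts would go. The sketch also assumes each model directory holds a config.pbtxt file for the model configuration.

```python
import tempfile
from pathlib import Path

# Default file (or directory) name expected inside a version subdirectory.
# The keys are informal labels used only in this sketch.
DEFAULT_NAMES = {
    "onnx": "model.onnx",
    "graphdef": "model.graphdef",
    "savedmodel": "model.savedmodel",
    "tensorrt": "model.plan",
    "torchscript": "model.pt",
}


def create_skeleton(repo, model_name, kind, version=1):
    """Create placeholder files showing where a model's artifacts would go."""
    model_dir = Path(repo) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the model configuration lives in config.pbtxt.
    (model_dir / "config.pbtxt").touch()
    target = version_dir / DEFAULT_NAMES[kind]
    if kind == "savedmodel":
        target.mkdir(exist_ok=True)  # a SavedModel is a directory of files
    else:
        target.touch()  # empty placeholder, not a real model
    return target


with tempfile.TemporaryDirectory() as repo:
    for kind in DEFAULT_NAMES:
        create_skeleton(repo, f"example_{kind}", kind)
    layout = sorted(p.relative_to(repo).as_posix() for p in Path(repo).rglob("*"))

for entry in layout:
    print(entry)
```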
There are a number of client libraries available to facilitate communication with Triton. The Triton project also provides examples for using these libraries.
Triton client libraries include:
A number of example applications demonstrate how to use these libraries. You can find most of them in the GitHub repository. Here are some of the examples available:
First, you need to install Docker. You also need to install the NVIDIA Container Toolkit to use a GPU for inference—otherwise, Docker cannot recognize GPUs. If you intend to use NVIDIA DGX, you should follow these guidelines on preparing containers.
Use this command to pull a Triton Docker image:
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3
<xx.yy> represents the version of Triton you intend to pull.
A model repository is a directory containing the models that Triton serves. The docs/examples/model_repository directory provides an example of a model repository. You need to retrieve any missing model definition files before you can use the repository. You can use the following script to fetch these files from public model zoos:
$ cd docs/examples
$ ./fetch_models.sh
While Triton performs best for GPU inferencing, you can also use it on CPU-only systems. You can use the same Triton Docker image for either GPU or CPU. To run Triton on a GPU-based system, run the following command:
$ docker run --gpus=3 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
In this example, --gpus=3 indicates that the system should make three GPUs available to Triton for inferencing. Triton will run using the example model repository. <xx.yy> represents the version of Triton you’ve chosen to use. Once you start Triton, the console displays output showing the server starting up and loading the models.
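The pieces of the docker run command can be made explicit with a small Python sketch that assembles the same arguments. The function name is hypothetical, and the <xx.yy> placeholder is left as-is for you to replace with a real version. The sketch assumes that on a CPU-only host you simply omit the --gpus flag, and that ports 8000, 8001, and 8002 are Triton's default HTTP, gRPC, and metrics ports.

```python
def docker_run_command(model_repo, tag, gpus=None):
    """Build the docker run argv for Triton; omit --gpus for CPU-only hosts."""
    args = ["docker", "run"]
    if gpus is not None:
        args.append(f"--gpus={gpus}")
    args += [
        "--rm",
        # Assumed defaults: 8000 = HTTP, 8001 = gRPC, 8002 = metrics.
        "-p8000:8000", "-p8001:8001", "-p8002:8002",
        # Mount the host model repository at /models inside the container.
        f"-v{model_repo}:/models",
        f"nvcr.io/nvidia/tritonserver:{tag}-py3",
        "tritonserver", "--model-repository=/models",
    ]
    return args


repo_path = "/full/path/to/docs/examples/model_repository"
gpu_command = docker_run_command(repo_path, "<xx.yy>", gpus=3)
cpu_command = docker_run_command(repo_path, "<xx.yy>")
print(" ".join(gpu_command))
print(" ".join(cpu_command))
```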
Models that load correctly display a “ready” status. For models that fail to load, Triton reports the cause of the failure. If you cannot see a model in the table in the console output, check your CUDA drivers and the path to your model repository.
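Besides reading the console output, you can check readiness over HTTP. The sketch below polls the server readiness endpoint using only the Python standard library. It assumes Triton's HTTP service is reachable on localhost port 8000, as in the port mapping above, and that it exposes the /v2/health/ready endpoint. The helper names are made up for this example.

```python
import urllib.request


def ready_url(host="localhost", port=8000):
    """URL of the server readiness endpoint on Triton's HTTP port."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(url, timeout=2.0):
    """Return True if the endpoint answers HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


print(is_ready(ready_url()))
```

A False result only tells you that the server did not answer with a 200 status. The console output remains the place to look for the cause.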
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run compute intensive inference on as many machines as needed.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:ai GPU virtualization platform.