Machine learning (ML) inference involves applying a machine learning model to a dataset and generating an output or “prediction”. This output might be a numerical score, a string of text, an image, or any other structured or unstructured data.
Typically, a machine learning model is software code implementing a mathematical algorithm. The machine learning inference process deploys this code into a production environment, making it possible to generate predictions for inputs provided by real end-users.
The machine learning life cycle includes two main parts: the training phase, in which a model learns patterns from historical data, and the inference phase, in which the trained model is deployed and applied to new data.
Here is the key difference between training and inference: training is the process of creating a model from data, while inference is the process of using that trained model to generate predictions for new, live inputs.
You need three main components to deploy machine learning inference: data sources, a system to host the ML model, and data destinations.
A data source captures real-time data from an internal source managed by the organization, from external sources, or from users of the application.
Common examples of data sources for ML applications are log files, transactions stored in a database, or unstructured data in a data lake.
The ML model's host system receives data from data sources and feeds it into the ML model. It provides the infrastructure on which the ML model’s code can run. After outputs (predictions) are generated by the ML model, the host system sends these outputs to the data destination.
Common examples of host systems are an API endpoint accepting inputs through a REST API, a web application receiving inputs from human users, or a stream processing application processing large volumes of log data.
The data destination is the target of the ML model. It can be any type of data repository, such as a database, a data lake, or a stream processing system that feeds downstream applications.
For example, a data destination could be the database of a web application, which stores predictions and allows them to be viewed and queried by end users. In other scenarios, the data destination could be a data lake, where predictions are stored for further analysis by big data tools.
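To make these three components concrete, here is a minimal batch-scoring sketch in Python. It assumes a scikit-learn style classifier saved with joblib, a CSV file of transactions as the data source, and a SQLite table as the data destination; all file names, columns, and features are hypothetical.

```python
import csv
import sqlite3

import joblib  # assumes the trained model was saved with joblib

# Data source: a CSV export of transaction records (hypothetical file/columns).
# Host system: this script, which loads the model and runs it over the records.
# Data destination: a SQLite table that downstream applications can query.

model = joblib.load("fraud_model.joblib")          # trained ML model artifact

conn = sqlite3.connect("predictions.db")           # data destination
conn.execute("CREATE TABLE IF NOT EXISTS scores (txn_id TEXT, score REAL)")

with open("transactions.csv") as f:                # data source
    for row in csv.DictReader(f):
        features = [[float(row["amount"]), float(row["age_days"])]]
        score = model.predict_proba(features)[0][1]   # probability of fraud
        conn.execute("INSERT INTO scores VALUES (?, ?)", (row["txn_id"], score))

conn.commit()
conn.close()
```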
Machine learning inference servers or engines execute your model algorithm and return an inference output. The inference server works by accepting input data, passing it to a trained ML model, executing the model, and returning the inference output.
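As a rough illustration of that loop, the sketch below exposes a scikit-learn style model behind a small Flask endpoint; the model file, route, and input schema are placeholders, and it assumes the model produces a numeric output. Production inference servers add capabilities such as batching, versioning, and monitoring on top of this basic pattern.

```python
from flask import Flask, jsonify, request
import joblib  # assumes the trained model was serialized with joblib

app = Flask(__name__)
model = joblib.load("model.joblib")   # load the trained model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # 1. Accept input data, e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    payload = request.get_json()
    # 2. Pass it to the trained model and execute it
    prediction = model.predict([payload["features"]])[0]
    # 3. Return the inference output to the caller
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```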
ML inference servers require ML model creation tools to export the model in a file format that the server can understand. The Apple Core ML inference server, for example, can only read models stored in the .mlmodel file format. If you built your model with TensorFlow, you can use a converter such as Apple's coremltools to convert it to the .mlmodel format.
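For example, converting a trained Keras model to Core ML might look roughly like the sketch below, assuming the coremltools package is installed; the exact arguments depend on the coremltools and TensorFlow versions, and the file paths are hypothetical.

```python
import coremltools as ct
import tensorflow as tf

# Assumes a trained Keras model was saved earlier with model.save(...)
keras_model = tf.keras.models.load_model("my_model.h5")

# Convert the TensorFlow/Keras model to Core ML.
# convert_to="neuralnetwork" targets the older .mlmodel container;
# newer coremltools versions default to the ML Program (.mlpackage) format.
mlmodel = ct.convert(keras_model, convert_to="neuralnetwork")
mlmodel.save("MyModel.mlmodel")
```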
You can use the Open Neural Network Exchange (ONNX) format to improve interoperability between your model training environment and various ML inference servers. ONNX is an open format for representing deep learning models, making models portable between the training tools and inference servers of vendors that support ONNX.
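As a sketch of this workflow, the example below exports a toy PyTorch model to ONNX and then runs it with ONNX Runtime, independently of the training framework; the model, shapes, and file name are illustrative.

```python
import numpy as np
import onnxruntime as ort
import torch

# Toy model standing in for a trained PyTorch module with a 4-feature input.
model = torch.nn.Sequential(torch.nn.Linear(4, 1), torch.nn.Sigmoid())
model.eval()

# Export to the ONNX format so any ONNX-capable inference server can load it.
dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model with ONNX Runtime, without PyTorch in the loop.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0])
```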
The following hardware systems are commonly used to run machine learning and deep learning inference workloads.
A CPU processes instructions to perform a sequence of requested operations. CPUs contain billions of transistors and a small number of powerful cores that can handle large, varied workloads and heavy memory use. Because they are general purpose, CPUs can run any operation without custom hardware programming.
Here are the four building blocks of CPUs:
- Control unit (CU), which fetches and decodes instructions and coordinates the other components
- Arithmetic logic unit (ALU), which performs arithmetic and logical operations
- Registers, small and fast storage locations that hold the data currently being processed
- Cache, fast on-chip memory that reduces trips to main memory
Because CPUs are general purpose, they spend cycles on logic checks and operations that are superfluous for deep learning workloads. Additionally, CPUs do not fully exploit deep learning's opportunities for parallelism.
GPUs are specialized hardware components that can perform numerous simple operations simultaneously. GPUs and CPUs share a similar structure—both employing spatial architectures—but otherwise vary greatly.
CPUs are composed of a few ALUs for sequential serial processing. GPUs, on the other hand, include thousands of ALUs that enable the parallel execution of many simple operations. GPUs' parallel execution capability makes them ideal for deep learning execution. However, GPUs consume a large amount of energy, meaning that standard GPUs might not be able to run on many edge devices.
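As a simple illustration of exploiting this parallelism, a framework such as PyTorch lets you run the same model on a CPU or a GPU with a one-line device change; the model and batch below are placeholders.

```python
import torch

# Hypothetical trained model; any torch.nn.Module works the same way.
model = torch.nn.Sequential(torch.nn.Linear(1024, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
model.eval()

# Use a GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# A batch of 512 inputs is processed in parallel across the GPU's many ALUs.
batch = torch.randn(512, 1024, device=device)
with torch.no_grad():
    predictions = model(batch)
print(predictions.shape, "computed on", device)
```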
An FPGA is specialized hardware that users can configure after manufacturing. It includes:
- Configurable logic blocks (CLBs), which implement logic functions
- Programmable interconnects, which route signals between the blocks
- I/O blocks, which connect the chip to external components
This hierarchy allows the blocks to be wired together in different configurations. Users write code in a hardware description language (HDL) like VHDL or Verilog, and that code determines how the blocks are connected and which digital components implement the desired logic.
FPGAs can support a vast number of multiply-accumulate operations, which enables them to implement highly parallel circuits. However, HDL code describes hardware components such as counters and registers rather than software instructions; it is not a conventional programming language. This makes some tasks very difficult, such as translating a Python-based model into FPGA code.
Custom AI chips are hardware built especially for artificial intelligence (AI), such as systems on a chip (SoCs) and application-specific integrated circuits (ASICs) designed for deep learning. Companies worldwide are developing custom AI chips, dedicating significant resources to creating hardware that can perform deep learning operations faster than existing hardware, like GPUs.
AI chips are designed for different purposes, with chips built for training and chips customized especially for inference. Notable solutions include Google's TPU, NVIDIA's NVDLA, Amazon's Inferentia, and Intel's Habana Labs accelerators.
There are three primary challenges you might face when setting up ML inference:
The cost of inference is a key factor in operating machine learning models effectively. ML models are often computationally intensive, requiring GPUs and CPUs running in data centers or cloud environments. It is important to ensure that inference workloads fully utilize the available hardware infrastructure, minimizing the cost per inference. One way to do this is to run queries concurrently or in batches.
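One rough sketch of this idea is shown below: incoming requests are buffered and passed through the model as a single batch instead of one at a time. The model, batch size, and request handling are placeholders; real serving systems typically also add a timeout so partial batches are not delayed indefinitely.

```python
import torch

model = torch.nn.Linear(16, 1)   # placeholder for a trained model
model.eval()

BATCH_SIZE = 64
buffer = []

def handle_request(features):
    """Collect requests and run the model once per full batch."""
    buffer.append(features)
    if len(buffer) < BATCH_SIZE:
        return None                       # wait for more requests
    batch = torch.stack(buffer)           # one large tensor -> one model call
    buffer.clear()
    with torch.no_grad():
        return model(batch)               # amortizes overhead across the batch

# Example: feed 64 fake requests; the last call triggers the batched inference.
for _ in range(BATCH_SIZE):
    out = handle_request(torch.randn(16))
print(out.shape)
```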
A common requirement for inference systems is a maximum acceptable latency. Mission-critical applications often require real-time inference, with responses within milliseconds; for example, fraud detection must score a transaction before it is approved. Other workloads, such as big data analytics, are less latency-sensitive and can process predictions in batches over minutes or hours.
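A quick way to check whether a deployment meets such a latency budget is to time individual inference calls and look at tail percentiles, as in this rough sketch; the model here is a placeholder.

```python
import statistics
import time

import torch

model = torch.nn.Linear(128, 8)   # placeholder for a deployed model
model.eval()

latencies_ms = []
for _ in range(1000):
    x = torch.randn(1, 128)
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p95 = latencies_ms[int(0.95 * len(latencies_ms))]
print(f"median={statistics.median(latencies_ms):.2f} ms, p95={p95:.2f} ms")
```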
When developing ML models, teams use frameworks like TensorFlow, PyTorch, and Keras. Different teams may use different tools to solve their specific problems. However, when running inference in production, these different models need to play well together. Models may need to run in diverse environments including on client devices, at the edge, or in the cloud.
Containerization has become a common practice that eases the deployment of models to production. Many organizations use Kubernetes to deploy models at large scale and organize them into clusters. Kubernetes makes it possible to deploy multiple instances of inference servers and scale them up and down as needed, across public clouds and local data centers.
Run:ai automates resource management and orchestration for machine learning infrastructure. With Run:ai, you can automatically run inference workloads at any scale, on any type of computing infrastructure, whether on-premises or in the cloud.
Here are some of the capabilities you gain when using Run:ai:
Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and improve the quality of their models.
Learn more about the Run:ai GPU virtualization platform.