Understanding Machine Learning Inference

What is Machine Learning Inference?

Machine learning (ML) inference involves applying a machine learning model to a dataset and generating an output or “prediction”. This output might be a numerical score, a string of text, an image, or any other structured or unstructured data. 

Typically, a machine learning model is software code implementing a mathematical algorithm. The machine learning inference process deploys this code into a production environment, making it possible to generate predictions for inputs provided by real end-users.

The machine learning life cycle includes two main parts:

  1. The training phase—involves creating a machine learning model, training it by running the model on labeled data examples, then testing and validating the model by running it on unseen examples.
  2. Machine learning inference—involves putting the model to work on live data to produce an actionable output. During this phase, the inference system accepts inputs from end-users, processes the data, feeds it into the ML model, and serves outputs back to users.

Machine Learning Training Versus Inference

Here is the key difference between training and inference:

  • Machine learning training is the process of using an ML algorithm to build a model. It typically involves using a training dataset and a deep learning framework like TensorFlow. 
  • Machine learning inference is the process of using a trained ML model to make predictions on new data. 
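
The split between the two phases can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the model is a one-variable linear regression, the function names train and infer are hypothetical, and the data is made up to follow y = 2x + 1.

```python
# Minimal sketch: "training" fits parameters from labeled data,
# "inference" applies the fitted parameters to new inputs.

def train(xs, ys):
    """Fit y = w * x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return {"w": w, "b": b}  # the "model" is just its learned parameters


def infer(model, x):
    """Apply the trained model to one unseen input."""
    return model["w"] * x + model["b"]


# Training phase: labeled examples (hypothetical data following y = 2x + 1).
model = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Inference phase: the parameters are frozen; only a prediction is computed.
prediction = infer(model, 10.0)
```

In a real system the training step would usually run in a framework such as TensorFlow, and only the resulting parameters would be shipped to the inference environment.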

How Does Machine Learning Inference Work?

You need three main components to deploy machine learning inference: data sources, a system to host the ML model, and data destinations.

Data Source

A data source captures real-time data, either from an internal source managed by the organization, from external sources, or from users of the application. 

Common examples of data sources for ML applications are log files, transactions stored in a database, or unstructured data in a data lake.

Host System

The ML model's host system receives data from data sources and feeds it into the ML model. It provides the infrastructure on which the ML model’s code can run. After outputs (predictions) are generated by the ML model, the host system sends these outputs to the data destination. 

Common examples of host systems are an API endpoint accepting inputs through a REST API, a web application receiving inputs from human users, or a stream processing application processing large volumes of log data.

Data Destination

The data destination is where the host system delivers the outputs of the ML model. It can be any type of data repository, such as a database, a data lake, or a stream processing system that feeds downstream applications. 

For example, a data destination could be the database of a web application, which stores predictions and allows them to be viewed and queried by end users. In other scenarios, the data destination could be a data lake, where predictions are stored for further analysis by big data tools.
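
The flow between the three components might look like the following sketch. Everything here is a stand-in chosen for illustration: the JSON-per-line log format, the function names, the threshold rule used as a "model", and the Python list that plays the role of a database table.

```python
import json


def read_source(lines):
    """Data source: parse raw log lines (hypothetical JSON-per-line format)."""
    for line in lines:
        yield json.loads(line)


def model_predict(features):
    """Stand-in for a trained model: flags slow requests (hypothetical rule)."""
    return 1 if features["latency_ms"] > 500 else 0


def host_system(records, destination):
    """Host system: feed each record to the model and forward the output."""
    for record in records:
        prediction = model_predict(record)
        destination.append({"id": record["id"], "prediction": prediction})


raw_logs = ['{"id": 1, "latency_ms": 120}', '{"id": 2, "latency_ms": 900}']
destination = []  # data destination: a list standing in for a database table
host_system(read_source(raw_logs), destination)
```

Keeping the three roles in separate functions means the source or the destination can be swapped, for example from a log file to a message stream, without touching the model code.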

What Is a Machine Learning Inference Server?

Machine learning inference servers or engines execute your model algorithm and return an inference output. The inference server works by accepting input data, passing it to a trained ML model, executing the model, and returning the inference output. 
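
As a rough sketch of that request cycle, the class below accepts a JSON request, passes the inputs to a model, and returns a JSON response. It is not the API of any real inference server: the InferenceServer class, the "inputs" and "output" field names, and the toy weighted-sum model are all assumptions made for the example.

```python
import json


class InferenceServer:
    """Hypothetical in-process server: JSON request in, JSON response out."""

    def __init__(self, model):
        self.model = model  # any callable mapping a list of floats to a float

    def handle(self, request_body: str) -> str:
        # 1. Accept and validate the input data.
        try:
            payload = json.loads(request_body)
            inputs = payload["inputs"]
        except (json.JSONDecodeError, KeyError, TypeError):
            return json.dumps({"error": "expected a JSON object with an 'inputs' field"})
        # 2. Execute the trained model on the inputs.
        output = self.model(inputs)
        # 3. Return the inference output to the caller.
        return json.dumps({"output": output})


def toy_model(inputs):
    """Stand-in for a trained model: a fixed weighted sum."""
    weights = [0.5, -1.0, 2.0]
    return sum(w * x for w, x in zip(weights, inputs))


server = InferenceServer(toy_model)
response = server.handle('{"inputs": [2.0, 1.0, 0.5]}')
```

A production server would add transport (HTTP or gRPC), batching, and model loading from a file, but the accept, execute, return loop stays the same.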

ML inference servers require ML model creation tools to export the model in a file format that the server can understand. The Apple Core ML inference engine, for example, reads models stored in the .mlmodel file format. If you used TensorFlow to create your model, you can use a conversion tool, such as Apple's coremltools package, to convert your model to the .mlmodel file format.

You can use the Open Neural Network Exchange Format (ONNX) to improve file format interoperability between various ML inference servers and your model training environments. ONNX offers an open format for representing deep-learning models, providing greater portability of models between ML inference servers and tools for vendors supporting ONNX.

Hardware for Deep Learning Inference

The following hardware systems are commonly used to run machine learning and deep learning inference workloads.

Central Processing Unit (CPU)

A CPU processes instructions to perform a sequence of requested operations. Modern CPUs contain billions of transistors and a small number of powerful cores that can handle a wide range of operations and large memory footprints. Because CPUs are general-purpose, they can run any operation without specialized programming. 

Here are the four building blocks of CPUs: 

  • Control Unit (CU)—directs the processor’s operations, informing other components on how to respond to instructions sent to the processor. 
  • Arithmetic logic unit (ALU)—performs bitwise logical operations and integer arithmetic. 
  • Address generation unit (AGU)—calculates the addresses used to access the main memory. 
  • Memory management unit (MMU)—translates the virtual memory addresses used by programs into physical addresses and enforces memory protection.

The universality of CPUs means they include superfluous logic verifications and operations. Additionally, CPUs do not fully exploit deep learning’s parallelism opportunities.

Graphics Processing Unit (GPU)

GPUs are specialized hardware components that can perform numerous simple operations simultaneously. GPUs and CPUs share a similar structure, since both employ temporal architectures in which centralized control drives the ALUs, but otherwise vary greatly.  

CPUs are composed of a few ALUs for sequential serial processing. GPUs, on the other hand, include thousands of ALUs that enable the parallel execution of many simple operations. This parallel execution capability makes GPUs ideal for deep learning workloads. However, GPUs consume a large amount of energy, meaning that standard GPUs might not be able to run on many edge devices.
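
To see why deep learning maps well onto many simple parallel units, consider the arithmetic of a single dense layer. The Python sketch below runs sequentially and only illustrates the structure of the work; the function names and numbers are made up for the example.

```python
def mac_dot(weights, inputs):
    """Dot product as a chain of multiply-accumulate (MAC) steps."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x  # one multiply-accumulate step
    return acc


def dense_layer(weight_rows, inputs):
    """Compute one output per weight row.

    Each row's result depends only on that row and the shared inputs,
    so parallel hardware could compute all rows at the same time.
    """
    return [mac_dot(row, inputs) for row in weight_rows]


outputs = dense_layer([[1.0, 2.0], [0.5, -1.0]], [3.0, 4.0])
```

Because no output depends on another, the rows can be distributed across many ALUs, which is the kind of workload where a GPU tends to outperform a CPU.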

Field Programmable Gate Array (FPGA)

An FPGA is specialized hardware that users can configure after manufacturing. It includes: 

  • An array of programmable logic blocks 
  • A hierarchy of configurable interconnections 

This hierarchy enables inter-wiring blocks in different configurations. Users can write code in a hardware description language (HDL) like VHDL or Verilog, and the program determines the connections and how to use digital components to implement them. 

FPGAs can support a vast number of multiply-and-accumulate operations, which enables them to implement parallel circuits. However, HDL code describes hardware components like counters and registers; it is not a conventional programming language. This makes some tasks very difficult, such as converting a Python library to FPGA code.

Custom AI Chips (SoC and ASIC)

Custom AI chips provide hardware built especially for artificial intelligence (AI), such as Systems on Chip (SoCs) and Application Specific Integrated Circuits (ASICs) for deep learning. Companies worldwide are developing custom AI chips, dedicating many resources to creating hardware that can perform deep learning operations faster than existing hardware, like GPUs.

AI chips are designed for different purposes, with some chips built for training and others customized especially for inference. Notable solutions include Google’s TPU, NVIDIA’s NVDLA, Amazon’s Inferentia, and the processors from Intel’s Habana Labs.

ML Inference Challenges

There are three primary challenges you might face when setting up ML inference:

Infrastructure Cost

The cost of inference is a key factor in effective operation of machine learning models. ML models are often computationally intensive, requiring GPUs and CPUs running in data centers or cloud environments. It is important to ensure that inference workloads fully utilize the available hardware infrastructure, minimizing the cost per inference. One way to do this is to run queries concurrently or in batches.
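
The effect of utilization on cost can be worked through with simple arithmetic. In the sketch below, the hourly price and the throughput figures are hypothetical numbers chosen only to show the calculation, and the batches helper is one simple way to group queries.

```python
def cost_per_inference(hourly_cost_usd, inferences_per_second):
    """Cost of one inference on hardware billed by the hour."""
    return hourly_cost_usd / (inferences_per_second * 3600)


def batches(items, batch_size):
    """Group queued queries into fixed-size batches (the last may be smaller)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical figures: the same machine serving queries one at a time
# versus in batches that keep the hardware busier.
single = cost_per_inference(hourly_cost_usd=3.60, inferences_per_second=100)
batched = cost_per_inference(hourly_cost_usd=3.60, inferences_per_second=800)

query_batches = list(batches(list(range(10)), batch_size=4))
```

Under these assumed numbers, an eightfold gain in throughput cuts the cost per inference by the same factor, because the hourly price of the hardware does not change.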


Latency

A common requirement for inference systems is a maximum acceptable latency:

  • Mission-critical applications often require real-time inference. Examples include autonomous navigation, critical material handling, and medical equipment. 
  • Some use cases can tolerate higher latency. For example, some big data analytics use cases do not require an immediate response. You can run these analyses in batches based on the frequency of inference queries.
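
One way to check a latency requirement is to time individual predictions and compare a high percentile against the budget. The sketch below assumes a nearest-rank percentile, an integer percentile value, and a trivial stand-in for the model; the function names and the budget are illustrative.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1-100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(predict, inputs):
    """Time each call to predict and return the latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def meets_budget(latencies_ms, budget_ms, pct=95):
    """True if the chosen percentile of latency is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Stand-in model: a trivial function timed over 100 hypothetical inputs.
latencies = measure_latencies_ms(lambda x: x * 2, range(100))
```

Tail percentiles are used here rather than the average because a real-time application is constrained by its slowest responses, not its typical ones.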


Interoperability

When developing ML models, teams use frameworks like TensorFlow, PyTorch, and Keras. Different teams may use different tools to solve their specific problems. However, when running inference in production, these different models need to play well together. Models may need to run in diverse environments including on client devices, at the edge, or in the cloud.
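
One common way to make models from different tools work together is to hide each behind the same small interface. The sketch below is an assumption about how that could be done, not a description of any framework's API: the Predictor protocol, both adapter classes, and the serve function are hypothetical names.

```python
from typing import Callable, Protocol, Sequence


class Predictor(Protocol):
    """Interface the serving code relies on, whatever built the model."""

    def predict(self, features: Sequence[float]) -> float: ...


class LinearModelAdapter:
    """Wraps a hypothetical model stored as weights plus a bias."""

    def __init__(self, weights: Sequence[float], bias: float):
        self._weights = list(weights)
        self._bias = bias

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self._weights, features)) + self._bias


class CallableModelAdapter:
    """Wraps any plain function, such as one exported by another tool."""

    def __init__(self, fn: Callable[[Sequence[float]], float]):
        self._fn = fn

    def predict(self, features: Sequence[float]) -> float:
        return float(self._fn(list(features)))


def serve(model: Predictor, features: Sequence[float]) -> float:
    """Serving code depends only on the shared interface."""
    return model.predict(features)


models = {
    "linear": LinearModelAdapter([1.0, 2.0], bias=0.5),
    "mean": CallableModelAdapter(lambda xs: sum(xs) / len(xs)),
}
results = {name: serve(m, [2.0, 4.0]) for name, m in models.items()}
```

With this arrangement, adding a model from a new framework means writing one adapter rather than changing the serving code.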

Containerization has become a common practice that can ease deployment of models to production. Many organizations use Kubernetes to deploy models at large scale and organize them into clusters. Kubernetes makes it possible to deploy multiple instances of inference servers and scale them up and down as needed, across public clouds and local data centers.

Scaling Machine Learning Inference with Run:ai

Run:ai automates resource management and orchestration for machine learning infrastructure. With Run:ai, you can automatically run inference workloads at any scale, on any type of computing infrastructure, whether on-premises or in the cloud.

Here are some of the capabilities you gain when using Run:ai: 

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists improve their productivity and the quality of their models. 

Learn more about the Run:ai GPU virtualization platform.