Machine learning (ML) infrastructure is the foundation on which machine learning models are developed and deployed. Because models differ between projects, machine learning infrastructure implementations also vary. However, there are core components any machine learning infrastructure needs to be fully functional.
This article explains these components, and reviews important aspects you should consider when creating your machine learning infrastructure.
This is part of our series of articles about machine learning engineering.
Machine learning infrastructure includes the resources, processes, and tooling needed to develop, train, and operate machine learning models. It is sometimes referred to as AI infrastructure or a component of MLOps.
ML infrastructure supports every stage of machine learning workflows. It enables data scientists, engineers, and DevOps teams to manage and operate the various resources and processes required to train and deploy machine learning models.
To understand machine learning infrastructure it helps to first understand its components.
Machine learning model selection is the process of choosing the model best suited to your problem and your data. This choice determines what data is ingested, which tools are used, which components are required, and how those components are interlinked.
Data ingestion capabilities are at the core of any machine learning infrastructure. These capabilities are needed to collect data for model training, application, and refinement.
In terms of tooling, data ingestion requires connections to data sources, processing pipelines, and storage. These tools need to be scalable, flexible, and highly performant. Frequently, extract, load, transform (ELT) pipelines and data lakes are included to meet these needs.
Data ingestion tools enable data from a wide range of sources to be aggregated and stored without requiring significant upfront processing. This allows teams to leverage real-time data and to effectively collaborate on the creation of datasets.
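To make the ELT pattern concrete, here is a minimal, purely illustrative sketch in Python. The in-memory `raw_store`, the `ingest` function, and the source names are all hypothetical stand-ins for a real data lake and ingestion pipeline; the point is only that records are loaded as-is and transformation is deferred to read time.

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory "data lake": raw records are stored as-is
# (extract/load), and transformation is deferred to read time (the "T" in ELT).
raw_store = []

def ingest(source_name, records):
    """Extract records from a source and load them untransformed."""
    for record in records:
        raw_store.append({
            "source": source_name,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "payload": json.dumps(record),  # keep the original payload intact
        })

def read_transformed(field):
    """Apply a lightweight transform only when the data is consumed."""
    return [json.loads(r["payload"]).get(field) for r in raw_store]

# Aggregate data from two different sources without upfront processing.
ingest("clickstream", [{"user": "a", "page": "/home"}])
ingest("crm_export", [{"user": "b", "plan": "pro"}])
print(read_transformed("user"))  # → ['a', 'b']
```

Because nothing is discarded at load time, teams can later re-read the same raw records with different transforms, which is what makes collaboration on shared datasets practical.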
ML pipeline automation
There are numerous tools available that can automate machine learning workflows according to scripts and event triggers. Pipelines are used to process data, train models, perform monitoring tasks, and deploy results. These tools enable teams to focus on higher-level tasks while helping to increase efficiency and ensure the standardization of processes.
When developing your infrastructure, you can create toolchains from scratch by individually integrating and orchestrating tools. You can also adopt pre-built or self-contained pipelines, such as MLflow Pipelines or Apache Airflow. Learn more in our guide about machine learning automation.
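The core idea behind these orchestrators can be sketched in a few lines: register named steps, then run them in order, passing results forward. The decorator, step names, and the stand-in "training" below are purely illustrative; real tools such as Apache Airflow express the same idea as DAGs of tasks with schedules and event triggers.

```python
# Minimal sketch of an automated ML pipeline: registered steps run in a
# fixed order and share a context dict. Step names are illustrative only.
pipeline_steps = []

def step(func):
    """Register a function as a pipeline step (decorator)."""
    pipeline_steps.append(func)
    return func

@step
def process_data(ctx):
    ctx["rows"] = [1.0, 2.0, 3.0]

@step
def train_model(ctx):
    # Stand-in for training: the "model" is just the mean of the data.
    ctx["model"] = sum(ctx["rows"]) / len(ctx["rows"])

@step
def deploy(ctx):
    ctx["deployed"] = True

def run_pipeline():
    ctx = {}
    for s in pipeline_steps:
        s(ctx)  # in practice a scheduler or event trigger would drive this
    return ctx

result = run_pipeline()
print(result["model"], result["deployed"])  # → 2.0 True
```

A real orchestrator adds retries, scheduling, and dependency resolution on top of this basic run-steps-in-order structure.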
Visualization and monitoring
Machine learning visualization and monitoring are used to gain perspective on how smoothly workflows are moving, how accurate model training is, and to derive insights from model results. Visualizations can be integrated at any point in machine learning workflows to enable teams to quickly interpret system metrics. Monitoring should be integrated throughout.
When incorporating visualization and monitoring into your machine learning infrastructure, you need to ensure that tools ingest data consistently. If solutions do not integrate with all relevant data sources you will not get meaningful insights. Additionally, you need to keep in mind the resources that these tools require. Make sure that you are choosing solutions that work efficiently and do not create resource conflicts with your training or deployment tools.
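As a lightweight illustration of production monitoring, the sketch below tracks model accuracy over a rolling window and raises a flag when it drops below a threshold. The class name, window size, and threshold are all hypothetical choices, not a reference to any particular monitoring tool.

```python
from collections import deque

class AccuracyMonitor:
    """Track a rolling window of prediction outcomes and flag degradation.

    Window size and alert threshold are illustrative choices."""
    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, correct):
        self.outcomes.append(1 if correct else 0)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def alert(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

monitor = AccuracyMonitor(window=10, threshold=0.8)
for correct in [True] * 9 + [False] * 3:  # accuracy degrades over time
    monitor.record(correct)
print(monitor.accuracy(), monitor.alert())  # → 0.7 True
```

The same pattern applies to any system metric (latency, throughput, data drift): record continuously, aggregate over a window, and alert when the aggregate crosses a bound.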
Testing machine learning models requires integrating tooling between training and deployment phases. This tooling is used to run models against manually labeled datasets to ensure that the results are as expected. Thorough testing requires:
To set up machine learning testing, you need to add monitoring, data analysis, and visualization tools to your infrastructure. You also need to set up automated creation and management of environments. During set up you should perform integration tests to ensure that components are not causing errors in other components or negatively affecting your test results.
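The testing step described above can be sketched as a simple gate: run the model against a manually labeled dataset, compute accuracy, and refuse to proceed if it falls below a threshold. `toy_model`, the dataset, and the 0.95 threshold are illustrative stand-ins.

```python
# Sketch of a model-testing gate between training and deployment.
def toy_model(x):
    """Trivial stand-in for a trained classifier."""
    return "positive" if x >= 0 else "negative"

labeled_dataset = [  # (input, manually assigned label)
    (3, "positive"), (-1, "negative"), (0, "positive"), (-7, "negative"),
]

def evaluate(model, dataset, min_accuracy=0.95):
    correct = sum(1 for x, label in dataset if model(x) == label)
    accuracy = correct / len(dataset)
    # Fail fast: block deployment if the model underperforms on labeled data.
    if accuracy < min_accuracy:
        raise AssertionError(f"accuracy {accuracy:.2f} below {min_accuracy}")
    return accuracy

print(evaluate(toy_model, labeled_dataset))  # → 1.0
```

Wiring a gate like this into the pipeline (so a failed evaluation stops the deployment step) is what turns manual spot-checks into repeatable integration tests.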
Deployment is the final step that you need to account for in your architecture. This step packages your model and makes it available to development teams for integration into services or applications.
If you are offering Machine Learning as a Service (MLaaS), it may also mean deploying the model to a production environment. This deployment enables you to take data from and return results to users. Typically, MLaaS involves containerizing models. When models are hosted in containers, you can deliver them as scalable, distributed services regardless of end environment.
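Before a model can be containerized, it is typically packaged as a self-describing artifact: the serialized model plus a manifest with its name, version, and format. The sketch below shows one hypothetical way to do this with only the standard library; real deployments usually rely on framework-specific formats, and the trivial dict-of-weights "model" here is a stand-in.

```python
import io
import json
import pickle
import tarfile

# Sketch of packaging a trained model as a self-describing artifact, the
# kind of bundle you would copy into a container image for MLaaS serving.
def package_model(model, name, version):
    manifest = {"name": name, "version": version, "format": "pickle"}
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for fname, data in [
            ("manifest.json", json.dumps(manifest).encode()),
            ("model.pkl", pickle.dumps(model)),
        ]:
            info = tarfile.TarInfo(fname)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# The "model" here is a trivial stand-in (a dict of learned weights).
artifact = package_model({"w": [0.1, 0.2]}, name="demo-model", version="1.0.0")
print(len(artifact) > 0)  # archive is ready to ship into a container
```

Bundling the manifest alongside the weights is what lets a serving container load any model version without hard-coded knowledge of its contents.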
In the deployment stage, it is important to evaluate deep learning frameworks and select those that best fit your needs for ongoing inference on new data. You will need to select and optimize the frameworks that meet your performance requirements in production without exhausting your hardware resources. For example, a computer vision model running in a self-driving car must perform inference at millisecond speeds, while taking into account the hardware available on board the car.
The process of moving models between frameworks, according to production needs, has been made easier in recent years by the development of universal model file formats, such as the Open Neural Network Exchange (ONNX). These formats enable you to more easily port models between libraries.
When creating your machine learning infrastructure there are several considerations that you should keep in mind.
Pay attention to where your machine learning workflows are being conducted. The requirements for on-premises operations vs cloud operations can differ significantly. Additionally, your location of choice should support the purpose of your model.
In the training stage, you should primarily focus on cost considerations and operational convenience. Security and regulations relating to data are also important considerations when deciding where to store training data. Will it be cheaper and/or easier to perform training on premises or in the cloud? The answer may vary depending on the number of models, the size and nature of data being ingested, and your ability to automate the infrastructure.
In the inference stage, the focus should be on balancing between performance and latency requirements vs available hardware in the target location. Models that need a fast response or very low latency should prioritize local or edge infrastructures, and be optimized to run on low-powered local hardware. Models that can tolerate some latency can leverage cloud infrastructure, which can scale up if needed to run “heavier” inference workflows.
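One simple way to reason about the performance-vs-latency trade-off is to measure the inference path against an explicit latency budget. The model, the input, and the 5 ms budget below are illustrative; in practice you would benchmark the real model on the target (edge or cloud) hardware.

```python
import time

# Sketch of checking an inference path against a latency budget, e.g. an
# edge deployment that must respond within a few milliseconds.
def fast_model(x):
    return x * 0.5 + 1.0  # stand-in for an optimized, low-latency model

def measure_latency_ms(model, x, runs=1000):
    """Average per-call latency in milliseconds over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) * 1000 / runs

latency = measure_latency_ms(fast_model, 3.0)
within_budget = latency < 5.0  # would gate an edge rollout in practice
print(f"{latency:.4f} ms per call, within budget: {within_budget}")
```

A model that misses the budget on edge hardware is a candidate for optimization (quantization, pruning) or for relocation to cloud infrastructure where latency tolerance is higher.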
The hardware used for machine learning can have a huge impact on performance and cost. Typically, GPUs are used to run deep learning models, while CPUs are used to run classical machine learning models. In some cases, when traditional ML involves large volumes of data, it can also be accelerated by GPUs using frameworks like NVIDIA's RAPIDS.
In both cases, the efficiency of the GPU or CPU for the algorithms being used will affect operating and cloud costs, hours spent waiting for processes to complete, and, by extension, time to market.
When building your machine learning infrastructure you should find the balance between underpowering and overpowering your resources. Underpowering may save you upfront costs but requires extra time and reduces efficiency. Overpowering ensures that you aren’t restricted by hardware but means you’re paying for unused resources.
The right network infrastructure is vital to ensuring efficient machine learning operations. You need all of your various tools to communicate smoothly and reliably. You also need to ingest and deliver data to and from outside sources without bottlenecks.
To ensure that networking resources meet your needs, you should consider the overall environment you are working in. You should also carefully gauge how well networking capabilities match your processing and storage capabilities. Lightning-fast network speeds aren't helpful if your processing or data retrieval speeds lag.
An automated ML pipeline should have access to an appropriate volume of storage, according to the data requirements of the models. Data-hungry models may require petabytes of storage. You need to consider in advance where to locate this storage – on-premises or in the cloud.
It is always preferable to colocate storage with training. For example, you can run training using TPUs on Google Cloud and store data in Google Cloud Storage, which is highly scalable. Or you could run training on local NVIDIA GPUs and use a large-volume, high-performance distributed file system to store data locally. If you create a hybrid infrastructure, plan data ingestion carefully to prevent delays and complexity in training.
Data center extension
If you are incorporating machine learning into existing business operations you should work to extend your current infrastructure. While it may seem easier to start from scratch, this often isn’t cost-efficient and can negatively affect productivity.
A better option is to evaluate the existing infrastructure resources and tooling you have. Any assets that are suited to your machine learning needs should be integrated. The exception is if you are planning to retire those assets soon. Then, you are better off adopting new resources and tools.
Training and applying models requires extensive amounts of data, which is often valuable or sensitive, for example financial data or medical images. Big data is a big lure for threat actors interested in using data for malicious purposes, such as holding data for ransom or selling it on black markets.
Additionally, depending on the purpose of the model, illegitimate manipulation of data could lead to serious damage. For example, models used for object detection in autonomous vehicles could be manipulated to cause intentional crashes.
When creating your machine learning infrastructure you should take care to build in monitoring, encryption, and access controls to properly secure your data. You should also verify which compliance standards apply to your data. Depending on the results, you may need to limit the physical location of data storage or process data to remove sensitive information before use.
Machine learning infrastructure pipelines set the pace of the entire development cycle. If resource allocation is not properly configured and optimized, you can quickly hit compute or memory bottlenecks.
You can avoid these issues by replacing static allocation and provisioning with automated and dynamic resource management. This capability is enabled by virtualization software like Run:AI, which automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. Learn more about the Run:AI platform.
We have authored in-depth guides on several other artificial intelligence infrastructure topics that can also be useful as you explore the world of deep learning GPUs.
Learn how to assess GPUs to determine which is the best GPU for your deep learning model. Discover types of consumer and data center deep learning GPUs. Get started with PyTorch for GPUs – learn how PyTorch supports NVIDIA's CUDA standard, and get quick technical instructions for using PyTorch with CUDA. Finally, learn about the NVIDIA deep learning SDK, the top NVIDIA GPUs for deep learning, and the best practices you should adopt when using NVIDIA GPUs.
This guide explains the Kubernetes architecture for AI workloads and how K8s came to be used inside many companies. There are specific considerations when implementing Kubernetes to orchestrate AI workloads. Finally, the guide addresses the shortcomings of Kubernetes when it comes to scheduling and orchestration of deep learning workloads, and how you can address those shortfalls.