Question 1

Components of Effective Pipelines

Accepted Answer

Machine learning (ML) infrastructure is the foundation on which machine learning models are developed and deployed. Because models differ between projects, machine learning infrastructure implementations also vary. However, there are core components any machine learning infrastructure needs to be fully functional. This article explains these components and reviews important aspects you should consider when creating your machine learning infrastructure.

Question 2

What Is Machine Learning Infrastructure?

Accepted Answer

Machine learning infrastructure includes the resources, processes, and tooling needed to develop, train, and operate machine learning models. It is sometimes referred to as AI infrastructure or a component of MLOps. ML infrastructure supports every stage of machine learning workflows. It enables data scientists, engineers, and DevOps teams to manage and operate the various resources and processes required to train and deploy neural network models.

Question 3

Key Considerations for Infrastructure that Supports ML

Accepted Answer

When creating your machine learning infrastructure, there are several considerations that you should keep in mind.

Location

Pay attention to where your machine learning workflows are being conducted. The requirements for on-premises operations vs cloud operations can differ significantly. Additionally, your location of choice should support the purpose of your model.

In the training stage, you should primarily focus on cost considerations and operational convenience. Security and regulations relating to data are also important considerations when deciding where to store training data. Will it be cheaper and/or easier to perform training on premises or in the cloud? The answer may vary depending on the number of models, the size and nature of data being ingested, and your ability to automate the infrastructure.

In the inference stage, the focus should be on balancing performance and latency requirements against the hardware available in the target location. Models that need a fast response or very low latency should prioritize local or edge infrastructure and be optimized to run on low-powered local hardware. Models that can tolerate some latency can leverage cloud infrastructure, which can scale up if needed to run “heavier” inference workflows.

Compute requirements

The hardware used for machine learning can have a huge impact on performance and cost. Typically, GPUs are used to run deep learning models, and CPUs are used to run classical machine learning models. In some cases, when classical ML uses large volumes of data, it can also be accelerated by GPUs using frameworks like NVIDIA’s RAPIDS.

In both cases, the efficiency of the GPU or CPU for the algorithms being used will affect operating and cloud costs, hours spent waiting for processes to complete, and by extension, time to market.

When building your machine learning infrastructure, you should find the balance between underpowering and overpowering your resources. Underpowering may save you upfront costs but requires extra time and reduces efficiency. Overpowering ensures that you aren’t restricted by hardware but means you’re paying for unused resources.
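The trade-off between underpowering and overpowering can be made concrete with simple arithmetic. The sketch below is only an illustration: the option names, hourly prices, relative speeds, and the 48-hour baseline are all hypothetical numbers, not measurements or real price lists.

```python
from dataclasses import dataclass


@dataclass
class ComputeOption:
    name: str
    hourly_cost: float     # hypothetical price per hour
    relative_speed: float  # throughput relative to the baseline option (1.0)


def job_hours_and_cost(option: ComputeOption, baseline_hours: float) -> tuple[float, float]:
    """Return (wall-clock hours, total cost) for one training job."""
    hours = baseline_hours / option.relative_speed
    return hours, hours * option.hourly_cost


# Hypothetical options: "large" is faster than "medium" but priced far higher.
options = [
    ComputeOption("small", hourly_cost=1.0, relative_speed=1.0),
    ComputeOption("medium", hourly_cost=3.0, relative_speed=4.0),
    ComputeOption("large", hourly_cost=12.0, relative_speed=6.0),
]

BASELINE_HOURS = 48.0  # assumed job duration on the "small" option

results = {o.name: job_hours_and_cost(o, BASELINE_HOURS) for o in options}
for name, (hours, cost) in results.items():
    print(f"{name}: {hours:.1f} h, total cost {cost:.2f}")

# The cheapest option per job is not necessarily the smallest or the fastest.
cheapest = min(results, key=lambda name: results[name][1])
print("cheapest per job:", cheapest)
```

With these assumed numbers, the underpowered option is slow and not the cheapest, while the overpowered one finishes soonest at the highest cost. Real decisions would also need to account for utilization, queueing, and how many jobs share the hardware.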

Network infrastructure

The right network infrastructure is vital to ensuring efficient machine learning operations. You need all of your various tools to communicate smoothly and reliably. You also need to ingest and deliver data to and from outside sources without bottlenecks.

To ensure that networking resources meet your needs, you should consider the overall environment you are working in. You should also carefully gauge how well networking capabilities match your processing and storage capabilities. Lightning-fast network speeds aren’t helpful if your processing or data retrieval speeds lag.
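One way to reason about this matching is to treat data delivery as a chain of stages whose sustained throughput is capped by the slowest one. The following is a minimal sketch under that simplifying assumption; the stage names and rates are hypothetical.

```python
def find_bottleneck(stage_rates_mb_s: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and its rate.

    Assumes stages run in sequence on the same data stream, so sustained
    end-to-end throughput cannot exceed the slowest stage.
    """
    slowest = min(stage_rates_mb_s, key=stage_rates_mb_s.get)
    return slowest, stage_rates_mb_s[slowest]


# Hypothetical sustained rates in MB/s for each stage of a data path.
stages = {
    "storage_read": 500.0,
    "network": 1250.0,
    "preprocessing": 300.0,
}

bottleneck_name, bottleneck_rate = find_bottleneck(stages)
print(f"bottleneck: {bottleneck_name} at {bottleneck_rate} MB/s")
```

In this example the network has the most headroom, so upgrading it would not improve throughput; the preprocessing stage would need attention first.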

Storage infrastructure

An automated ML pipeline should have access to an appropriate volume of storage, according to the data requirements of the models. Data-hungry models may require petabytes of storage. You need to consider in advance where to locate this storage: on premises or in the cloud.

It is generally preferable to colocate storage with training. For example, you can run training using TPUs on Google Cloud and store the data in Google Cloud Storage, which scales well beyond the needs of most workloads. Or you could run training on local NVIDIA GPUs and use a high-capacity, high-performance distributed file system to store data locally. If you create a hybrid infrastructure, plan data ingestion carefully to prevent delays and complexity in training.
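To see why colocation matters, it can help to estimate how long it takes to move a dataset between locations. The sketch below is a back-of-the-envelope calculation only; the 10 TB dataset size, the 1 Gbit/s link, and the 70% efficiency factor are assumptions chosen for illustration.

```python
def transfer_hours(dataset_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Estimate hours needed to move a dataset over a network link.

    `efficiency` is the assumed fraction of nominal bandwidth actually
    achieved (protocol overhead, contention); 1.0 means the ideal case.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


DATASET_BYTES = 10e12  # assumed 10 TB of training data
LINK_BPS = 1e9         # assumed 1 Gbit/s link between sites

hours_ideal = transfer_hours(DATASET_BYTES, LINK_BPS)
hours_realistic = transfer_hours(DATASET_BYTES, LINK_BPS, efficiency=0.7)
print(f"ideal: {hours_ideal:.1f} h, at 70% efficiency: {hours_realistic:.1f} h")
```

Under these assumptions a single full copy takes roughly a day, which is why repeated transfers between on-premises and cloud locations are worth planning around in a hybrid setup.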

Data center extension

If you are incorporating machine learning into existing business operations, you should work to extend your current infrastructure. While it may seem easier to start from scratch, this often isn’t cost-efficient and can negatively affect productivity.

A better option is to evaluate the existing infrastructure resources and tooling you have. Any assets that are suited to your machine learning needs should be integrated. The exception is if you are planning to retire those assets soon. Then, you are better off adopting new resources and tools.

Security

Training and applying models requires extensive amounts of data, which is often valuable or sensitive, such as financial data or medical images. Big data is a big lure for threat actors interested in using data for malicious purposes, like holding it for ransom or selling stolen data on black markets.

Additionally, depending on the purpose of the model, illegitimate manipulation of data could lead to serious damage. For example, models used for object detection in autonomous vehicles could be manipulated to cause intentional crashes.

When creating your machine learning infrastructure, you should take care to build in monitoring, encryption, and access controls to properly secure your data. You should also verify which compliance standards apply to your data. Depending on the results, you may need to limit the physical location of data storage or process data to remove sensitive information before use.
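As one illustration of processing data to remove sensitive information before use, the sketch below replaces selected fields with keyed hashes using Python's standard `hmac` and `hashlib` modules. The field names, the record layout, and the key handling are hypothetical, and keyed hashing is pseudonymization rather than full anonymization, so whether it satisfies a given compliance standard has to be checked separately.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as sensitive in this example.
SENSITIVE_FIELDS = {"name", "email"}


def pseudonymize(record: dict, key: bytes) -> dict:
    """Return a copy of `record` with sensitive fields replaced by keyed hashes.

    The same input and key always give the same token, so records can still
    be joined on the pseudonymized field without exposing the raw value.
    """
    cleaned = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()
        else:
            cleaned[field] = value
    return cleaned


# In practice the key would come from a secrets manager, not source code.
example_key = b"replace-with-a-secret-key"
raw = {"name": "Alice Example", "email": "alice@example.com", "age": 41}
safe = pseudonymize(raw, example_key)
print(safe)
```

The non-sensitive fields pass through unchanged, and the original record is left unmodified so the raw data can remain in a more tightly controlled store.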
