Machine Learning Operations

What it is and why we need it

This article explains how Machine Learning Operations came to be a discipline inside many companies and what to consider when deciding whether your organization is ready to form an MLOps team.

Machine learning (ML) is a subset of artificial intelligence in which computer systems autonomously learn a task over time. Based on pattern analyses and inference models, ML algorithms allow a computer system to adapt in real time as it is exposed to data and real-world interactions.

For many people, ML was, until recently, considered science fiction. But advances in computational power, frictionless access to scalable cloud resources, and the exponential growth of data have fueled an increase in ML-based applications. Today, ML has a profound impact on a wide range of verticals such as financial services, telecommunications, healthcare, retail, education, and manufacturing. Within all of these sectors, ML is driving faster and better decisions in business-critical use cases, from marketing and sales to business intelligence, R&D, production, executive management, IT, and finance.

As enterprises deepen their engagement with ML, however, they are discovering not only its benefits, but its challenges as well. Definitive surveys do not seem to be available, but numerous opinion leaders have noted that many ML models (some say as high as 70-80%) never make it from the prototype stage to production. A commonly cited reason for this high failure rate is the difficulty in bridging the gap between the data scientists who build and train the inference models and the IT team that maintains the infrastructure as well as the engineers who develop and deploy production-ready ML applications.

In this article, we look more closely at the challenges preventing ML from reaching its full game-changing potential and discuss how a relatively new discipline, Machine Learning Operations, or MLOps, is the key to optimizing the lifecycle of ML applications.

Machine Learning Operations: Getting from Science to Production

Machine learning is rooted in the realm of data science. For an ML inference model to identify a pattern or predict an outcome at runtime, data scientists must make sure it is “asking the right questions,” i.e., looking for the features that are most relevant to the task at hand. Once data scientists have defined an initial set of features, their next task is to identify, aggregate, clean, and annotate a known data set that can be used to train the model to recognize those features. Here, the larger the training data set, the better. The data scientists then continue to optimize the model through a highly iterative process of training, testing, and tuning.
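To make the iterative train/test/tune loop concrete, here is a minimal sketch in plain Python. Everything in it is illustrative: a toy data set (y ≈ 3x with noise), a one-parameter model fit by gradient descent, and a tuning loop that tries several learning rates and keeps the one that performs best on held-out validation data. Real pipelines use ML frameworks, but the shape of the loop is the same.

```python
import random

# Hypothetical toy data: y = 3x plus a little noise (names and numbers are illustrative).
random.seed(0)
data = [(x, 3.0 * x + random.uniform(-0.1, 0.1)) for x in [i / 10 for i in range(100)]]
train, val = data[:80], data[80:]  # simple train / validation split

def train_model(samples, lr, epochs=200):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w

def mse(w, samples):
    """Mean squared error of the fitted slope on a sample set."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

# The iterative tuning loop: train under several hyperparameter settings,
# evaluate each on held-out data, and keep the best-performing model.
best_lr, best_w, best_err = None, None, float("inf")
for lr in (0.001, 0.01, 0.1):
    w = train_model(train, lr)
    err = mse(w, val)
    if err < best_err:
        best_lr, best_w, best_err = lr, w, err

print(f"best lr={best_lr}, learned w={best_w:.2f}")
```

Note how the overly aggressive learning rate (0.1) simply loses the comparison on validation error; the loop, not the data scientist, weeds it out.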

In addition to data science expertise, developing an ML model also involves considerable IT and infrastructure skills. Huge data sets have to be aggregated, stored, moved, protected, and managed. Training and testing the models require very high levels of compute capacity and performance.

Thus, one of the first challenges in accelerating the ML development lifecycle is to abstract the infrastructure layer from the data science. In much the same way that DevOps freed developers from infrastructure issues, allowing them to concentrate on application development, a simple and easy-to-use research environment is necessary to allow data scientists to focus on model development rather than infrastructure provisioning and monitoring.

Other challenges are related to an inherent disconnect between data scientists and the engineers who must operationalize the models in production-ready applications. Each group works in its own silo, with its own unique mindset, concepts, processes, and tool stacks. In many cases, the engineering team may have difficulty simply understanding the model handed off to them by the data scientists. And once the model is in production, it is tough for the operations team to understand which metrics and parameters need to be tracked in order to effectively monitor accuracy and performance. It’s also not easy to establish the critical feedback loop for the data science team to be able to continue improving the inference model while ensuring that the updated models don’t have a negative impact on application performance.

Why is MLOps Important? Closing the Loop with Machine Learning Operations

MLOps (Machine Learning Operations) is a relatively new discipline that seeks to systematize the entire ML lifecycle, from science to production. It started as a set of best practices to improve the communications between data scientists and DevOps teams—promoting workflows and processes that could accelerate the time to market for ML applications. Soon, open source MLOps frameworks began to emerge, such as MLflow and Kubeflow.
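One of the first things frameworks such as MLflow systematize is experiment tracking: recording the parameters, metrics, and model version of every run so that data scientists and operations teams share a single record of what was trained and how it performed. The following is a deliberately simplified, hypothetical stand-in for that pattern (not the MLflow API itself), just to show the idea:

```python
import time

# A minimal, hypothetical stand-in for the run-tracking pattern that MLOps
# frameworks such as MLflow provide. Class and field names are illustrative.
class RunTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, model_version):
        """Record one experiment: its hyperparameters, results, and model version."""
        self.runs.append({
            "params": params,
            "metrics": metrics,
            "model_version": model_version,
            "timestamp": time.time(),
        })

    def best_run(self, metric):
        """Return the run with the highest value of the given metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = RunTracker()
tracker.log_run({"lr": 0.01}, {"accuracy": 0.91}, "v1")
tracker.log_run({"lr": 0.001}, {"accuracy": 0.94}, "v2")
print(tracker.best_run("accuracy")["model_version"])
```

With a shared record like this, the handoff question "which model should go to production, and how was it trained?" has an auditable answer rather than a tribal one.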

Today, MLOps capabilities are considered a key requirement for Data Science and Machine Learning (DSML) platforms. Gartner’s “2020 Magic Quadrant for Data Science and Machine Learning Platforms” cites MLOps as a key inclusion criterion, noting that “…[a]s DSML moves out of the lab and into the mainstream, it must be operationalized with seamless integration and carefully designed architecture and processes. Machine learning operations capabilities should also include explainability, versioning of models and business impact analysis, among others.” (Source: A report reprint, available to Gartner subscribers only.)

As shown in Figure 1 below, the next-generation data science lifecycle breaks down the silos among all the different stakeholders that need to be involved for ML projects to capture business value. This starts with the modeling and data acquisition activities of the data science team being informed by a clear understanding of the business objectives for the ML application—as well as of the governance and compliance issues that should be taken into account. The MLOps model then ensures that the data science, production, and operations teams work seamlessly together across ML workflows that are as automated as possible, ensuring smooth deployments and effective ongoing monitoring. Performance issues, as well as new production data, are reflected back to the data science team so that they can tune and improve the model, which is then thoroughly tested by the operations team before being put into production.

Figure 1: MLOps Drives Data Science Success and Value. (Source: Azure)

In short, machine learning operations is the critical missing link that allows IT to support the highly specialized infrastructure requirements of ML workloads. The cyclical, highly automated MLOps approach:

  • Reduces the time and complexity of moving models into production.
  • Enhances communications and collaboration across teams that are often siloed: data science, development, operations.
  • Streamlines the interface between R&D processes and infrastructure, in general, and operationalizes the use of specialized hardware accelerators (such as GPUs), in particular.
  • Operationalizes model issues critical to long-term application health, such as versioning, tracking, and monitoring.
  • Makes it easier to monitor and understand ML infrastructure and compute costs at all stages, from development to production.
  • Standardizes the ML process and makes it more auditable for regulation and governance purposes.

Stay Ahead of the ML Curve

In today’s highly competitive economy, enterprises are looking to Artificial Intelligence in general, and Machine and Deep Learning in particular, to transform big data into actionable insights. These insights can help them better address their target audiences, improve their decision-making processes, and streamline their supply chains and production processes, to mention just a few of the many use cases. To stay ahead of the curve and capture the full value of ML, however, companies must strategically embrace MLOps.

Run:AI’s AI/ML virtualization platform is an important enabler for Machine Learning Operations teams. Focusing on deep learning neural network models that are particularly compute-intensive, Run:AI creates a pool of shared GPU and other compute resources that are provisioned dynamically to meet the needs of jobs in process. By abstracting workloads from the underlying infrastructure, the platform lets organizations embrace MLOps, allowing data scientists to focus on models while IT teams gain control and real-time visibility of compute resources across multiple sites, both on-premises and in the cloud.

See for yourself how Run:AI can operationalize your data science projects, accelerating their journey from research to production.

Learn More About Machine Learning Operations

There’s a lot more to learn about machine learning operations. To continue your research, take a look at the rest of our guides on this topic:

Machine Learning Infrastructure: Components of Effective Pipelines

Machine learning infrastructure includes the processes, resources, and tooling needed to develop, train, and operate machine learning models. It is sometimes referred to as AI infrastructure or a component of MLOps.

ML infrastructure supports every stage of machine learning workflows. It enables engineers, data scientists, and DevOps teams to manage and operate the required resources and processes.

Read more: Machine Learning Infrastructure: Components of Effective Pipelines

Machine Learning Automation: Speeding Up the Data Science Pipeline

Machine learning automation enables data scientists to automate the machine learning workflow. Without automation, the ML workflow, from data preparation and training through actual deployment, can take a very long time, even months.

Read more: Machine Learning Automation: Speeding Up the Data Science Pipeline

Machine Learning Workflow: Streamlining Your ML Pipeline

Machine learning workflows define which stages are implemented during a machine learning project. The common stages include data collection, data pre-processing, building datasets, model training, and deployment to production. You can automate some aspects of the workflow, such as model and feature selection phases, but not all.
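The stages named above can be sketched as composable pipeline steps. The sketch below is purely illustrative (toy data, a trivial "model" that estimates a ratio, and invented function names), but it shows the shape such a workflow takes, including an automated gate before deployment:

```python
# A hedged sketch of the workflow stages named above: data collection,
# pre-processing, dataset building, training, and deployment. All names
# and the toy "model" are illustrative, not a real framework.

def collect_data():
    # Stand-in for pulling raw records from a source system.
    return [{"x": i, "y": 2 * i} for i in range(10)]

def preprocess(records):
    # Drop invalid rows; real pipelines would also clean and normalize.
    return [r for r in records if r["y"] is not None]

def build_datasets(records, split=0.8):
    # Split cleaned records into training and held-out test sets.
    cut = int(len(records) * split)
    return records[:cut], records[cut:]

def train(train_set):
    # Toy "model": estimate the ratio y/x (skipping x == 0).
    ratios = [r["y"] / r["x"] for r in train_set if r["x"] != 0]
    return sum(ratios) / len(ratios)

def deploy(model, test_set):
    # Gate deployment on held-out accuracy, as an automated workflow might.
    errors = [abs(model * r["x"] - r["y"]) for r in test_set]
    return max(errors) < 1e-6

raw = collect_data()
clean = preprocess(raw)
train_set, test_set = build_datasets(clean)
model = train(train_set)
deployed = deploy(model, test_set)
print(f"model={model}, deployed={deployed}")
```

In practice, the automatable stages (here, everything after collection) are wired into a pipeline orchestrator, while steps such as feature design remain human judgment calls.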

Read more: Machine Learning Workflow: Streamlining Your ML Pipeline

See Our Additional Guides on Key Artificial Intelligence Infrastructure Topics

We have authored in-depth guides on several other artificial intelligence infrastructure topics that can also be useful as you deepen your knowledge of AI infrastructure.

GPUs for Deep Learning

Learn how to assess GPUs to determine which is the best GPU for your deep learning model. Discover types of consumer and data center deep learning GPUs. Get started with PyTorch for GPUs – learn how PyTorch supports NVIDIA’s CUDA standard, and get quick technical instructions for using PyTorch with CUDA. Finally, learn about the NVIDIA deep learning SDK, the top NVIDIA GPUs for deep learning, and the best practices you should adopt when using NVIDIA GPUs.


Kubernetes and AI

This guide explains the Kubernetes architecture for AI workloads and how K8s came to be used inside many companies. There are specific considerations when implementing Kubernetes to orchestrate AI workloads. Finally, the guide addresses the shortcomings of Kubernetes when it comes to scheduling and orchestration of Deep Learning workloads, and how you can address those shortfalls.
