Machine learning operations (MLOps) is the practice of creating new machine learning (ML) and deep learning (DL) models and running them through a repeatable, automated workflow that deploys them to production.
MLOps was inspired by DevOps. The DevOps movement defined a new, agile software development lifecycle (SDLC), which encouraged frequent innovation. Developers work on small, frequent releases, each of which undergoes automated testing and is automatically deployed to production.
Similarly, MLOps defines a new lifecycle for AI technology that allows rapid experimentation, in response to business need or live model performance, and seamless deployment of new models as a predictive service.
An MLOps pipeline provides a variety of services to data science teams, including model version control, continuous integration and continuous delivery (CI/CD), model service catalogs for models in production, infrastructure management, monitoring of live model performance, security, and governance.
This is part of an extensive series of guides about AI Technology.
In this article:
MLOps started as a set of best practices to improve the communications between data scientists and DevOps teams—promoting workflows and processes that could accelerate the time to market for ML applications. Soon, open source MLOps frameworks began to emerge, such as MLflow and Kubeflow.
Today, MLOps capabilities are considered a key requirement for Data Science and Machine Learning (DSML) platforms. Gartner’s “2020 Magic Quadrant for Data Science and Machine Learning Platforms” cites MLOps as a key inclusion criterion, noting that “…[a]s DSML moves out of the lab and into the mainstream, it must be operationalized with seamless integration and carefully designed architecture and processes. Machine learning operations capabilities should also include explainability, versioning of models and business impact analysis, among others.” (Source: A report reprint, available to Gartner subscribers only.)
As shown in the diagram below, the next-generation data science lifecycle breaks down the silos among all the different stakeholders that need to be involved for ML projects to capture business value. This involves:
MLOps is the critical missing link that allows IT to support the highly specialized infrastructure requirements of ML workloads. The cyclical, highly automated MLOps approach:
MLOps was inspired by DevOps, and the two approaches are inherently similar. However, there are a few ways in which MLOps differs significantly from DevOps:
This discussion of MLOps maturity is based on a framework by Google Cloud.
At this level of maturity, a team is able to build useful ML/DL models, but has a completely manual process for deploying them to production. The ML pipeline looks like this:
At this level of maturity, there is an understanding that the model needs to be managed in a CI/CD pipeline, and training/validation needs to be performed continuously on new data. The ML Pipeline now evolves to look like this:
At this highest level of MLOps maturity, new experiments are seamlessly deployed to production with minimal involvement of engineers. A data scientist can easily create a new ML pipeline and automatically build, test, and deploy it to a target environment. This type of setup is illustrated in the following diagram.
A fully automated CI/CD pipeline works like this:
Here are a few steps to implementing MLOps in your organization.
To succeed in MLOps, establish a hybrid team including some or all of the following roles. All these roles should work together, assuming shared ownership for ML models working effectively in production:
To form a true cross-functional team, each of these roles should have at least some of the skills of the other roles. Data scientists should be able to code and know the basics of DevOps; machine learning engineers should understand the experimentation process; and DevOps or data engineers should be familiar with machine learning concepts and should not treat models as a black box.
ML pipelines are the “factory floor” of a data science team. Ensure your ML pipeline includes:
Ensure you track everything in the pipeline using version control. An MLOps pipeline has two parallel versioning systems:
Each version of the model should be tied to a version of model code, giving the MLOps team a clear audit trail showing what ran where. This way, if a specific version of the model resulted in great performance, or conversely performed poorly, it can be tied back to specific data, parameters, and implementation code.
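This audit trail can be sketched with a toy in-memory registry. A real pipeline would use a purpose-built tool such as MLflow or DVC; the class names, version strings, and values below are illustrative assumptions, not part of any specific framework:

```python
from dataclasses import dataclass

@dataclass
class ModelRecord:
    """One entry in the model-side versioning system, linked to code and data."""
    model_version: str
    code_commit: str    # git commit of the training/implementation code
    data_version: str   # tag or fingerprint of the training dataset
    params: dict        # hyperparameters used for this run
    metrics: dict       # validation results for this version

class ModelRegistry:
    """Toy registry: every model version maps back to code, data, and params."""
    def __init__(self):
        self._records = {}

    def register(self, record: ModelRecord) -> None:
        self._records[record.model_version] = record

    def audit(self, model_version: str) -> ModelRecord:
        # The audit trail: what code and data produced this model?
        return self._records[model_version]

registry = ModelRegistry()
registry.register(ModelRecord(
    model_version="v1.3.0",
    code_commit="9f2c1ab",
    data_version="train-2024-06",
    params={"learning_rate": 0.01, "max_depth": 6},
    metrics={"auc": 0.91},
))

record = registry.audit("v1.3.0")
print(record.code_commit, record.data_version)
```

With this linkage in place, a poorly performing model version can be traced in one lookup to the exact code commit, dataset, and hyperparameters that produced it.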
Ensure the MLOps pipeline automatically validates model performance. In a DevOps environment, software undergoes automated testing to see if it is good enough to run in production. This testing is usually of a “pass/fail” nature. An MLOps pipeline, by contrast, needs to test a model’s performance and determine if it is “good enough” to run in production.
Model validation typically involves:
All this can be done automatically, and if the model passes a certain threshold, it is deployed. In other cases, data scientists can review model results and make a qualitative decision whether to push it to production or not.
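The threshold-based decision described above can be sketched as follows. The metric names, threshold values, and the rule that a candidate must also beat the current production model are illustrative assumptions, not a prescribed standard:

```python
def validate_model(candidate_metrics, production_metrics, thresholds):
    """Decide whether a candidate model is "good enough" to deploy.

    Unlike pass/fail software tests, this compares continuous metrics:
    the candidate must clear absolute thresholds AND not regress
    against the model currently serving in production.
    """
    for metric, minimum in thresholds.items():
        if candidate_metrics.get(metric, 0.0) < minimum:
            return False, f"{metric} below threshold {minimum}"
    for metric, prod_value in production_metrics.items():
        if candidate_metrics.get(metric, 0.0) < prod_value:
            return False, f"{metric} worse than production ({prod_value})"
    return True, "candidate approved for deployment"

ok, reason = validate_model(
    candidate_metrics={"accuracy": 0.93, "auc": 0.95},
    production_metrics={"accuracy": 0.91, "auc": 0.94},
    thresholds={"accuracy": 0.90, "auc": 0.92},
)
print(ok, reason)  # ok is True here: the candidate beats both gates
```

In a fully automated pipeline the `True` result would trigger deployment; in a semi-automated one, it would instead queue the candidate for a data scientist's qualitative review.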
It is not enough to validate the model; you must also automatically validate your datasets. An MLOps pipeline must validate that the data used to train the model has the required characteristics. This is similar to unit testing in a traditional DevOps pipeline. Use automated checks to verify that the data is in the correct format, that there are no missing values (if none are expected), and that it passes standardized data quality tests.
Another important check is to compare the data to previous training runs. If the statistical properties of the data changed (for example, the mean or distribution is significantly different), this can affect the model’s predictions. This might mean the data is skewed, or that model inputs are really changing, and the model needs to change to adapt.
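Both kinds of check, schema validation and comparison against a previous training run, can be sketched in plain Python. The column names, schema, and drift tolerance are illustrative assumptions (a production pipeline would typically use a library such as Great Expectations or TFDV):

```python
import statistics

def validate_dataset(rows, schema, reference_stats, drift_tolerance=0.1):
    """Validate training data before it reaches the model.

    Checks, analogous to unit tests in a DevOps pipeline:
    1. every row has the expected columns, types, and no missing values;
    2. each numeric column's mean has not drifted by more than
       `drift_tolerance` (relative) from the previous training run.
    Returns a list of error strings; an empty list means the data passes.
    """
    errors = []
    for i, row in enumerate(rows):
        for col, col_type in schema.items():
            if row.get(col) is None:
                errors.append(f"row {i}: missing value in '{col}'")
            elif not isinstance(row[col], col_type):
                errors.append(f"row {i}: '{col}' is not {col_type.__name__}")
    for col, ref_mean in reference_stats.items():
        values = [r[col] for r in rows if r.get(col) is not None]
        if values:
            mean = statistics.mean(values)
            if abs(mean - ref_mean) > drift_tolerance * abs(ref_mean):
                errors.append(f"'{col}' mean drifted: {mean:.2f} vs {ref_mean:.2f}")
    return errors

rows = [{"age": 34, "income": 52000.0}, {"age": 41, "income": 61000.0}]
errors = validate_dataset(
    rows,
    schema={"age": int, "income": float},
    reference_stats={"age": 38.0},  # mean age seen in the previous training run
)
print(errors)  # an empty list means the data passed both checks
```

A non-empty error list would stop the pipeline before training, so a skewed or malformed dataset never silently produces a degraded model.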
Ensure you monitor regular operational properties such as latency, system load, and errors, and in addition monitor the performance of your production ML models:
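Monitoring of both kinds can be sketched with a sliding-window monitor. The baseline value, window size, and tolerance are illustrative assumptions; using prediction-distribution drift as a proxy for model health is one common technique when true labels arrive late:

```python
from collections import deque
import statistics

class ModelMonitor:
    """Track live predictions alongside operational metrics.

    Keeps a sliding window of recent predictions and latencies, and
    flags when the prediction mean drifts from the training-time
    baseline -- a common proxy for model or data drift in production.
    """
    def __init__(self, baseline_mean, window=1000, tolerance=0.15):
        self.baseline_mean = baseline_mean
        self.tolerance = tolerance
        self.predictions = deque(maxlen=window)
        self.latencies_ms = deque(maxlen=window)

    def record(self, prediction, latency_ms):
        # Called on every live inference request.
        self.predictions.append(prediction)
        self.latencies_ms.append(latency_ms)

    def drift_alert(self):
        # True when the recent prediction mean strays too far from baseline.
        if not self.predictions:
            return False
        mean = statistics.mean(self.predictions)
        return abs(mean - self.baseline_mean) > self.tolerance * abs(self.baseline_mean)

monitor = ModelMonitor(baseline_mean=0.30)
for p in (0.31, 0.29, 0.32, 0.30):
    monitor.record(p, latency_ms=12.0)
print(monitor.drift_alert())  # False: predictions stay near the baseline
```

When `drift_alert()` fires, the MLOps pipeline can alert the team or automatically trigger retraining on fresh data, closing the loop described above.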
Learn more about cutting edge MLOps tools in our guides to:
In today’s highly competitive economy, enterprises are looking to Artificial Intelligence in general and Machine and Deep Learning in particular to transform big data into actionable insights that can help them better address their target audiences, improve their decision-making processes, and streamline their supply chains and production processes, to mention just a few of the many use cases out there. In order to stay ahead of the curve and capture the full value of ML, however, companies must strategically embrace MLOps.
Run:ai’s AI/ML virtualization platform is an important enabler for Machine Learning Operations teams. Focusing on deep learning neural network models that are particularly compute-intensive, Run:AI creates a pool of shared GPU and other compute resources that are provisioned dynamically to meet the needs of jobs in process. By abstracting workloads from the underlying infrastructure, organizations can embrace MLOps and allow data scientists to focus on models, while letting IT teams gain control and real-time visibility of compute resources across multiple sites, both on-premises and in the cloud.
See for yourself how Run:AI can operationalize your data science projects, accelerating their journey from research to production.
Apache Airflow: Use Cases, Architecture, and Best Practices
Apache Airflow is an open-source platform for authoring, scheduling and monitoring data and computing workflows.
Understand how Apache Airflow can help you automate workflows for ETL, DevOps and machine learning tasks.
Read more: Apache Airflow: Use Cases, Architecture, and Best Practices
Edge AI: Benefits, Use Cases, and Deployment Models
Edge computing helps make data storage and computation more accessible to users. This is achieved by running operations on local devices like laptops, Internet of Things (IoT) devices, or dedicated edge servers. Edge processes are not affected by the latency and bandwidth issues that often hamper the performance of cloud-based operations.
Learn how edge AI is making real-time AI inference a reality for mobile devices, IoT, video analytics, and more.
Read more: Edge AI: Benefits, Use Cases, and Deployment Models
JupyterHub: A Practical Guide
Jupyter Notebook is an open source application, used by data scientists and machine learning professionals to author and present code, explanatory text, and visualizations. JupyterHub is an open source tool that lets you host a distributed Jupyter Notebook environment.
Learn how JupyterHub works in depth, see two quick deployment tutorials, and learn to configure the user environment.
Read more: JupyterHub: A Practical Guide
MLflow: The Basics and a Quick Tutorial
MLflow is an open source platform for managing machine learning workflows. It is used by machine learning engineering teams and data scientists. MLflow has four main components: Tracking, Projects, Models, and Model Registry.
Understand MLflow tracking, projects, and models, and see a quick tutorial showing how to train a machine learning model and deploy it to production.
Read more: MLflow: The Basics and a Quick Tutorial
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of AI Technology.