Machine Learning Operations (MLOps) – What It Is and Why We Need It

– Ronen Dar, CTO and Co-founder, Run:AI – 

Machine learning (ML) is a subset of artificial intelligence in which computer systems autonomously learn a task over time. Based on pattern analyses and inference models, ML algorithms allow a computer system to adapt in real time as it is exposed to data and real-world interactions.

For many people, ML was, until recently, considered science fiction. But advances in computational power, frictionless access to scalable cloud resources, and the exponential growth of data have fueled an increase in ML-based applications. Today, ML has a profound impact on a wide range of verticals such as financial services, telecommunications, healthcare, retail, education, and manufacturing. Within all of these sectors, ML is driving faster and better decisions in business-critical use cases, from marketing and sales to business intelligence, R&D, production, executive management, IT, and finance. 

As enterprises deepen their engagement with ML, however, they are discovering not only its benefits but also its challenges. Definitive surveys do not seem to be available, but numerous opinion leaders have noted that many ML models (some say as many as 70-80%) never make it from the prototype stage to production. A commonly cited reason for this high failure rate is the difficulty of bridging the gap between three groups: the data scientists who build and train the inference models, the IT team that maintains the infrastructure, and the engineers who develop and deploy production-ready ML applications.

In this article, we look more closely at the challenges preventing ML from reaching its full game-changing potential and discuss how a relatively new discipline, Machine Learning Operations, or MLOps, is the key to optimizing the lifecycle of ML applications.

Machine Learning: Getting from Science to Production

Machine learning is rooted in the realm of data science. Data scientists must make sure that the ML inference model used at runtime to identify a pattern or predict an outcome is “asking the right questions,” i.e., looking for the features that are most relevant to the task at hand. Once data scientists have defined an initial set of features, their next task is to identify, aggregate, clean, and annotate a known data set that can be used to train the model to recognize those features. Here, all else being equal, a larger and more representative training data set generally yields a better model. The data scientists then continue to optimize the model under development through a highly iterative process of training, testing, and tuning.
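As a rough illustration of the train, test, and tune loop described above, the following Python sketch tunes a toy one-parameter model. Everything in it is an assumption made for illustration: the synthetic data, the 0.6 decision boundary, the function names (make_data, accuracy, tune), and the use of a simple candidate sweep in place of real model training.

```python
import random


def make_data(n, seed):
    """Synthetic labelled data: the label is 1 when the feature exceeds 0.6.

    The 0.6 boundary is an arbitrary choice made for this illustration.
    """
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.6)) for x in xs]


def accuracy(threshold, data):
    """Fraction of examples that the rule 'x > threshold' classifies correctly."""
    correct = sum(int(x > threshold) == y for x, y in data)
    return correct / len(data)


def tune(train, validation, candidates):
    """Score every candidate on both splits and keep the best on validation."""
    history = []
    for threshold in candidates:
        train_acc = accuracy(threshold, train)
        val_acc = accuracy(threshold, validation)
        history.append((threshold, train_acc, val_acc))
    # Select on held-out data so the choice is not biased by the training split.
    best = max(history, key=lambda row: row[2])
    return best, history


data = make_data(200, seed=0)
train, validation = data[:150], data[150:]
candidates = [i / 10 for i in range(1, 10)]

best, history = tune(train, validation, candidates)
print("best (threshold, train accuracy, validation accuracy):", best)
```

In a real project each pass through the loop would retrain a far larger model, which is why the compute and data-handling demands discussed next become significant.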

In addition to data science expertise, developing an ML model also involves considerable IT and infrastructure skills. Huge data sets have to be aggregated, stored, moved, protected, and managed. Training and testing the models require very high levels of compute capacity and performance. 

Thus, one of the first challenges in accelerating the ML development lifecycle is to abstract the infrastructure layer from the data science. In much the same way that DevOps freed developers from infrastructure issues, allowing them to concentrate on application development, a simple and easy-to-use research environment is necessary to allow data scientists to focus on model development rather than infrastructure provisioning and monitoring.

Other challenges are related to an inherent disconnect between data scientists and the engineers who must operationalize the models in production-ready applications. Each group works in its own silo, with its own unique mindset, concepts, processes, and tool stacks. In many cases, the engineering team may have difficulty simply understanding the model handed off to them by the data scientists. And once the model is in production, it is tough for the operations team to understand which metrics and parameters need to be tracked in order to effectively monitor accuracy and performance. It’s also not easy to establish the critical feedback loop for the data science team to be able to continue improving the inference model while ensuring that the updated models don’t have a negative impact on application performance.
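One way to picture the monitoring and feedback loop mentioned above is a rolling accuracy check that flags a deployed model for review when its recent accuracy falls below an agreed floor. The Python sketch below is a minimal, hypothetical example: the class name AccuracyMonitor, the window size, and the 0.9 floor are assumptions, and a production system would typically track many more metrics.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions (hypothetical helper)."""

    def __init__(self, window=100, min_accuracy=0.9):
        # A bounded deque silently drops the oldest outcome once it is full.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether a prediction matched the label that arrived later."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None before any data arrives."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True once the window is full and accuracy is below the floor."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)

if monitor.needs_review():
    print("Recent accuracy is low; notify the data science team.")
```

Waiting for a full window before raising a flag is a design choice that avoids noisy alerts right after deployment, at the cost of a slower first signal.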

Closing the Loop with Machine Learning Operations

MLOps (Machine Learning Operations) is a relatively new discipline that seeks to systematize the entire ML lifecycle, from science to production. It started as a set of best practices for improving communication between data scientists and DevOps teams, promoting workflows and processes that could accelerate the time to market for ML applications. Soon, open source MLOps frameworks, such as MLflow and Kubeflow, began to emerge.

Today, MLOps capabilities are considered a key requirement for Data Science and Machine Learning (DSML) platforms. Gartner’s “2020 Magic Quadrant for Data Science and Machine Learning Platforms” cites MLOps as a key inclusion criterion, noting that “…[a]s DSML moves out of the lab and into the mainstream, it must be operationalized with seamless integration and carefully designed architecture and processes. Machine learning operations capabilities should also include explainability, versioning of models and business impact analysis, among others.” (Source: A report reprint, available to Gartner subscribers only.)

As shown in Figure 1 below, the next-generation data science lifecycle breaks down the silos among all the different stakeholders that need to be involved for ML projects to capture business value. This starts with the modeling and data acquisition activities of the data science team being informed by a clear understanding of the business objectives for the ML application—as well as of the governance and compliance issues that should be taken into account. The MLOps model then ensures that the data science, production, and operations teams work seamlessly together across workflows that are as automated as possible, ensuring smooth deployments and effective ongoing monitoring. Performance issues, as well as new production data, are reflected back to the data science team so that they can tune and improve the model, which is then thoroughly tested by the operations team before being put into production.

Figure 1: MLOps Drives Data Science Success and Value. (Source: Azure)

In short, MLOps is the critical missing link that allows IT to support the highly specialized infrastructure requirements of ML. The cyclical, highly automated MLOps approach:

  • Reduces the time and complexity of moving models into production.
  • Enhances communication and collaboration across teams that are often siloed: data science, development, and operations.
  • Streamlines the interface between R&D processes and infrastructure, in general, and operationalizes the use of specialized hardware accelerators (such as GPUs), in particular.
  • Operationalizes model issues critical to long-term application health, such as versioning, tracking, and monitoring.
  • Makes it easier to monitor and understand ML infrastructure and compute costs at all stages, from development to production. 
  • Standardizes the ML process and makes it more auditable for regulation and governance purposes.
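
To make the versioning and tracking points above more concrete, here is a minimal in-memory sketch of a model registry with a gated promotion step. It is a hypothetical illustration only: ModelVersion, ModelRegistry, the stage names, and the "accuracy must improve" rule are assumptions, and it does not reproduce the API of MLflow, Kubeflow, or any other framework.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    """One tracked model version with its training parameters and metrics."""
    version: int
    params: dict
    metrics: dict
    stage: str = "staging"  # assumed stages: staging, production, archived


class ModelRegistry:
    """Keeps every version so that any deployment can be audited later."""

    def __init__(self):
        self.versions = []

    def register(self, params, metrics):
        """Record a new version; version numbers start at 1."""
        entry = ModelVersion(len(self.versions) + 1, dict(params), dict(metrics))
        self.versions.append(entry)
        return entry

    def production(self):
        """Return the version currently in production, if any."""
        for entry in self.versions:
            if entry.stage == "production":
                return entry
        return None

    def promote(self, version, metric="accuracy"):
        """Promote a version only if it beats the current production model."""
        candidate = self.versions[version - 1]
        current = self.production()
        if current is not None:
            if candidate.metrics[metric] <= current.metrics[metric]:
                return False  # gate: no improvement, keep the current model
            current.stage = "archived"
        candidate.stage = "production"
        return True


registry = ModelRegistry()
registry.register({"learning_rate": 0.1}, {"accuracy": 0.90})
registry.register({"learning_rate": 0.05}, {"accuracy": 0.88})
print(registry.promote(1), registry.promote(2))
```

Because no version is ever deleted, the registry doubles as an audit trail of which parameters and metrics were behind each production model.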

Stay Ahead of the ML Curve

In today’s highly competitive economy, enterprises are looking to artificial intelligence in general, and to machine learning and deep learning in particular, to transform big data into actionable insights. Those insights can help them better address their target audiences, improve their decision-making processes, and streamline their supply chains and production processes, to mention just a few of the many use cases. To stay ahead of the curve and capture the full value of ML, however, companies must strategically embrace MLOps.

Run:AI’s AI/ML virtualization platform is an important enabler for MLOps teams. Focusing on deep learning neural network models that are particularly compute-intensive, Run:AI creates a pool of shared GPU and other compute resources that are provisioned dynamically to meet the needs of jobs in process. By abstracting workloads from the underlying infrastructure, organizations can embrace MLOps and allow data scientists to focus on models, while letting IT teams gain control and real-time visibility of compute resources across multiple sites, both on-premises and in the cloud.
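
The general idea of a shared pool of GPUs that is provisioned dynamically can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model and not Run:AI's implementation: the GpuPool class, its first-in, first-out queue, and the job names are all assumptions made for illustration.

```python
class GpuPool:
    """A toy shared GPU pool with a first-in, first-out waiting queue."""

    def __init__(self, total_gpus):
        self.total = total_gpus
        self.allocations = {}  # job name -> number of GPUs currently held
        self.queue = []        # (job name, GPUs requested) waiting to start

    @property
    def free(self):
        """GPUs not currently held by any running job."""
        return self.total - sum(self.allocations.values())

    def submit(self, job, gpus):
        """Start the job now if it fits and nothing is waiting, else queue it."""
        if gpus > self.total:
            raise ValueError("request exceeds the size of the pool")
        if not self.queue and gpus <= self.free:
            self.allocations[job] = gpus
            return True
        self.queue.append((job, gpus))
        return False

    def release(self, job):
        """Return a finished job's GPUs and start queued jobs that now fit."""
        self.allocations.pop(job)
        while self.queue and self.queue[0][1] <= self.free:
            name, gpus = self.queue.pop(0)
            self.allocations[name] = gpus


pool = GpuPool(total_gpus=4)
pool.submit("train-a", 3)   # starts immediately
pool.submit("train-b", 2)   # waits: only one GPU is free
pool.release("train-a")     # train-b starts as soon as GPUs are returned
print(pool.allocations, "free:", pool.free)
```

Even this toy version shows the benefit of pooling: the second job starts the moment capacity is returned, without anyone having to reassign hardware by hand.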

See for yourself how Run:AI can operationalize your data science projects, accelerating their journey from research to production.
