Machine Learning Workflow

Streamlining Your ML Pipeline

Machine learning workflows define which phases are implemented during a machine learning project. The typical phases include data collection, data pre-processing, building datasets, model training and refinement, evaluation, and deployment to production. You can automate some aspects of the machine learning operations workflow, such as model and feature selection phases, but not all.

While these steps are generally accepted as a standard, there is also room for change. When creating a machine learning workflow, you first need to define the project, and then find an approach that works. Don’t try to fit the model into a rigid workflow. Rather, build a flexible workflow that allows you to start small and scale up to a production-grade solution.

This is part of our series of articles about machine learning engineering.

Understanding the Machine Learning Workflow

A machine learning workflow defines the steps carried out during a particular machine learning implementation. Workflows vary by project, but the five basic phases below are typically included.

Gathering machine learning data

Gathering data is one of the most important stages of machine learning workflows. The quality of the data you collect sets an upper limit on how useful and accurate your project can be.

To collect data, you need to identify your sources and aggregate data from those sources into a single dataset. This could mean streaming data from Internet of Things sensors, downloading open source data sets, or constructing a data lake from assorted files, logs, or media.

Data pre-processing

Once your data is collected, you need to pre-process it. Pre-processing involves cleaning, verifying, and formatting data into a usable dataset. If you are collecting data from a single source, this may be a relatively straightforward process. However, if you are aggregating several sources, you need to make sure that data formats match, that the sources are equally reliable, and that any duplicates are removed.
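As a rough illustration of these steps, the sketch below merges two sources whose formats differ, drops an incomplete record, and removes a duplicate. The sources, field names, and the "first record per id wins" rule are assumptions made for the example, not part of any particular tool.

```python
from datetime import datetime

# Hypothetical raw records from two sources with mismatched formats.
SOURCE_A = [
    {"id": "1", "temp_c": "21.5", "date": "2021-03-01"},
    {"id": "2", "temp_c": "", "date": "2021-03-02"},  # missing reading
]
SOURCE_B = [
    {"id": 1, "temp_f": 70.7, "date": "03/01/2021"},  # same reading as id 1 above
    {"id": 3, "temp_f": 68.0, "date": "03/03/2021"},
]


def normalize_a(rec):
    """Convert a source-A record to the shared schema, or None if incomplete."""
    if rec["temp_c"] == "":
        return None
    return {
        "id": int(rec["id"]),
        "temp_c": float(rec["temp_c"]),
        "date": datetime.strptime(rec["date"], "%Y-%m-%d").date(),
    }


def normalize_b(rec):
    """Convert a source-B record (Fahrenheit, US-style dates) to the shared schema."""
    return {
        "id": int(rec["id"]),
        "temp_c": round((rec["temp_f"] - 32) * 5 / 9, 1),
        "date": datetime.strptime(rec["date"], "%m/%d/%Y").date(),
    }


def preprocess(source_a, source_b):
    """Clean, unify, and de-duplicate records; the first record seen per id wins."""
    cleaned = {}
    candidates = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]
    for rec in candidates:
        if rec is not None and rec["id"] not in cleaned:
            cleaned[rec["id"]] = rec
    return sorted(cleaned.values(), key=lambda r: r["id"])


dataset = preprocess(SOURCE_A, SOURCE_B)
for row in dataset:
    print(row)
```

In a real project the de-duplication rule would depend on which source you trust more, which is why reliability is checked alongside format.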

Building datasets

This phase involves breaking processed data into three datasets: training, validation, and test.

  • Training set: used to initially train the algorithm and teach it how to process information. The model learns its parameters from this set.
  • Validation set: used to estimate the accuracy of the model during development. This dataset is used to fine-tune model hyperparameters.
  • Test set: used to assess the accuracy and performance of the final model. Because the model never sees this set during training or tuning, it is meant to expose any issues or mistrainings in the model.
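The three-way split can be sketched in a few lines. The 70/15/15 proportions and the fixed seed below are assumptions chosen for illustration; suitable proportions depend on how much data you have.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must leave room for a test set")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


rows = list(range(100))
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the source data is ordered, for example by date or by class label. For time series, a chronological split is usually the safer choice.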

Training and refinement

Once you have datasets, you are ready to train your model. This involves feeding your training set to your algorithm so that it can learn the parameters it will use to make predictions.

Once training is complete, you can then refine the model using your validation dataset. This may involve modifying or discarding variables and includes a process of tweaking model-specific settings (hyperparameters) until an acceptable accuracy level is reached.
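The sketch below shows the train-then-refine loop on a deliberately tiny model: a one-dimensional ridge regression whose slope is the learned parameter and whose regularization strength is the hyperparameter. The synthetic data, the candidate values, and the function names are all assumptions for illustration.

```python
import random


def fit_ridge_slope(points, l2):
    """Closed-form 1-D ridge regression without intercept: w = Σxy / (Σx² + l2)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + l2)


def mse(points, w):
    """Mean squared error of the prediction y ≈ w * x."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)


# Synthetic data (an assumption for this sketch): y = 3x plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.uniform(-2, 2)
    data.append((x, 3.0 * x + rng.gauss(0, 1.0)))
train_set, val_set = data[:40], data[40:]

# Refinement loop: fit once per candidate hyperparameter on the training set,
# then score each fitted model on the validation set and keep the best one.
best = None
for l2 in (0.0, 0.1, 1.0, 10.0, 100.0):
    w = fit_ridge_slope(train_set, l2)
    val_error = mse(val_set, w)
    if best is None or val_error < best["val_error"]:
        best = {"l2": l2, "w": w, "val_error": val_error}

print(best)
```

The training set is used only to fit the slope, while the validation set is used only to choose between candidate settings. Keeping those roles separate is what makes the later test-set score meaningful.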

Machine learning evaluation

Finally, after an acceptable set of hyperparameters is found and your model accuracy is optimized, you can test your model. Testing uses your test dataset, which the model has not seen during training or refinement, and is meant to verify that the model performs well on new data. Based on the results, you may return to training the model to improve accuracy, adjust output settings, or deploy the model as needed.
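For a binary classifier, the evaluation step often comes down to comparing test-set predictions against the true labels. The following is a minimal sketch with made-up labels and predictions; accuracy, precision, and recall are common choices, but the right metrics depend on your project goals.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test-set labels and model predictions.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(y_test, y_pred)
print(metrics)
```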

What Are the Machine Learning Best Practices for Efficient Workflows?

When defining the workflow for your machine learning project, there are several best practices you can apply. Below are a few to start with.

Define the project

Carefully define your project goals before starting to ensure your models add value to a process rather than redundancy. When defining your project, consider the following aspects:

  • What is your current process—typically models are designed to replace an existing process. Understanding how the existing process works, what its goals are, who performs it, and what counts as success are all important. Understanding these aspects lets you know what roles your model needs to fill, what restrictions might exist in implementation, and what criteria the model needs to meet or exceed.
  • What do you want to predict—carefully defining what you want to predict is key to understanding what data you need to collect and how models should be trained. You want to be as detailed as possible with this step and make sure to quantify results. If your goals aren’t measurable you’ll have a hard time ensuring that each is met.
  • What are your data sources—evaluate what data your current process relies on, how it’s collected and in what volume. From those sources, you should determine what specific data types and points you need to form predictions.

Find an approach that works

The goal of implementing machine learning workflows is to improve the efficiency and/or accuracy of your current process. To find an approach that achieves this goal you need to:

  • Research—before implementing an approach, you should spend time researching how other teams have implemented similar projects. You may be able to borrow methods they used or learn from their mistakes, saving yourself time and money.
  • Experiment: whether you have found an existing approach to start from or created your own, you need to experiment with it. In practice, this means iterating through the training, refinement, and evaluation phases described above.

Build a full-scale solution

When developing your approach, your end result is typically a proof-of-concept. However, you need to be able to translate this proof into a functional product to meet your end goal. To transition from proof to deployable solution, you need the following:

  • A/B testing—enables you to compare your current model with the existing process. This can confirm or deny whether your model is effective and able to add value to your teams and users.
  • Machine learning API—creating an API for your model implementation is what enables it to communicate with data sources and services. This accessibility is especially important if you plan to offer your model as a machine learning service.
  • User-friendly documentation—includes documentation of code, methods, and how to use the model. If you want to create a marketable product it needs to be clear to users how they can leverage the model, how to access its results, and what kind of results they can expect.
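One simple way to read the results of an A/B test is a two-proportion z-test on the success rates of the two arms. The sketch below assumes each case is scored as a success or a failure, and the counts are invented for illustration. It is one possible analysis, not the only valid one.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one success rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical outcome: arm A is the existing process, arm B is the model.
z, p_value = two_proportion_z_test(success_a=400, n_a=1000, success_b=460, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the model and the existing process is unlikely to be noise. Whether the difference is large enough to justify deployment is a separate business question.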

Automating Machine Learning Workflows

Automating machine learning workflows enables teams to perform some of the repetitive tasks involved in model development more efficiently. There are many modules and an increasing number of platforms for this, sometimes referred to as AutoML.

What is Automated Machine Learning?

AutoML essentially applies existing machine learning algorithms to the development of new models. Its purpose is not to automate the entire process of model development. Instead, it is to reduce the number of interventions that humans must make to ensure successful development.

AutoML helps developers get started with and complete projects significantly faster. It also has the potential to improve deep learning and unsupervised machine learning training processes, possibly enabling self-correction in developed models.

What Can You Automate?

While it would be great to be able to automate all aspects of machine learning operations, this currently isn’t possible. What can be reliably automated includes:

  • Hyperparameter optimization—uses algorithms like grid search, random search, and Bayesian methods to test combinations of pre-defined parameters and find the optimal combination.
  • Model selection—the same dataset is run through multiple models with default hyperparameters to determine which is best suited to learn from your data.
  • Feature selection—tools select the most relevant features from pre-determined sets of features.
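Grid search and random search, two of the hyperparameter optimization methods listed above, are simple enough to sketch directly. Here `toy_validation_score` is a hypothetical stand-in for "train a model with these settings and return its validation score", and the parameter names are assumptions.

```python
import itertools
import random


def grid_search(objective, grid):
    """Evaluate every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(objective, grid, n_trials, seed=0):
    """Evaluate `n_trials` randomly sampled combinations from `grid`."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(learning_rate, depth):
    """Hypothetical objective that peaks at learning_rate=0.1 and depth=4."""
    return -(learning_rate - 0.1) ** 2 - 0.01 * (depth - 4) ** 2


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
grid_best = grid_search(toy_validation_score, grid)
random_best = random_search(toy_validation_score, grid, n_trials=5)
print(grid_best)
print(random_best)
```

Grid search is exhaustive, so its cost grows multiplicatively with each added parameter. Random search caps the number of trials, trading a guarantee of finding the best grid point for a fixed budget.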

3 Frameworks You Can Use to Automate Machine Learning Workflows

Below are three frameworks you can use to get started with machine learning automation.


Featuretools

Featuretools is an open source framework that you can use to automate feature engineering. You can use it to transform structured temporal and relational datasets using a Deep Feature Synthesis algorithm. This algorithm uses primitives (operations such as sum, mean, or count) to aggregate or transform data into usable features. This framework is based on a project created by Max Kanter and Kalyan Veeramachaneni at MIT, called Data Science Machine.
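To make the idea of aggregation primitives concrete, here is a hand-rolled sketch that applies a few of them to a parent and child table. It does not use the Featuretools API. The tables, column names, and the `aggregate_features` helper are hypothetical and only mirror the general idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: one parent table and one child table.
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# Aggregation primitives: each maps a list of child values to one feature value.
PRIMITIVES = {"SUM": sum, "MEAN": mean, "MAX": max, "COUNT": len}


def aggregate_features(parents, children, key, column):
    """Build one feature per primitive for every parent row."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(child[column])
    rows = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        row = dict(parent)
        for name, primitive in PRIMITIVES.items():
            # Parents without child rows get None for every feature.
            row[f"{name}({column})"] = primitive(values) if values else None
        rows.append(row)
    return rows


features = aggregate_features(customers, transactions, "customer_id", "amount")
for row in features:
    print(row)
```

Automated feature engineering tools apply this kind of operation systematically across many tables and columns, and can stack primitives to build deeper features.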


DataRobot

DataRobot is a proprietary platform you can use to perform automated data preparation, feature engineering, model selection, training, testing, and deployment. You can use it to find new data sources, apply business rules, or regroup and reshape data.

The DataRobot platform includes a library of open source and proprietary models you can use to base your own model implementation on. It also includes a dashboard with visualizations that you can use to evaluate your model and understand predictions.


tsfresh

tsfresh is an open source Python module you can use to calculate and extract characteristics from time series data. The extracted features are returned in tabular form (a pandas DataFrame), so you can pass them to scikit-learn or other libraries for model training.
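The sketch below illustrates what extracting characteristics from a time series means in practice, using only the Python standard library. It is not the tsfresh API. The feature names and the `extract_series_features` helper are assumptions, and a dedicated library computes far more features than these.

```python
from statistics import mean, pstdev


def extract_series_features(values):
    """Summarize one time series (at least two points) as a flat dict of features."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    # A peak is a point strictly greater than both of its neighbours.
    peaks = [b for a, b, c in zip(values, values[1:], values[2:]) if b > a and b > c]
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "mean_abs_change": mean([abs(d) for d in diffs]),
        "n_peaks": len(peaks),
    }


# Hypothetical sensor readings keyed by series id.
series_by_id = {
    "sensor_a": [1.0, 3.0, 2.0, 5.0, 4.0],
    "sensor_b": [2.0, 2.0, 2.0, 2.0, 2.0],
}
feature_table = {sid: extract_series_features(v) for sid, v in series_by_id.items()}
for sid, feats in feature_table.items():
    print(sid, feats)
```

Each series, whatever its length, becomes one fixed-length row of numbers, which is the shape most standard training algorithms expect.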

Machine Learning Workflow Automation With Run:AI

Machine learning workflows define the entire machine learning cycle. While the tools mentioned above help with automating some parts of the ML lifecycle, such as data preparation, they are not built to automate resource allocation and job scheduling. If resource allocation is not properly configured and optimized, you can quickly hit compute or memory bottlenecks.

You can avoid these issues by replacing static allocation and provisioning with automated and dynamic resource management. This capability is enabled by virtualization software like Run:AI, which automates resource management for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed.

Here are some of the capabilities you gain when using Run:AI:

  • To ensure visibility and efficient resource sharing, you can pool GPU compute resources.
  • To avoid bottlenecks, you can set up GPU guaranteed quotas.
  • To gain better control, you can dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning workflows, helping data scientists accelerate their productivity and the quality of their models. Learn more about the Run:AI platform.