Machine learning workflows define which phases are implemented during a machine learning project. The typical phases include data collection, data pre-processing, building datasets, model training and refinement, evaluation, and deployment to production. You can automate some aspects of the machine learning operations workflow, such as model and feature selection phases, but not all.
While these steps are generally accepted as a standard, there is also room for change. When creating a machine learning workflow, you first need to define the project, and then find an approach that works. Don’t try to fit the model into a rigid workflow. Rather, build a flexible workflow that allows you to start small and scale up to a production-grade solution.
This is part of our series of articles about machine learning engineering.
In this article, you will learn about the typical phases of a machine learning workflow, best practices for defining your own workflow, and tools that can help automate parts of the process.
Machine learning workflows define the steps carried out during a particular machine learning implementation. Workflows vary by project, but a few core phases are typically included.
Gathering machine learning data
Gathering data is one of the most important stages of a machine learning workflow. During data collection, the quality of the data you gather determines the potential usefulness and accuracy of your entire project.
To collect data, you need to identify your sources and aggregate data from those sources into a single dataset. This could mean streaming data from Internet of Things sensors, downloading open source data sets, or constructing a data lake from assorted files, logs, or media.
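As a minimal sketch of this step, the snippet below merges two hypothetical sources (a CRM export and an event log) into a single dataset with pandas; the file names and columns are illustrative assumptions, not part of any specific pipeline.

```python
import pandas as pd

# Hypothetical sources: a CRM export (one row per customer) and an event log dump
crm = pd.read_csv("crm_export.csv")
events = pd.read_json("events.json", lines=True)

# Aggregate the event log to the customer level before joining
event_counts = (
    events.groupby("customer_id")
          .size()
          .rename("event_count")
          .reset_index()
)

# Combine both sources into a single dataset keyed on customer_id
dataset = crm.merge(event_counts, on="customer_id", how="left")
dataset.to_parquet("raw_dataset.parquet")
```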
Data pre-processing
Once your data is collected, you need to pre-process it. Pre-processing involves cleaning, verifying, and formatting data into a usable dataset. If you are collecting data from a single source, this may be a relatively straightforward process. However, if you are aggregating several sources, you need to make sure that data formats match, that the data is equally reliable, and that any potential duplicates are removed.
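The following sketch shows what this could look like with pandas, continuing the hypothetical customer dataset from the previous snippet; the column names are assumptions for illustration.

```python
import pandas as pd

df = pd.read_parquet("raw_dataset.parquet")

# Harmonize formats that may differ between sources
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["country"] = df["country"].str.strip().str.upper()

# Basic verification: drop rows missing the fields the model depends on
df = df.dropna(subset=["customer_id", "signup_date"])

# Remove duplicates introduced by aggregating several sources
df = df.drop_duplicates(subset="customer_id", keep="first")

df.to_parquet("clean_dataset.parquet")
```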
Building datasets
This phase involves breaking the processed data into three datasets: a training set used to teach the model, a validation set used to refine and tune it, and a test set held back for the final evaluation.
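A common way to produce the three datasets is two passes of scikit-learn's train_test_split, as in the hedged sketch below; the "churned" target column and the 60/20/20 proportions are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("clean_dataset.parquet")   # hypothetical pre-processed data
X = df.drop(columns=["churned"])                # "churned" is an assumed target column
y = df["churned"]

# Split off the test set first, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)
# Result: roughly 60% training, 20% validation, 20% test
```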
Training and refinement
Once you have datasets, you are ready to train your model. This involves feeding your training set to your algorithm so that it can learn appropriate parameters and features used in classification.
Once training is complete, you can then refine the model using your validation dataset. This may involve modifying or discarding variables and includes a process of tweaking model-specific settings (hyperparameters) until an acceptable accuracy level is reached.
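As an illustrative sketch of this loop, the code below trains a random forest on the training set and compares a few hyperparameter settings against the validation set, reusing the hypothetical split from the previous snippet; the specific model and grid are assumptions, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

best_model, best_score = None, 0.0

# Try a few hyperparameter settings and keep the one that scores best on the validation set
for n_estimators in (100, 300):
    for max_depth in (5, 10, None):
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        model.fit(X_train, y_train)                           # learn parameters from the training set
        score = accuracy_score(y_val, model.predict(X_val))   # refine against the validation set
        if score > best_score:
            best_model, best_score = model, score

print(f"Best validation accuracy: {best_score:.3f}")
```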
Machine learning evaluation
Finally, after an acceptable set of hyperparameters is found and your model accuracy is optimized, you can test your model. Testing uses your test dataset and is meant to verify that the model performs well on data it did not see during training or refinement. Based on the feedback you receive, you may return to training the model to improve accuracy, adjust output settings, or deploy the model as needed.
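Continuing the same hypothetical example, the final check might look like this, scoring the selected model only on the held-back test set.

```python
from sklearn.metrics import accuracy_score, classification_report

# Evaluate once, on data the model never saw during training or refinement
test_predictions = best_model.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
```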
When defining the workflow for your machine learning project, there are several best practices you can apply. Below are a few to start with.
Define the project
Carefully define your project goals before starting, to ensure your models add value to a process rather than duplicating it. When defining your project, consider the following aspects:
Find an approach that works
The goal of implementing machine learning workflows is to improve the efficiency and/or accuracy of your current process. To find an approach that achieves this goal you need to:
Build a full-scale solution
When developing your approach, your end result is typically a proof-of-concept. However, you need to be able to translate this proof into a functional product to meet your end goal. To transition from proof to deployable solution, you need the following:
Automating machine learning workflows enables teams to perform some of the repetitive tasks involved in model development more efficiently. Many modules, and an increasing number of platforms, support this approach, which is often referred to as AutoML.
AutoML essentially applies existing machine learning algorithms to the development of new models. Its purpose is not to automate the entire process of model development. Instead, it is to reduce the number of interventions that humans must make to ensure successful development.
AutoML helps developers start and complete projects significantly faster. It also has the potential to improve deep learning and unsupervised machine learning training processes, and may eventually enable self-correction in developed models.
While it would be great to be able to automate all aspects of machine learning operations, this currently isn’t possible. What can be reliably automated includes tasks such as model selection, feature selection, and hyperparameter tuning.
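As one small illustration of automating these steps (not a full AutoML system), the sketch below wires automated feature selection and hyperparameter search into a single scikit-learn pipeline, reusing the hypothetical training data from the earlier snippets.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Feature selection and hyperparameter choice are searched automatically over a small grid
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "select__k": [5, 10, "all"],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
```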
Below are three frameworks you can use to get started with machine learning automation.
Featuretools
Featuretools is an open source framework that you can use to automate feature engineering. You can use it to transform structured temporal and relational datasets using a Deep Feature Synthesis algorithm. The algorithm uses primitives (operations such as sum or mean) to aggregate or transform data into usable features. The framework is based on Data Science Machine, a project created by Max Kanter and Kalyan Veeramachaneni at MIT.
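A minimal sketch of Deep Feature Synthesis is shown below, assuming a recent (1.x) version of Featuretools and a hypothetical transactions table; the file, column, and dataframe names are illustrative assumptions.

```python
import pandas as pd
import featuretools as ft

# Hypothetical transaction-level data: customer_id, transaction_id, amount, timestamp
transactions = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

es = ft.EntitySet(id="retail")
es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions,
    index="transaction_id",
    time_index="timestamp",
)
# Derive a customers dataframe so features can be aggregated per customer
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="customers",
    index="customer_id",
)

# Deep Feature Synthesis stacks primitives such as sum and mean into candidate features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
)
```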
DataRobot
DataRobot is a proprietary platform you can use to perform automated data preparation, feature engineering, model selection, training, testing, and deployment. You can use it to find new data sources, apply business rules, or regroup and reshape data.
The DataRobot platform includes a library of open source and proprietary models that you can use as the basis for your own model implementation. It also includes a dashboard with visualizations that help you evaluate your model and understand its predictions.
tsfresh
tsfresh is an open source Python module you can use to calculate and extract characteristics from time series data. It enables you to extract features that can then be used with pandas and scikit-learn for model training.
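The sketch below shows the basic tsfresh calls, assuming a hypothetical long-format table of sensor readings; the file and column names are illustrative assumptions.

```python
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# Hypothetical long-format time series: one row per (machine_id, timestamp, value)
readings = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Compute a broad set of time series characteristics per machine_id
features = extract_features(readings, column_id="machine_id", column_sort="timestamp")

# Replace NaN/inf values produced by features that are undefined for some series
impute(features)

# Optionally keep only features relevant to a known label, one per machine_id
# labels = pd.Series(...)            # hypothetical target aligned with features.index
# relevant = select_features(features, labels)
```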
Machine learning workflows define the entire machine learning cycle. While the tools mentioned above help with automating some parts of the ML lifecycle, such as data preparation, they are not built to automate resource allocation and job scheduling. If resource allocation is not properly configured and optimized, you can quickly hit compute or memory bottlenecks.
You can avoid these issues by replacing static allocation and provisioning with automated and dynamic resource management. This capability is enabled by virtualization software like Run:AI, which automates resource management for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning workflows, helping data scientists accelerate their productivity and the quality of their models. Learn more about the Run:AI platform.
We have authored in-depth guides on several other artificial intelligence infrastructure topics that can also be useful as you explore the world of deep learning GPUs.
Learn how to assess GPUs to determine which is the best GPU for your deep learning model. Discover types of consumer and data center deep learning GPUs. Get started with PyTorch for GPUs – learn how PyTorch supports NVIDIA’s CUDA standard, and get quick technical instructions for using PyTorch with CUDA. Finally, learn about the NVIDIA deep learning SDK, which NVIDIA GPUs are best for deep learning, and which best practices to adopt when using NVIDIA GPUs.
See the top articles in our GPU for Deep Learning guide.
This guide explains the Kubernetes Architecture for AI workloads and how K8s came to be used inside many companies. It covers specific considerations for implementing Kubernetes to orchestrate AI workloads. Finally, the guide addresses the shortcomings of Kubernetes when it comes to scheduling and orchestrating deep learning workloads, and how you can address those shortfalls.
See the top articles in our Kubernetes for AI guide.