AutoML Python

10 Open Source Libraries and a Quick Tutorial

What Is AutoML in Python?

AutoML, or automated machine learning, simplifies the process of applying machine learning to real-world problems. It automates the tedious work of selecting and tuning machine learning models, reducing the need for deep machine learning expertise.

Instead of manually experimenting with various algorithms and their hyperparameters, practitioners can use AutoML tools to find optimal solutions more efficiently. Python, being a leading programming language in data science and machine learning, offers several AutoML libraries.

These libraries vary in their approach to automating the machine learning pipeline, which includes data preprocessing, feature engineering, model selection, and hyperparameter tuning. They aim to make machine learning more accessible and accelerate the development of models.
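
To make the idea concrete, here is a minimal, library-free sketch of the loop an AutoML tool automates: build each candidate from a search space, score it on held-out data, and rank the results. The dataset, the two toy model families, and the search space are all made up for illustration; real libraries use far richer search strategies.

```python
# Toy AutoML loop: try several (model, hyperparameter) candidates and keep
# the one with the best validation accuracy. Everything here is illustrative.

def majority_class(train, k=None):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    winner = max(sorted(set(labels)), key=labels.count)
    return lambda x: winner

def knn(train, k):
    """k-nearest neighbours on a single numeric feature."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        votes = [y for _, y in nearest]
        return max(sorted(set(votes)), key=votes.count)
    return predict

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Hypothetical one-feature dataset: (petal_length, label)
train = [(1.4, "setosa"), (1.3, "setosa"), (1.5, "setosa"),
         (4.7, "versicolor"), (4.5, "versicolor"), (4.9, "versicolor")]
valid = [(1.6, "setosa"), (4.4, "versicolor"),
         (1.2, "setosa"), (5.0, "versicolor")]

# Search space: model family x hyperparameter value
search_space = [("majority_class", majority_class, None),
                ("knn", knn, 1),
                ("knn", knn, 3)]

leaderboard = []
for name, builder, k in search_space:
    model = builder(train, k)
    leaderboard.append((accuracy(model, valid), name, k))

leaderboard.sort(key=lambda entry: entry[0], reverse=True)  # best first
best_score, best_name, best_k = leaderboard[0]
print(best_name, best_k, best_score)
```

The libraries below differ mainly in how they generate candidates and how cleverly they decide which one to try next.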

Top 10 AutoML Python Libraries

Here are some of the most popular Python libraries for automating machine learning tasks.

H2O AutoML

H2O AutoML emphasizes ease of use and scalability. It automatically searches through possible models and preprocessing steps to find the most effective machine learning pipeline. H2O can handle large datasets and perform well on a variety of tasks without needing extensive tuning.

It provides an intuitive interface that allows users to start model training with a few lines of code. The library supports classification, regression, and time-series prediction, making it suitable for different applications. H2O AutoML also offers detailed reports on the performance of individual models, helping users understand their results better.

PyCaret

PyCaret is a low-code machine learning library that enables practitioners to perform end-to-end ML experiments with minimal effort. Its focus on simplicity and ease of use aims to make machine learning accessible to non-experts. PyCaret automates most of the machine learning workflow, from data preparation to model deployment.

The library supports various tasks, including classification, regression, clustering, and anomaly detection. It integrates with other Python libraries, allowing for custom workflows. PyCaret is suitable for prototyping and production, offering features like model comparison and blending.

Auto-sklearn

Auto-sklearn is an automated machine learning toolkit based on the popular scikit-learn library. It focuses on combining different algorithms and pre-processing methods to find the best model for a given dataset. Auto-sklearn employs Bayesian optimization, meta-learning, and ensemble methods to achieve high performance.

This library automates model and hyperparameter selection. It aims to reduce the time and effort needed to produce competitive machine learning models. Auto-sklearn is particularly effective for users who are familiar with scikit-learn.
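
The ensemble step can be illustrated with a small sketch of greedy ensemble selection, the general technique of repeatedly adding whichever model most improves the averaged predictions on validation data. This is a simplified, library-free illustration and not Auto-sklearn's actual code; the model names and probabilities are made up.

```python
# Greedy ensemble selection (sketch): repeatedly add the model whose
# inclusion gives the best validation accuracy of the averaged predictions.
# Models may be picked more than once, which acts as a crude weighting.

# Hypothetical validation-set probabilities of the positive class
val_probs = {
    "model_a": [0.9, 0.8, 0.3, 0.2, 0.1, 0.6],
    "model_b": [0.4, 0.9, 0.8, 0.7, 0.2, 0.1],
    "model_c": [0.8, 0.3, 0.9, 0.1, 0.6, 0.2],
}
y_true = [1, 1, 1, 0, 0, 0]

def ensemble_accuracy(members):
    """Accuracy of the averaged probabilities, thresholded at 0.5."""
    correct = 0
    for i, label in enumerate(y_true):
        avg = sum(val_probs[m][i] for m in members) / len(members)
        correct += int((avg >= 0.5) == bool(label))
    return correct / len(y_true)

ensemble, best = [], 0.0
for _ in range(5):  # cap on ensemble size
    scored = [(ensemble_accuracy(ensemble + [m]), m) for m in val_probs]
    score, candidate = max(scored, key=lambda s: s[0])
    if score <= best:
        break  # no candidate improves the ensemble any further
    ensemble.append(candidate)
    best = score

print(ensemble, best)
```

In this toy data each model alone gets four of six validation rows right, while the selected pair gets all six, which is the effect ensembling is meant to capture.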

MLBox

MLBox is an AutoML library that offers pre-processing, optimization, and prediction capabilities. It is designed for efficiency, capable of handling large datasets and performing feature selection automatically. MLBox's optimization engine aims to find optimal models quickly.

The library supports various regression and classification tasks and provides tools for hyperparameter tuning and model evaluation. It also focuses on model prediction and interpretability.

TPOT

TPOT, short for Tree-based Pipeline Optimization Tool, leverages genetic algorithms to automate the design of machine learning pipelines. It explores a range of possible pipelines to find the one that performs best on a given task. TPOT is designed to discover complex patterns and generate optimized pipelines.

This library emphasizes the evolution of pipelines over time, aiming to produce models that offer superior performance. TPOT is well-suited for data scientists looking to automate pipeline selection and optimization while focusing on data analysis and interpretation.
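
As a rough illustration of evolutionary pipeline search, the sketch below evolves a population of three-step pipelines through selection, crossover, and mutation. It is not TPOT's implementation: the step names are hypothetical and the fitness function is a made-up lookup table standing in for a cross-validated score.

```python
import random

# Toy genetic search over pipelines. A pipeline is a (scaler, selector, model)
# tuple; the option names and the fitness table below are made up.
OPTIONS = [
    ["none", "standard", "minmax"],   # scaler
    ["all_features", "top_k"],        # feature selector
    ["tree", "forest", "logistic"],   # model
]

def fitness(pipeline):
    """Stand-in for a cross-validated score; a real tool would train here."""
    bonus = {"standard": 0.05, "minmax": 0.03, "top_k": 0.02,
             "forest": 0.10, "logistic": 0.06}
    return 0.80 + sum(bonus.get(step, 0.0) for step in pipeline)

def mutate(pipeline, rng):
    """Replace one randomly chosen step with a random option for that slot."""
    i = rng.randrange(len(pipeline))
    child = list(pipeline)
    child[i] = rng.choice(OPTIONS[i])
    return tuple(child)

def crossover(a, b, rng):
    """Take each step from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def evolve(generations=10, population_size=6, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.choice(opts) for opts in OPTIONS)
                  for _ in range(population_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # keep the fittest half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children  # elitism: parents survive unchanged
    return max(population, key=fitness), history

best_pipeline, history = evolve()
print(best_pipeline, round(fitness(best_pipeline), 2))
```

Because the fittest half always survives, the best score never gets worse from one generation to the next; the random children are what let the search discover better combinations.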

AutoKeras

AutoKeras simplifies deep learning by automating the design and tuning of neural networks. It uses neural architecture search (NAS) to find the best model architecture for a specific problem. AutoKeras is designed with ease of use in mind, targeting researchers and practitioners without deep learning expertise.

The library supports image classification, text classification, and regression tasks. AutoKeras aims to incorporate advances in neural architecture search as they emerge, making it a handy tool for automated deep learning.

AutoGluon

AutoGluon aims to provide high-quality machine learning models with minimal user intervention. It automates feature engineering, model selection, and hyperparameter tuning, enabling it to handle tabular, image, and text data. AutoGluon emphasizes performance and speed, producing ML models quickly.

The library is accessible to both novices and experts, offering advanced configuration options for experienced users. AutoGluon's approach to automation enables rapid development and deployment of machine learning models, suitable for a range of applications.

Auto-ViML

Auto-ViML, short for Automated Variant Interpretable Machine Learning, focuses on creating interpretable models. It automates the process of feature engineering, model selection, and hyperparameter tuning, with a strong emphasis on producing models that are easy to understand and explain.

The library supports classification and regression tasks and provides detailed reports on model performance. Auto-ViML is intended for users who prioritize model interpretability, helping them make informed decisions based on transparent and understandable models.

EvalML

EvalML is an AutoML library that automates the process of building, optimizing, and evaluating machine learning pipelines. It handles categorical, numeric, and text data, providing support for various ML tasks. EvalML's distinctive feature is its focus on objective-driven optimization, guiding the search for the best pipeline based on specific project goals.

The library includes features for model understanding and diagnostics, facilitating the analysis of model behavior and performance. EvalML is designed for flexibility, catering to rapid experimentation and the development of deployable models.
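
The effect of objective-driven optimization is easy to see in a small, library-free sketch: the same candidate models can rank differently depending on the objective being optimized. The confusion counts, cost figures, and the `net_savings` objective below are hypothetical and do not use EvalML's API.

```python
# Same candidate models, different objectives, different winners.
# Confusion counts are made up for a fraud-style problem with rare positives.
candidates = {
    "model_a": {"tp": 30, "fp": 5, "fn": 20, "tn": 945},
    "model_b": {"tp": 45, "fp": 60, "fn": 5, "tn": 890},
}

def accuracy(c):
    """Share of all cases classified correctly."""
    return (c["tp"] + c["tn"]) / sum(c.values())

def recall(c):
    """Share of actual positives that were caught."""
    return c["tp"] / (c["tp"] + c["fn"])

def net_savings(c, fraud_loss=500, review_cost=20):
    """Hypothetical business objective: loss avoided on caught fraud
    minus the cost of manually reviewing every flagged case."""
    return c["tp"] * fraud_loss - (c["tp"] + c["fp"]) * review_cost

def best_by(objective):
    """Name of the candidate that maximizes the given objective."""
    return max(candidates, key=lambda name: objective(candidates[name]))

for objective in (accuracy, recall, net_savings):
    print(objective.__name__, "->", best_by(objective))
```

Here accuracy favors the conservative model, while recall and the cost-based objective both favor the model that catches more fraud despite raising more false alarms.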

FLAML

FLAML, short for Fast and Lightweight AutoML, is designed to be efficient and lightweight while delivering high-quality machine learning models. It optimizes machine learning pipelines with a focus on speed and resource efficiency. FLAML performs well on a range of ML tasks.

This library is easy to use, requiring minimal configuration to get started. FLAML's efficiency makes it useful for scenarios where computational resources are limited, offering a pragmatic approach to AutoML without sacrificing performance.
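
One way to picture a resource-aware search is a loop that tries cheap configurations first and stops once a fixed budget is spent. The sketch below is a simplified illustration of that idea and not FLAML's actual algorithm; the candidate names, costs, and scores are made up, and in practice a score is only known after the model has been trained.

```python
# Budget-aware search (sketch): try cheap configurations first and stop
# when the budget is spent. Costs and scores are made-up numbers.
candidates = [
    # (name, estimated_cost, validation_score)
    ("big_forest", 50, 0.93),
    ("small_tree", 2, 0.85),
    ("linear", 1, 0.82),
    ("boosted_100", 10, 0.91),
    ("boosted_1000", 80, 0.94),
]

def budgeted_search(candidates, budget):
    """Evaluate candidates in order of cost until the budget runs out."""
    spent, best, tried = 0, None, []
    for name, cost, score in sorted(candidates, key=lambda c: c[1]):
        if spent + cost > budget:
            break  # the next-cheapest candidate no longer fits
        spent += cost
        tried.append(name)
        if best is None or score > best[1]:
            best = (name, score)
    return best, tried, spent

small = budgeted_search(candidates, budget=20)
large = budgeted_search(candidates, budget=100)
print(small)
print(large)
```

With a budget of 20 units the search settles on a mid-sized model after three cheap trials; a larger budget lets it reach a more expensive, slightly better one.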

Tutorial: Automated Machine Learning in Python with H2O

Here is an overview of how to get started with automated machine learning in Python using the H2O AutoML library.

Installing the H2O Library

To begin using H2O's AutoML capabilities in Python, install the H2O library with the pip package manager. H2O runs on the Java Virtual Machine, so a compatible Java runtime must also be installed on your machine. Run the following command in your terminal:

pip install h2o

Importing Libraries

After installing the H2O library, import it along with NumPy and Pandas, which are essential for data manipulation. The H2OAutoML class is specifically imported for the AutoML functionality:

import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML

This setup prepares your environment by loading the necessary libraries to process data and utilize H2O's AutoML features.

Data Loading

The next step is to load your data. In this tutorial, we’ll use the Iris dataset as an example. First, load the dataset from a URL into a Pandas DataFrame. Then, initialize the H2O server and convert the Pandas DataFrame into an H2O Frame, which is necessary for processing with H2O:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
data = pd.read_csv(url, header=None, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])

Now you can convert the data to the H2O format:

h2o.init()
h2o_data = h2o.H2OFrame(data)

This process makes the Iris dataset ready for machine learning tasks with H2O.

Training the AutoML Model

To train the model, define the model's parameters, such as the maximum number of models to be trained and the seed for randomization. Then, specify the dataset along with the features and the target variable. Here is how to do it:

aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], y='class', training_frame=h2o_data)

This code snippet initiates the AutoML training process on the Iris dataset, aiming to discover the best performing model.

Comparing Models

After training, you can view the AutoML leaderboard to see the performance of the top ML models:

lb = aml.leaderboard
print(lb)

This command prints out the leaderboard, showing the models ranked by their performance.
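
For a multiclass target such as Iris, H2O AutoML sorts the leaderboard by mean per-class error by default, according to the H2O documentation (the `sort_metric` argument of `H2OAutoML` can change this). The library-free sketch below shows how that metric is computed and why it can separate two models with identical accuracy. The labels and model names are made up for illustration.

```python
from collections import Counter

def mean_per_class_error(y_true, y_pred):
    """Average of the per-class error rates; every class counts equally."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return sum(wrong[c] / totals[c] for c in totals) / len(totals)

# Made-up labels: 4 setosa, 4 versicolor, 2 virginica
y_true = ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 2

# Both hypothetical models make exactly one mistake (90% accuracy) ...
model_1 = ["setosa"] * 4 + ["versicolor"] * 3 + ["virginica"] * 3
model_2 = ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica", "versicolor"]

# ... but the mistake in model_2 hits the smallest class, so it costs more.
leaderboard = sorted(
    [("model_1", mean_per_class_error(y_true, model_1)),
     ("model_2", mean_per_class_error(y_true, model_2))],
    key=lambda row: row[1],  # lower error ranks first
)
for name, error in leaderboard:
    print(name, round(error, 4))
```

Lower values are better, so the model at the top of the leaderboard is the one with the smallest average error across classes.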

Testing the Model

Finally, to test the trained AutoML model, you can make predictions on new data. Here's an example of predicting the class of two new Iris flowers:

test_data = pd.DataFrame(np.array([[5.1, 3.5, 1.4, 0.2], [7.7, 3.0, 6.1, 2.3]]), columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
h2o_test_data = h2o.H2OFrame(test_data)
preds = aml.predict(h2o_test_data)
print(preds)

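The output is an H2O frame with one row per flower: a predict column containing the predicted class, followed by one probability column for each class.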

This demonstrates how to use the model to make predictions, providing insights into the potential class of each new data point.

Stay Ahead of the ML Curve with Run:ai

In today's highly competitive economy, enterprises are looking to artificial intelligence, and machine and deep learning in particular, to transform big data into actionable insights. These insights can help them better address their target audiences, improve their decision-making processes, and streamline their supply chains and production processes, to mention just a few of the many use cases. To stay ahead of the curve and capture the full value of ML, however, companies must strategically embrace MLOps.

Run:ai’s AI/ML virtualization platform is an important enabler for Machine Learning Operations teams. Focusing on deep learning neural network models that are particularly compute-intensive, Run:ai creates a pool of shared GPU and other compute resources that are provisioned dynamically to meet the needs of jobs in process. By abstracting workloads from the underlying infrastructure, organizations can embrace MLOps and allow data scientists to focus on models, while letting IT teams gain control and real-time visibility of compute resources across multiple sites, both on-premises and in the cloud.

See for yourself how Run:ai can operationalize your data science projects, accelerating their journey from research to production.