Bayesian Hyperparameter Optimization

The Basics and a Quick Tutorial

What Is Bayesian Hyperparameter Optimization?

Some hyperparameter tuning methods, such as Random Search and Grid Search, evaluate parameter combinations in isolation, without taking past results into account. Tuning with these approaches is often time-consuming, especially for a large parameter space: the more parameters you tune, the larger the search space becomes.

Standard hyperparameter tuning approaches also require training a model for every combination of hyperparameters, making predictions on validation data, and calculating validation metrics for each run.

Bayesian optimization, which tunes hyperparameters using Bayesian reasoning, helps reduce the time required to find a strong parameter set and can improve generalization performance on the test set. It works by taking previously evaluated hyperparameter combinations into account when choosing the next set of hyperparameters to evaluate.

How Bayesian Optimization Works

In Bayesian optimization, the hyperparameter values (the candidate points) are chosen randomly in the first iteration. After that, each new point is chosen by trading off between:

  • Active learning - choosing the point with the highest uncertainty in each iteration. This is also called exploration.
  • Best objective function - choosing a point from the region that currently has the best results. This is also called exploitation.

For example, in a maximization problem, at each iteration the Bayesian optimization method uses the results seen so far to either pick the point expected to achieve the highest result (exploitation), or the point with the highest uncertainty, and therefore the best potential to yield an even better result (exploration).

There are several ways for the method to choose, at each iteration, whether to pursue exploitation or exploration. Here are three common acquisition functions that can make this choice.

Upper Confidence Bound

With this function, the next selected point is the one with the highest upper confidence bound. Assuming a Gaussian process, this can be obtained as:

UCB(x)=μ(x)+κσ(x)

Here μ(x) is the posterior mean at x, σ(x) is the posterior standard deviation, and κ is an exploration parameter (larger values cause the function to perform more exploration).
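
As a minimal sketch (assuming a Gaussian process surrogate fitted with scikit-learn's GaussianProcessRegressor and a user-chosen κ; the function and variable names are illustrative), UCB can be computed over a set of candidate points like this:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def upper_confidence_bound(gp, X_candidates, kappa=2.0):
    # Posterior mean and standard deviation of the surrogate at each candidate point
    mu, sigma = gp.predict(X_candidates, return_std=True)
    return mu + kappa * sigma

# Example usage (hypothetical 1-D objective observed at a few points):
# gp = GaussianProcessRegressor().fit(X_observed, y_observed)
# X_candidates = np.linspace(0, 1, 100).reshape(-1, 1)
# next_x = X_candidates[np.argmax(upper_confidence_bound(gp, X_candidates))]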

Probability of Improvement (PI)

This function selects the next point as the one with the highest probability of improving on the current maximum of the objective function, taken over the previously evaluated points. This current maximum is denoted fmax. The function also assumes a Gaussian process:

PI(x)=Φ((μ(x)−fmax−ε)/σ(x))

The parameter ε controls the trade-off between exploration and exploitation; larger values favor exploration over exploitation.
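
A minimal sketch of PI, under the same assumptions as the UCB example above (a fitted Gaussian process surrogate, a user-chosen ε, and fmax being the best objective value observed so far):

from scipy.stats import norm

def probability_of_improvement(gp, X_candidates, f_max, epsilon=0.01):
    mu, sigma = gp.predict(X_candidates, return_std=True)
    # Φ is the standard normal CDF; the small constant avoids division by zero
    return norm.cdf((mu - f_max - epsilon) / (sigma + 1e-9))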

Expected Improvement (EI)

This function attempts to quantify the amount of improvement a new point is expected to deliver, and picks the point with the highest expected improvement. Again, it assumes a Gaussian process:

EI(x)=(μ(x)−fmax)Φ((μ(x)−fmax−ε)/σ(x))+σ(x)ϕ((μ(x)−fmax−ε)/σ(x))

As in the previous function, ε is used to trade off between exploration and exploitation, and Φ and ϕ denote the standard normal CDF and PDF, respectively.
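
Continuing the same illustrative setup, here is a minimal sketch of EI that mirrors the formula above:

from scipy.stats import norm

def expected_improvement(gp, X_candidates, f_max, epsilon=0.01):
    mu, sigma = gp.predict(X_candidates, return_std=True)
    z = (mu - f_max - epsilon) / (sigma + 1e-9)  # standardized improvement
    # The first term rewards points whose mean already beats fmax (exploitation);
    # the second term rewards points with high uncertainty (exploration)
    return (mu - f_max) * norm.cdf(z) + sigma * norm.pdf(z)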

Quick Tutorial: Bayesian Hyperparameter Optimization in scikit-learn

Here is a short tutorial on how to perform Bayesian hyperparameter optimization in Python using scikit-learn together with the open source libraries scikit-optimize and bayesian-optimization.

Step 1: Install Libraries

First, we will install the necessary libraries:

pip install scikit-learn

pip install scikit-optimize

pip install matplotlib

pip install bayesian-optimization

Step 2: Define Optimization Function

Next, we will define a function that takes a set of hyperparameters, trains a model with them, and returns a validation score (here, ROC AUC on a held-out test set). This function will be used to evaluate the performance of different sets of hyperparameters.

from bayes_opt import BayesianOptimization

import numpy as np
import pandas as pd

# scikit-learn imports
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


# We will use a custom CSV file as the data for this example.
# The CSV file has the following columns:
# - date, latitude, longitude, car, speed, ticketed, expected
def load_data_set(filename):
    try:
        return pd.read_csv(filename)
    except Exception as e:
        print(e)


# Define the internal objective function to optimize
def internal_method(C):
    # C: the SVC regularization hyperparameter to optimize for
    model = SVC(C=C)
    model.fit(X_train_scaled, y_train)
    y_score = model.decision_function(X_test_scaled)
    # Score the model by ROC AUC on the held-out test set
    return roc_auc_score(y_test, y_score)


# Load the custom CSV data set
test_dataset = load_data_set("car_speeding_tickets.csv")

# Use the numeric columns as features and the "expected" column as the target
X = test_dataset[["latitude", "longitude", "speed", "ticketed"]]
y = test_dataset["expected"]

# Create stratified training and test splits (fixed random_state for reproducibility)
X_trn, X_tst, y_train, y_test = train_test_split(X, y,
                                                 stratify=y,
                                                 random_state=12)

# Scale the features to the [0, 1] range
min_max_sclr = MinMaxScaler()
X_train_scaled = min_max_sclr.fit_transform(X_trn)
X_test_scaled = min_max_sclr.transform(X_tst)

# bayes_opt requires the parameter bounds to be a dictionary
bds = {"C": (0.1, 15)}

# Create the BayesianOptimization optimizer
optimizer = BayesianOptimization(f=internal_method,
                                 pbounds=bds,
                                 random_state=7,
                                 verbose=2)
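
The snippet above only constructs the optimizer. To actually run the search, bayes_opt provides the maximize method; the init_points and n_iter values below are illustrative choices, not part of the original snippet:

# Run a few random initial evaluations, then Bayesian-guided iterations
optimizer.maximize(init_points=5, n_iter=20)

# The best C value found, and its score, are stored in optimizer.max
print(optimizer.max)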

Step 3: Define Search Space and Optimization Procedure

Now, we can define the search space for the hyperparameters. This can be done using the space module from scikit-optimize. For example, if we have a hyperparameter learning_rate that can take on values between 0.01 and 1, and a hyperparameter num_hidden_units that can take on integer values between 10 and 100, we can define the search space like this:

from skopt.space import Real, Integer

learning_rate_space = Real(0.01, 1, prior='uniform', name='learning_rate')
num_hidden_units_space = Integer(10, 100, name='num_hidden_units')

# BayesSearchCV expects a dictionary mapping estimator parameter names to dimensions
search_space = {'learning_rate': learning_rate_space,
                'num_hidden_units': num_hidden_units_space}


Now, we can define the Bayesian optimization procedure using the BayesSearchCV class from scikit-optimize. By default, it uses a Gaussian process as the probabilistic surrogate model, together with an acquisition function (such as expected improvement) to select the next hyperparameters to evaluate.

from skopt import BayesSearchCV


optimizer = BayesSearchCV(
   # SomeModel is a placeholder; use an estimator whose parameters
   # match the names in search_space (learning_rate, num_hidden_units)
   estimator=SomeModel(),
   search_spaces=search_space,
   scoring='neg_mean_absolute_error',
   cv=5,
   n_iter=10,
   return_train_score=False,
   n_jobs=-1
)

Step 4: Fit the Optimizer to the Data

Finally, we can fit the optimizer to our data by calling the fit method:

optimizer.fit(X, y)

This will run the Bayesian optimization procedure for 10 iterations. At the end of the optimization process, the optimizer object will contain the best set of hyperparameters found, as well as the cross-validated performance of the model trained with those hyperparameters.
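
For reference, here is a minimal end-to-end sketch of the same pattern with a concrete estimator. It assumes an SVC classifier tuned over its C parameter on a synthetic dataset; both choices are illustrative and not part of the original tutorial:

from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic binary classification data, used only to keep the example self-contained
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

opt = BayesSearchCV(
    estimator=SVC(),
    search_spaces={'C': Real(0.1, 15, prior='log-uniform')},
    scoring='roc_auc',
    cv=5,
    n_iter=10,
    n_jobs=-1,
    random_state=0
)

opt.fit(X, y)
print(opt.best_params_, opt.best_score_)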

Step 5: View Best Set of Hyperparameters

We can access the best set of hyperparameters using the best_params_ attribute:

best_hyperparameters = optimizer.best_params_

We can access the cross-validated performance of the model trained with the best set of hyperparameters using the best_score_ attribute:

best_score = optimizer.best_score_
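
Because BayesSearchCV refits the best model on the full data by default (refit=True), the tuned estimator itself is also available; a small illustrative snippet:

# The refitted model with the best hyperparameters, ready for predictions
best_model = optimizer.best_estimator_
predictions = best_model.predict(X)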

That's it! You have successfully performed Bayesian hyperparameter optimization using scikit-learn and scikit-optimize.

Optimize Machine Learning Compute with Run:ai

The Run:ai platform takes the complexity out of distributed computing and provides unlimited compute power. It achieves this by pooling compute resources and leveraging them flexibly with elastic GPU clusters. Additional features such as a Kubernetes-based scheduler ensure training is never disrupted and that no machines are left idle. Together with HPO tools, these capabilities enable highly efficient tuning.

In addition, using our fractional GPU capabilities, experiments with a smaller hyperparameter space, which require less compute power, can utilize less GPU memory, freeing up additional GPU space and allowing more experiments to run in parallel (as opposed to using an entire GPU for each experiment). Combining Run:ai’s scheduling and fractional capabilities, experimentation can be sped up by 10x or more.

In one customer example, the Run:ai platform was able to spin up 6,000 HPO runs, each using one GPU. This ensured that at any given moment, there were 30 HPO runs executed simultaneously. The tuning was accomplished via Run:ai’s advanced scheduling features, built on top of Kubernetes. This solution also considerably reduced management overhead by eliminating the need for Python scripts, loops to ensure containers were up and running, and code to take care of failures, manage errors, etc.

Learn more about the Run:ai platform