Question 1

What Is Hyperparameter Tuning?

Accepted Answer

Hyperparameters define different aspects of the machine learning model, such as learning rate, batch size, the number of layers in the neural network, the maximum depth of the decision tree, and the number of trees that should be included in a random forest. As opposed to model parameters, hyperparameters cannot be calculated through model training but must be selected ahead of time through experimentation by the researcher.

Hyperparameter tuning or hyperparameter optimization (HPO) refers to the search for optimal hyperparameters, i.e., the ideal model structure. Once the model is defined, the space of possible hyperparameter values is scanned and sampled for potential candidates, which are then tested and validated.

In theory, discovering the optimal values would require checking every possible value, which would be far too expensive and time-consuming. There are, however, a number of more efficient and intelligent methods for exploring the range of possible values.

Question 2

Hyperparameter Optimization Methods

Accepted Answer

In the sections that follow, we will explore three of the most popular methods for searching the hyperparameter space for optimal values.

Grid Search
Grid search is the most basic and intuitive method of hyperparameter tuning. A model is built for each possible combination of hyperparameter values. Each model is then evaluated, and the optimal hyperparameters with the best-performing architecture are selected. This approach, however, can be extremely inefficient and time-consuming, especially in cases where there are different hyperparameters with many possible values.

Random Search
As opposed to grid search, where all possible values are examined, random search randomly samples hyperparameter values from a statistical distribution. In addition, unlike grid search, which blindly explores the entire hyperparameter space, random search allows you to limit the number of times the space is sampled. This enables you to focus on sampling the most important hyperparameters—those with the greatest impact on model quality. This, in turn, saves precious time and resources.

Bayesian Optimization
Bayesian optimization is a sequential model-based optimization (SMBO) algorithm that allows the search method to continuously improve based on previous calculations. Using a Gaussian model, the hyperparameter values to be sampled are calculated based on the model score of previously tested values. Because it learns from past experiments, this process reaches an optimum much faster than both grid and random search.

Question 3

The Challenges of Hyperparameter Tuning

Accepted Answer

Hyperparameters determine a model’s behavior and predictive power. Choosing the right hyperparameters can thus make or break a model. This crucial process poses a number of challenges.

First, the process of discovering the optimal hyperparameter values requires expertise as well as time and resources. This is especially true in the case of complex models with multiple hyperparameters, where manually tuning becomes virtually impossible. Take, for example, a neural network architecture with 12 hyperparameters. Exploring just seven of its integer-valued hyperparameters leads to 450,000 possible tuning configurations. Testing all these options would be incredibly inefficient and costly.

Second, simply fitting hyperparameters to the training data isn’t enough. The values must also be verified with validation data sets to assess the model’s ability to make generalizations. Without this vital step, the tuning is incomplete, and the model could be compromised.

Third, tuning is often an ongoing task. Some cases may require continuous tweaking of hyperparameters. Netflix, for example, uses a different data set for each of its geographical regions. It therefore needs to perform hyperparameter tuning for each region and train a separate neural network for each combination. Companies with multiple data sets or data that is frequently changing must perform continuous hyperparameter tuning.

Finally, in distributed training, where the batch size is divided across available GPUs, tuning becomes even more complicated. The batch size directly affects the quality of the model. When the batch changes in size based on GPU availability, it is crucial to change hyperparameters such as learning rate accordingly. Failing to do this can result in degraded model quality and significantly extend overall training time, no matter how many GPUs are available.

These challenges can be reduced considerably with automatic hyperparameter optimization. A number of tools and libraries have cropped up in recent years to facilitate automatic HPO. These tools can accelerate training time, even more when the underlying GPU utilization is optimized.

Question 4

How to Speed Up HPO by Optimizing GPU Usage

Accepted Answer

Training models on CPU memory can take hours or even days. Hyperparameter tuning adds even more time to this equation.

HPO libraries (e.g., Ray Tune, Optuna, Amazon SageMaker, and Spearmint) accelerate tuning through automation. They provide sophisticated tuning methods (such as Bayesian optimization) and are often equipped with advanced scheduling functionality. This enables the termination of unpromising trials, thus reaching an optimal solution faster and more efficiently. When combined with distributed training, these solutions are even more powerful.

Distributing training splits the training workload across multiple GPUs that run in parallel. This computational approach not only saves valuable training time but can also be used to dramatically speed up HPO. Automatic queuing tools, such as Run:AI and Celery, can be used together with HPO libraries to calculate hyperparameters across multiple GPUs in parallel.

The combination of HPO with multiple GPUs offers major benefits:

Scaling: Data scientists can scale the HPO process to hundreds or even thousands of GPUs that work simultaneously, allowing experiments to run in parallel rather than sequentially. This approach enables teams to retain the highest-performing experiments without having to wait for the results of less-successful experiments and dramatically accelerates training time.
Reduced management overhead: Data scientists no longer need to worry about managing underlying resources. Instead of using manual scripts, these tools automatically manage the queue, launching thousands of runs when the resources are available. This greatly simplifies HPO and ensures the cluster is always fully utilized.
Accelerated training: Replacing the tedious, slow, and serial manual methods with automation and parallel computing significantly increases HPO efficiency and reduces overall training time.
Data science teams have begun to use this approach and have achieved impressive results. In the Netflix example referenced above, the company used Spearmint to perform Bayesian optimization and used Celery to manage the distributed training across multiple GPUs. This enabled Netflix to tune, train, and update multiple neural networks for each of its international regions on a daily basis. In another example, a random forest classifier was used to perform HPO using a distributed approach. As a result, tuning was achieved 30 times faster, and classifier accuracy increased by 5%.

Automate Hyperparameter Tuning Across Multiple GPU

A Guide

What Is Hyperparameter Tuning?

Hyperparameter Optimization Methods

Grid Search

Random Search

Bayesian Optimization

The Challenges of Hyperparameter Tuning

How to Speed Up HPO by Optimizing GPU Usage

Accelerating Hyperparameter Tuning with Run:AI

Summary