Automate Hyperparameter Tuning Across Multiple GPUs

A Guide

In this post, we will review how hyperparameters and hyperparameter tuning play an important role in the design and training of machine learning models. Choosing optimal hyperparameter values directly influences the architecture and quality of the model. This crucial process also happens to be one of the most difficult, tedious, and complicated tasks in machine learning training.

This blog post explains what hyperparameter tuning is, how it can be automated, and how optimizing GPU usage can radically accelerate training time and enhance model quality.

Related content: read our guide to Multi GPU

What Is Hyperparameter Tuning?

Hyperparameters define different aspects of the machine learning model, such as learning rate, batch size, the number of layers in the neural network, the maximum depth of the decision tree, and the number of trees that should be included in a random forest. As opposed to model parameters, hyperparameters cannot be calculated through model training but must be selected ahead of time through experimentation by the researcher.
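
To make the distinction concrete, here is a minimal sketch using scikit-learn (the model and values are purely illustrative): hyperparameters are passed in before training, while model parameters are learned during fitting.

    # Illustrative sketch (scikit-learn): hyperparameters are chosen before
    # training, while model parameters are learned from the data by fit().
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Hyperparameters: selected by the researcher ahead of time.
    model = RandomForestClassifier(n_estimators=100, max_depth=5)

    # Model parameters (the fitted trees and their split thresholds) are
    # learned here; they cannot be chosen in advance.
    model.fit(X, y)
    print(model.feature_importances_)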

Hyperparameter tuning or hyperparameter optimization (HPO) refers to the search for optimal hyperparameters, i.e., the ideal model structure. Once the model is defined, the space of possible hyperparameter values is scanned and sampled for potential candidates, which are then tested and validated.

In theory, discovering the optimal values would require checking every possible value, which would be far too expensive and time-consuming. There are, however, a number of more efficient and intelligent methods for exploring the range of possible values.

Hyperparameter Optimization Methods

In the sections that follow, we will explore three of the most popular methods for searching the hyperparameter space for optimal values.

Grid Search

Grid search is the most basic and intuitive method of hyperparameter tuning. A model is built for each possible combination of hyperparameter values. Each model is then evaluated, and the hyperparameters of the best-performing model are selected. This approach, however, can be extremely inefficient and time-consuming, especially when there are many hyperparameters, each with many possible values.
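
As a minimal sketch of the idea (using scikit-learn's GridSearchCV; the parameter grid is illustrative, not a recommendation):

    # Grid search sketch: every combination in the grid is trained and
    # cross-validated, and the best-scoring combination is kept.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 10],
    }  # 3 x 3 = 9 candidate models

    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)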

Random Search

As opposed to grid search, where all possible combinations are examined, random search samples hyperparameter values from statistical distributions. In addition, unlike grid search, which blindly explores the entire hyperparameter space, random search lets you limit the number of times the space is sampled. Because each trial draws a fresh value for every hyperparameter, the search budget is spread across more distinct values of the hyperparameters that matter most, those with the greatest impact on model quality. This, in turn, saves precious time and resources.
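
A comparable sketch with scikit-learn's RandomizedSearchCV (the distributions and the n_iter budget are illustrative) shows how the number of samples is capped:

    # Random search sketch: configurations are drawn from distributions,
    # and n_iter caps how many times the space is sampled.
    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    param_distributions = {
        "n_estimators": randint(50, 300),
        "max_depth": randint(3, 15),
    }

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions,
        n_iter=20,   # only 20 sampled configurations, regardless of grid size
        cv=3,
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)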

Bayesian Optimization

Bayesian optimization is a sequential model-based optimization (SMBO) algorithm that allows the search to improve continuously based on previous evaluations. Using a probabilistic surrogate, commonly a Gaussian process, the next hyperparameter values to sample are chosen based on the scores of previously tested values. Because it learns from past experiments, this process typically reaches an optimum much faster than both grid and random search.
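
The sketch below uses scikit-optimize's Gaussian-process-based gp_minimize to illustrate the loop; the toy objective stands in for training a model with the sampled hyperparameters and returning its validation loss.

    # Bayesian optimization sketch: gp_minimize fits a Gaussian process to
    # the trials seen so far and uses it to pick the next point to test.
    from skopt import gp_minimize
    from skopt.space import Integer, Real

    def objective(params):
        learning_rate, num_layers = params
        # Stand-in for "train with these hyperparameters, return val. loss".
        return (learning_rate - 0.01) ** 2 + (num_layers - 4) ** 2

    result = gp_minimize(
        objective,
        [Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
         Integer(1, 10, name="num_layers")],
        n_calls=20,  # total number of evaluations
        random_state=0,
    )
    print(result.x, result.fun)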

The Challenges of Hyperparameter Tuning

Hyperparameters determine a model’s behavior and predictive power. Choosing the right hyperparameters can thus make or break a model. This crucial process poses a number of challenges.

First, the process of discovering the optimal hyperparameter values requires expertise as well as time and resources. This is especially true in the case of complex models with multiple hyperparameters, where manual tuning becomes virtually impossible. Take, for example, a neural network architecture with 12 hyperparameters. Exploring just seven of its integer-valued hyperparameters leads to 450,000 possible tuning configurations. Testing all these options would be incredibly inefficient and costly.

Second, simply fitting hyperparameters to the training data isn't enough. The chosen values must also be verified against a validation data set to assess the model's ability to generalize. Without this vital step, the tuning is incomplete, and the model could be compromised.

Third, tuning is often an ongoing task. Some cases may require continuous tweaking of hyperparameters. Netflix, for example, uses a different data set for each of its geographical regions. It therefore needs to perform hyperparameter tuning for each region and train a separate neural network for each combination. Companies with multiple data sets or data that is frequently changing must perform continuous hyperparameter tuning.

Finally, in distributed training, where the batch is divided across available GPUs, tuning becomes even more complicated. The batch size directly affects the quality of the model. When the batch size changes based on GPU availability, it is crucial to adjust hyperparameters such as the learning rate accordingly. Failing to do so can degrade model quality and significantly extend overall training time, no matter how many GPUs are available.
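
A common heuristic for this adjustment is the linear scaling rule: when the global batch size grows with the number of GPUs, scale the learning rate by the same factor. The values below are illustrative.

    # Linear scaling rule sketch (illustrative values): the learning rate
    # is scaled in proportion to the global batch size across GPUs.
    base_lr = 0.1              # learning rate tuned for a single GPU
    per_gpu_batch_size = 64
    num_gpus = 8

    global_batch_size = per_gpu_batch_size * num_gpus  # 512
    scaled_lr = base_lr * num_gpus                      # 0.8

    print(f"global batch size: {global_batch_size}, learning rate: {scaled_lr}")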

These challenges can be reduced considerably with automatic hyperparameter optimization. A number of tools and libraries have cropped up in recent years to facilitate automatic HPO. These tools can accelerate tuning and training, even more so when the underlying GPU utilization is optimized.

How to Speed Up HPO by Optimizing GPU Usage

Training models on CPUs can take hours or even days. Hyperparameter tuning adds even more time to this equation.

HPO libraries (e.g., Ray Tune, Optuna, Amazon SageMaker, and Spearmint) accelerate tuning through automation. They provide sophisticated tuning methods (such as Bayesian optimization) and are often equipped with advanced scheduling functionality. This enables the termination of unpromising trials, thus reaching an optimal solution faster and more efficiently. When combined with distributed training, these solutions are even more powerful.
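
As an example of such scheduling, the Optuna sketch below reports intermediate scores so a pruner can terminate unpromising trials early; the "training loop" here is a toy stand-in for a real one.

    # Optuna sketch: intermediate results are reported each step, and the
    # pruner stops trials whose results look unpromising.
    import optuna

    def objective(trial):
        lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
        score = 0.0
        for step in range(10):
            # Toy metric standing in for validation accuracy after each epoch.
            score = (1.0 - abs(lr - 0.01)) * (step + 1) / 10
            trial.report(score, step)
            if trial.should_prune():
                raise optuna.TrialPruned()
        return score

    study = optuna.create_study(
        direction="maximize", pruner=optuna.pruners.MedianPruner()
    )
    study.optimize(objective, n_trials=30)
    print(study.best_params)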

Distributed training splits the training workload across multiple GPUs that run in parallel. This computational approach not only saves valuable training time but can also be used to dramatically speed up HPO. Automatic queuing tools, such as Run:AI and Celery, can be used together with HPO libraries to evaluate hyperparameter configurations across multiple GPUs in parallel.
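
To illustrate the idea with Ray Tune, one of the libraries mentioned above (this sketch uses the classic tune.run interface; newer Ray releases expose the same idea through the Tuner API, and the trainable is a placeholder for a real training function), each trial requests one GPU so that as many trials run concurrently as there are free GPUs.

    # Ray Tune sketch (classic tune.run API): each trial is allocated one
    # GPU, so trials run in parallel across the GPUs the cluster exposes.
    from ray import tune

    def trainable(config):
        # Placeholder for a real training loop that uses config["lr"].
        score = 1.0 - abs(config["lr"] - 0.01)
        tune.report(score=score)

    analysis = tune.run(
        trainable,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=50,                   # 50 trials are queued
        resources_per_trial={"gpu": 1},   # one GPU per trial
    )
    print(analysis.get_best_config(metric="score", mode="max"))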

The combination of HPO with multiple GPUs offers major benefits:

  • Scaling: Data scientists can scale the HPO process to hundreds or even thousands of GPUs that work simultaneously, allowing experiments to run in parallel rather than sequentially. This approach enables teams to retain the highest-performing experiments without having to wait for the results of less-successful experiments and dramatically accelerates training time.
  • Reduced management overhead: Data scientists no longer need to worry about managing underlying resources. Instead of using manual scripts, these tools automatically manage the queue, launching thousands of runs when the resources are available. This greatly simplifies HPO and ensures the cluster is always fully utilized.
  • Accelerated training: Replacing the tedious, slow, and serial manual methods with automation and parallel computing significantly increases HPO efficiency and reduces overall training time.

Data science teams have begun to use this approach and have achieved impressive results. In the Netflix example referenced above, the company used Spearmint to perform Bayesian optimization and used Celery to manage the distributed training across multiple GPUs. This enabled Netflix to tune, train, and update multiple neural networks for each of its international regions on a daily basis. In another example, a random forest classifier was used to perform HPO using a distributed approach. As a result, tuning was achieved 30 times faster, and classifier accuracy increased by 5%.

Accelerating Hyperparameter Tuning with Run:AI

As we have seen in the previous section, distributed training can play a crucial role in efficient hyperparameter tuning. But to fully leverage the power of distributed infrastructure, it’s not enough to have great tuning tools; you must also ensure that compute resources are fully optimized.

The Run:AI deep learning platform takes the complexity out of distributed computing and provides unlimited compute power. It achieves this by pooling compute resources and leveraging them flexibly with elastic GPU clusters. Additional features such as gradient accumulation and a Kubernetes-based scheduler ensure training is never disrupted and that no machines are left idle. Together with HPO tools, these capabilities enable highly efficient tuning.

In addition, using our fractional GPU capabilities, experiments with a smaller hyperparameter space, which require less compute power, can utilize only part of a GPU's memory. This frees up GPU capacity and allows more experiments to run in parallel, as opposed to dedicating an entire GPU to each experiment. Combining Run:AI's scheduling and fractional capabilities, experimentation can be sped up by 10x or more.

In one customer example, the Run:AI platform spun up 6,000 HPO runs, each using one GPU, ensuring that at any given moment 30 HPO runs were executing simultaneously. The tuning was accomplished via Run:AI's advanced scheduling features, built on top of Kubernetes. This solution also considerably reduced management overhead by eliminating the need for Python scripts, loops to keep containers up and running, and code to handle failures, manage errors, and so on.

For details on how Run:AI hyperparameter optimization works, see our quickstart guide.

Summary

Hyperparameter tuning is an important aspect of machine learning and can considerably reduce the time it takes to train a model while improving its overall quality. Because hyperparameters cannot be learned in the conventional way, discovering their optimal values can be challenging, time-consuming, and costly, and typically requires a lot of trial and error. In addition, in many cases tuning doesn't just happen once but is an ongoing process, which can create bottlenecks. Calculating these values as efficiently and accurately as possible is therefore crucial.

Using automatic HPO tools in conjunction with platforms that optimize parallel GPU usage, such as Run:AI, allows HPO to be scaled across many GPUs at once. Tedious manual scripts can be replaced with automatic queuing, ensuring resources are fully utilized and simplifying and accelerating HPO, and model training in general.