What Is Hyperparameter Tuning and Top 5 Methods

What is Hyperparameter Tuning?

Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a machine learning model. Unlike model parameters, which are learned from the training data, hyperparameters are configuration values set before training begins, such as the learning rate or the number of hidden layers in a neural network. Tuning is an important step in the model development process, because the choice of hyperparameters can have a significant impact on the model's performance.

There are several approaches to machine learning model optimization, including model-centric and data-centric approaches. Model-centric approaches focus on the model itself, such as its structure, the algorithms used, and its hyperparameters, while data-centric approaches focus on improving the quality of the training data. Hyperparameter tuning is a model-centric approach: it typically involves searching for the optimal combination of hyperparameters within a predefined set of possible values.

An example of hyperparameter tuning is a grid search. In grid search, the data scientist or machine learning engineer defines a set of hyperparameter values to search over, and the algorithm tries all possible combinations of these values. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, grid search would try all possible combinations of these hyperparameters, such as a learning rate of 0.1 with one hidden layer, a learning rate of 0.1 with two hidden layers, and so on.
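As a simple illustration, the following sketch enumerates every combination of learning rate and hidden-layer count and keeps the best one. The candidate values and the train_and_evaluate function are illustrative stand-ins, not part of any particular library:

import itertools

# Hypothetical candidate values for two hyperparameters.
learning_rates = [0.1, 0.01, 0.001]
hidden_layers = [1, 2, 3]

def train_and_evaluate(lr, n_layers):
    # Stand-in for real training: return a validation score for this
    # combination. In practice this would train and evaluate a model.
    return -(lr - 0.01) ** 2 - (n_layers - 2) ** 2

best_score, best_params = float("-inf"), None
for lr, n_layers in itertools.product(learning_rates, hidden_layers):
    score = train_and_evaluate(lr, n_layers)
    if score > best_score:
        best_score, best_params = score, (lr, n_layers)

print("Best combination:", best_params)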

This is part of an extensive series of guides about machine learning

Understanding Hyperparameter Space and Distributions

The hyperparameter space is the set of all possible combinations of hyperparameters that can be used to train a machine learning model. It is a multidimensional space, with each dimension representing a different hyperparameter. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, the hyperparameter space would have two dimensions: one for the learning rate and one for the number of hidden layers.

The hyperparameter distribution is the distribution of hyperparameter values within the hyperparameter space. It defines the range of values that each hyperparameter can take on, as well as the probability of each value occurring.

In order to tune hyperparameters, it is necessary to search the hyperparameter space for the combination of hyperparameters that results in the best model performance. The choice of hyperparameter distribution can have a significant impact on the effectiveness of the search, as it determines the range of values that will be explored and the probability of each value being selected.
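To make the idea of a multidimensional space concrete, the two-dimensional example above can be written out explicitly; the values below are purely illustrative. Adding a dimension multiplies the number of candidate combinations, which is why exhaustive search quickly becomes expensive:

# Each key is one dimension of the hyperparameter space.
hyperparameter_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_layers": [1, 2, 3],
}

# The number of distinct combinations grows multiplicatively with each
# added dimension: 3 values x 3 values = 9 candidate points here.
n_combinations = 1
for values in hyperparameter_space.values():
    n_combinations *= len(values)
print(n_combinations)  # 9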

Learn more in our detailed guide to hyperparameter grid search (coming soon)

Types of Hyperparameter Distributions

There are several types of probability distributions that can be used to define the hyperparameter space in machine learning. These distributions determine the range of values that each hyperparameter can take on, as well as the probability of each value occurring.

Common hyperparameter distributions include:

  • Uniform distribution: All values within a given range are equally likely to be chosen. The uniform distribution is often used when the range of possible values is known and there is no reason to favor one value over another.
  • Normal distribution (Gaussian distribution): The normal distribution is a continuous distribution that is symmetrical around its mean, and it is often used to model variables that are influenced by many factors.
  • Log-normal distribution: the distribution of a random variable whose logarithm is normally distributed. The log-normal distribution is often used when the variable of interest is positive and right-skewed, spanning several orders of magnitude; a learning rate is a typical example.

There are many other probability distributions that can be used in machine learning, including the exponential distribution, the gamma distribution, and the beta distribution. The choice of probability distribution can have a significant impact on the effectiveness of the hyperparameter search, as it determines the range of values that will be explored and the probability of each value being selected.
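A minimal sketch of drawing hyperparameter values from some of these distributions using scipy.stats is shown below. The specific hyperparameters, ranges, and shape parameters are illustrative assumptions only:

import numpy as np
from scipy.stats import uniform, norm, lognorm

# Uniform: any dropout rate between 0.0 and 0.5 is equally likely.
dropout = uniform(loc=0.0, scale=0.5).rvs(5, random_state=0)

# Normal: momentum values clustered around a mean of 0.9.
momentum = norm(loc=0.9, scale=0.02).rvs(5, random_state=1)

# Log-normal: positive, right-skewed values, useful for a learning rate
# that may span several orders of magnitude.
learning_rate = lognorm(s=1.0, scale=np.exp(-5)).rvs(5, random_state=2)

print(dropout, momentum, learning_rate, sep="\n")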

5 Hyperparameter Optimization Algorithms and Methods

Manual Search

Manual search is a method of hyperparameter tuning in which the data scientist or machine learning engineer manually selects and adjusts the hyperparameters of the model. This method is often used when the number of hyperparameters is relatively small and the model is simple, as it allows the data scientist to have fine-grained control over the hyperparameters.

To use the manual search method, the data scientist defines a set of possible values for each hyperparameter, and then manually selects and adjusts the values until the model performance is satisfactory. For example, the data scientist might start with a learning rate of 0.1 and gradually increase or decrease it until the model's accuracy is maximized.
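In code, manual search often amounts to editing a value, retraining, and comparing scores by hand. The sketch below uses scikit-learn's iris dataset and a logistic regression purely as a stand-in for whatever model is being tuned:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Hand-picked value; edit, re-run, and compare scores between runs.
C = 1.0  # previously tried, say, 0.01 and 0.1 by hand

model = LogisticRegression(C=C, max_iter=1000)
score = cross_val_score(model, X, y, cv=5).mean()
print(f"C={C}: mean CV accuracy = {score:.3f}")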

Pros and cons: The manual search method can be time-consuming and may require significant trial and error to find the optimal combination of hyperparameters. It is also prone to human error, as the data scientist may overlook certain combinations of hyperparameters or may not be able to accurately assess the impact of each hyperparameter on the model's performance.

Grid Search

Grid search is a method of hyperparameter tuning that involves training a model for every possible combination of hyperparameters in a predefined set.

To use the grid search method, the data scientist or machine learning engineer defines a set of possible values for each hyperparameter, and then the algorithm generates all possible combinations of these values. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, the grid search algorithm would try all possible combinations of these hyperparameters, such as a learning rate of 0.1 with one hidden layer, a learning rate of 0.1 with two hidden layers, and so on.

For each combination of hyperparameters, the model is trained and evaluated using a specified metric, such as accuracy or F1 score. The combination of hyperparameters that results in the best model performance is then chosen as the optimal set.
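With scikit-learn, this procedure is available as GridSearchCV. The model, parameter grid, and scoring choice below are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in this grid (3 x 3 = 9) is trained and cross-validated.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)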

Pros and cons: The grid search method is computationally intensive, as it requires training a separate model for each combination of hyperparameters. It is also limited by the predefined set of possible values for each hyperparameter, which may not include the optimal values. Despite these limitations, the grid search method is widely used due to its simplicity and effectiveness, particularly for smaller and less complex models.

Random Search

Random search is a method of hyperparameter tuning that involves randomly selecting a combination of hyperparameters from a predefined set and training a model using those hyperparameters.

To use the random search method, the data scientist or machine learning engineer defines a set of possible values for each hyperparameter, and then the algorithm randomly selects a combination of these values. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, the random search algorithm might randomly select a learning rate of 0.1 and two hidden layers.

The model is then trained and evaluated using a specified metric, such as accuracy or F1 score. The process is repeated a predefined number of times, and the combination of hyperparameters that results in the best model performance is chosen as the optimal set.
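scikit-learn provides this procedure as RandomizedSearchCV; here the number of sampled combinations and the log-uniform distributions are illustrative assumptions:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous distributions to sample from instead of a fixed grid.
param_distributions = {
    "C": loguniform(1e-3, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=20,          # number of random combinations to evaluate
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)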

Pros and cons: Random search is less systematic than grid search and offers no guarantee of finding the optimal combination, since parts of the hyperparameter space may never be sampled. In practice, however, it often reaches good configurations with far fewer trials than grid search, because each trial explores new values of every hyperparameter instead of revisiting points on a fixed grid. The random search method is widely used due to its simplicity and ease of implementation.

Bayesian Optimization

Bayesian optimization (sometimes called Bayesian search) is a method of hyperparameter tuning that uses a probabilistic model to guide the search for the optimal combination of hyperparameters for a machine learning model.

Bayesian optimization works by building a probabilistic model of the objective function (in this case, the performance of the machine learning model) from the hyperparameter values that have been tried so far. This surrogate model is used to choose the next set of hyperparameters to try, typically by maximizing the expected improvement in model performance. The process is repeated iteratively until the evaluation budget is exhausted or performance stops improving.

One key advantage of Bayesian optimization is that it can make use of any available information about the objective function, including previous evaluations of the model performance and constraints on the hyperparameter values. This allows it to more efficiently explore the hyperparameter space and find the optimal combination of hyperparameters.
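As one illustration, Optuna's default sampler (a tree-structured Parzen estimator, which is one form of Bayesian optimization) can drive this loop. The model, dataset, and search ranges below are illustrative assumptions:

import optuna
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(trial):
    # Each trial proposes hyperparameters based on a probabilistic model
    # of how previous trials performed.
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e0, log=True)
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

print(study.best_params, study.best_value)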

Pros and cons: Bayesian optimization is a more complex method of hyperparameter tuning than grid search or random search, and it requires more computational resources. However, it can be more effective at finding the optimal set of hyperparameters, particularly for larger and more complex models. It is also well-suited to situations where the objective function is noisy or expensive to evaluate.

Learn more in our detailed guide to Bayesian Hyperparameter Optimization

Hyperband

Hyperband is a method of hyperparameter tuning that uses a bandit-based approach to efficiently search the hyperparameter space.

Hyperband works by combining random sampling of hyperparameter configurations with successive halving of the training budget. A large number of configurations is first trained with a small budget (for example, a few epochs), and each is evaluated using a specified metric, such as accuracy or F1 score. Only the best-performing fraction of configurations survives to the next round, where each survivor receives a larger training budget. Hyperband repeats this procedure across several "brackets" that trade off the number of configurations against the budget given to each, so that promising configurations are identified without fully training every candidate.
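The sketch below shows a single, highly simplified successive-halving pass, which is the core routine Hyperband runs inside each bracket. The train_for_budget function is a toy stand-in for actual training at a given budget, and all names and values here are illustrative assumptions:

import random

def train_for_budget(config, budget):
    # Stand-in for training a model for `budget` epochs and returning a
    # validation score; here the score improves with budget plus noise.
    lr, layers = config
    return -(lr - 0.01) ** 2 - (layers - 2) ** 2 + 0.01 * budget + random.gauss(0, 0.01)

def successive_halving(configs, min_budget=1, eta=3):
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every surviving configuration at the current budget.
        scores = [(train_for_budget(c, budget), c) for c in configs]
        scores.sort(reverse=True)
        # Keep the best 1/eta fraction and give them eta times more budget.
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

random.seed(0)
candidates = [(random.choice([0.001, 0.01, 0.1]), random.randint(1, 4)) for _ in range(27)]
print(successive_halving(candidates))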

Pros and cons: One key advantage of Hyperband is that it quickly eliminates unpromising configurations and concentrates resources on the most promising ones, which saves time and computational resources. It is also well suited to situations where the objective function is noisy or expensive to evaluate. On the other hand, it assumes that performance after a small training budget is a good predictor of performance after full training, which does not hold for every model or dataset.

Learn more in our detailed guides to hyperparameter optimization (coming soon)

Hyperparameter Tuning Management with Run:ai

The Run:ai platform takes the complexity out of distributed computing and provides unlimited compute power. It achieves this by pooling compute resources and leveraging them flexibly with elastic GPU clusters. Additional features such as a Kubernetes-based scheduler ensure training is never disrupted and that no machines are left idle. Together with HPO tools, these capabilities enable highly efficient tuning.

In addition, using our fractional GPU capabilities, experiments with a smaller hyperparameter space, which require less compute power, can utilize less GPU memory, freeing up additional GPU space and allowing more experiments to run in parallel (as opposed to using an entire GPU for each experiment). Combining Run:ai’s scheduling and fractional capabilities, experimentation can be sped up by 10x or more.

In one customer example, the Run:ai platform was able to spin up 6,000 HPO runs, each using one GPU, while ensuring that at any given moment 30 HPO runs were executing simultaneously. The tuning was accomplished via Run:ai's advanced scheduling features, built on top of Kubernetes. This solution also considerably reduced management overhead by eliminating the need for custom Python scripts, loops to ensure containers were up and running, and code to handle failures and errors.

Get started with Run:ai today!

See Our Additional Guides on Key Machine Learning Topics

Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of machine learning.

Multi GPU