How to Evaluate Machine Learning Models, Part 4: Hyperparameter Tuning
How to Evaluate Machine Learning Models, Part 4: Hyperparameter Tuning
In the realm of machine learning, hyperparameter tuning is a “meta” learning task. It happens to be one of my favorite subjects because it can appear like black magic, yet its secrets are not impenetrable. In this post, I'll walk through what is hyperparameter tuning, why it's hard, and what kind of smart tuning methods are being developed to do something about it.
First, let’s clarify some basic concepts. Machine learning models are basically mathematical functions that represent the relationship between different aspects of data. For instance, a linear regression model uses a line to represent the relationship between “features” and “target.” The formula looks like this:
where x is a vector that represents features of the data and y is a scalar variable that represents the target (some numeric quantity that we wish to learn to predict).
This model assumes that that the relationship between x and y is linear. The variable w is a weight vector that represents the normal vector for the line; it specifies the slope of the line. This is what’s known as a model parameter, and it is learned during the training phase. “Training a model” involves using an optimization procedure to determine the best model parameter that “fits” the data. (Krishna’s blog post on Parallel SGD gives a great introduction on optimization methods for model training.)
What is a hyperparameter? Why is it important?
There is another set of parameters known as hyperparameters, sometimes also knowns as “nuisance parameters.” These are values that must be specified outside of the training procedure. Vanilla linear regression doesn’t have any hyperparameters. But variants of linear regression do.Ridge regression and lasso both add a regularization term to linear regression; the weight for the regularization term is called the regularization parameter. Decision trees have hyperparameters such as the desired depth and number of leaves in the tree. Support Vector Machines require setting a misclassification penalty term. Kernelized SVM require setting kernel parameters like the width for RBF kernels. The list goes on.
This type of hyperparameter controls the capacity of the model, i.e., how flexible the model is, how many degrees of freedom it has in fitting the data. Proper control of model capacity can preventoverfitting, which happens when the model is too flexible, and the training process adapts too much to the training data, thereby losing predictive accuracy on new test data. So a proper setting of the hyperparameters is important.
Another type of hyperparameters comes from the training process itself. For instance, stochastic gradient descent optimization requires a learning rate or a learning schedule. Some optimization methods require a convergence threshold. Random forests and boosted decision trees require knowing the number of total trees. (Though this could also be classified as a type of regularization hyperparameter.) These also need to be set to reasonable values in order for the training process to find a good model.
Hyperparameter tuning
Hyperparameter settings could have a big impact on the prediction accuracy of the trained model. Optimal hyperparameter settings often differ for different datasets. Therefore they should be tuned for each dataset. Since the training process doesn’t set the hyperparameters, there needs to be a meta process that tunes the hyperparameters. This is what we mean by hyperparameter tuning.
Hyperparameter tuning is a meta-optimization task. As the figure shows, each trial of a particular hyperparameter setting involves training a model--an inner optimization process. The outcome of hyperparameter tuning is the best hyperparameter setting, and the outcome of model training is the best model parameter setting.

Illustration of the hyperparameter tuning machine.
For each proposed hyperparameter setting, the inner model training process comes up with a model for the dataset and outputs evaluation results on hold-out or cross validation datasets. After evaluating a number of hyperparameter settings, the hyperparameter tuner outputs the setting that yields the best performing model. The last step is to train a new model on the entire dataset (training and validation) under the best hyperparameter setting. Here is the pseudo-code. (The training and validation step can be conceptually replaced with a cross-validation step.)
Hyperparameter_tuning (training_data, validation_data, hp_list):
hp_perf = []
foreach hp_setting in hp_list:
m = train_model(training_data, hp_setting)
validation_results = eval_model(m, validation_data)
hp_perf.append(validation_results)
best_hp_setting = hp_list[max_index(hp_perf)]
best_m = train_model(training_data.append(validation_data), best_hp_setting)
return (best_hp_setting, best_m)
Hyperparameter tuning algorithms
Conceptually, hyperparameter tuning is an optimization task, just like model training.
However, these two tasks are quite different in practice. When training a model, the quality of a proposed set of model parameters can be written as a mathematical formula (usually called the loss function). When tuning hyperparameters, however, the quality of those hyperparameters cannot be written down in a closed-form formula, because it depends on the outcome of a blackbox (the model training process).
This is why hyperparameter tuning is much harder. Up until a few years ago, the only available methods were grid search and random search. In the last few years, there’s been increased interest in auto-tuning. Several research groups have worked on the problem, published papers, and released new tools.
Grid search
Grid search, true to its name, picks out a grid of hyperparameter values, evaluates every one of them, and returns the winner. For example, if the hyperparameter is the number of leaves in a decision tree, then the grid could be 10, 20, 30, …, 100. For regularization parameters, it’s common to use exponential scale: 1e-5, 1e-4, 1e-3, … 1. Some guess work is necessary to specify the minimum and maximum values. So sometimes people run a small grid, see if the optimum lies at either end point, and then expand the grid in that direction. This is called manual grid search.
Grid search is dead simple to set up and trivial to parallelize. It is the most expensive method in terms of total computation time. However, if run in parallel, it is fast in terms of wall clock time.
Random search
I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of “Random search for hyperparameter optimization” by Bergstra and Bengio. Random search is a slight variation on grid search. Instead of searching over the entire grid, random search only evaluates a random sample of points on the grid. This makes random search a lot cheaper than grid search. Random search wasn’t taken very seriously before. This is because it doesn’t search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then came along Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough.
In hindsight, there is a simple probabilistic explanation for the result: for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability. That may sound complicated, but it’s not. Imagine the 5% interval around the true maximum. Now imagine that we sample points from this space and see if any of it lands within that maximum. Each random draw has a 5% chance of landing in that interval, if we draw n points independently, then the probability that all of them miss the desired interval is (1−0.05)n. So the probability that at least one of them succeeds in hitting the interval is 1 minus that quantity. We want at least a .95 probability of success. To figure out the number of draws we need, just solve for n in the equation:
We get n>=60. Ta-da!
The moral of the story is: if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability.
With its utter simplicity and surprisingly reasonable performance, random search is my to-go method for hyperparameter tuning. It’s trivially parallelizable, just like grid search, but it takes much fewer tries and performance almost as well most of the time.
Smart hyperparameter tuning
Smarter tuning methods are available. Unlike the “dumb” alternatives of grid search and random search, smart hyperparameter tuning are much less parallelizable. Instead of generating all the candidate points upfront and evaluating the batch in parallel, smart tuning techniques pick a few hyperparameter settings, evaluate their quality, then decide where to sample next. This is an inherently iterative and sequential process. It is not very parallelizable. The goal is to make fewer evaluations overall and save on the overall computation time. If wall clock time is your goal, and you can afford multiple machines, then I suggest sticking to random search.
Buyer beware: Smart search algorithms require computation time to figure out where to place the next set of samples. Some algorithms require much more time than others. Hence it only makes sense if the evaluation procedure--the inner optimization box--takes much longer than the process of evaluating where to sample next. Smart search algorithms also contain parameters of their own that need to be tuned. (Hyper-hyperparameters?) Sometimes tuning the hyper-hyperparameters is crucial to make it faster than random search.
Recall that hyperparameter tuning is difficult because we cannot write down the actual mathematical formula for the function we’re optimizing. (The technical term for the function that is being optimized is response surface.) Consequently, we don’t have the derivative of that function, and therefore most of the mathematical optimization tools that we know and love, such as the Newton method or SGD, cannot be applied.
I will highlight three smart tuning methods proposed in recent years: derivative-free optimization, Bayesian optimization, and random forest smart tuning. Derivative-free methods employ heuristics to determine where to sample next. Bayesian optimization and random forest smart tuning both model the response surface with another function, then sample more points based on what the model says.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams used Gaussian processes to model the response function and something called Expected Improvement to determine the next proposals. Gaussian processes are trippy; they specify distributions over *functions*. When one samples from a Gaussian process, one generates an entire function. Training a Gaussian process adapts this distribution over the data at hand, so that it generates functions that are more likely to model all of the data at once. Given the current estimate of the Gaussian process, one can compute the amount of expected improvement of any point over the current optimum--the Expected Improvement. They showed that this procedure of modeling the hyperparameter response surface and generating the next set of proposed hyperparameter settings can beat the evaluation cost of manual tuning.
Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown suggested training a random forest of regression trees to approximate the response surface. New points are sampled based on where the random forest considers to be the optimal regions. They call this SMAC (Sequential Model-based Algorithm Configuration). Word on the street is that this method works better than Gaussian processes for categorical hyperparameters.
Derivative-free optimization, as the name suggests, is a branch of mathematical optimization for situations where there is no derivative information. Notable derivative-free methods includegenetic algorithm and Nelder-Mead. Essentially, the algorithms boil down to: try a bunch of random points, approximate the gradient, find the most likely search direction, and go there. A few years ago, Misha Bilenko and I tried Nelder-Mead for hyperparameter tuning. We found the algorithm delightfully easy to implement and no less efficient that Bayesian optimization.
Other posts in this series
Part 1: Orientation
Part 2a: Classification metrics
Part 2b: Ranking and regression metrics
Part 3: Validation and offline testing
Software packages
Grid search and random search: GraphLab Create, scikit-learn.
Bayesian optimization using Gaussian processes: Spearmint (from Jasper et al.)
Random forest tuning: SMAC (from Hutter et al.)
Hyper gradient: hypergrad (from Maclaurin et al.)
Further reading
Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio. Journal of Machine Learning Research, 2012.
Algorithms for hyper-parameter optimization, by James Bergstra, Rémi Bardenet, Yoshua Bengio and Balázs Kégl. Neural Information Processing Systems, 2011.
Practical Bayesian Optimization of Machine Learning Algorithms, by Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Neural Information Processing Systems, 2012.
Sequential Model-Based Optimization for General Algorithm Configuration, by Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown. Learning and Intelligent Optimization, 2011.
Lazy paired hyper-parameter tuning, by Alice Zheng and Misha Bilenko. International Joint Conference on Artificial Intelligence, 2013.
Introduction to derivative-free optimization, by Andrew R. Conn, Katya Scheinberg and Luis N. Vincente. MPS-SIAM Series on Optimization, 2009.
Gradient-based Hyperparameter Optimization through Reversible Learning, by Dougal Maclaurin, David Duvenaud and Ryan P. Adams. Arxiv, 2015.
How to Evaluate Machine Learning Models, Part 4: Hyperparameter Tuning的更多相关文章
- 【机器学习Machine Learning】资料大全
昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...
- 机器学习(Machine Learning)&深度学习(Deep Learning)资料【转】
转自:机器学习(Machine Learning)&深度学习(Deep Learning)资料 <Brief History of Machine Learning> 介绍:这是一 ...
- 机器学习(Machine Learning)与深度学习(Deep Learning)资料汇总
<Brief History of Machine Learning> 介绍:这是一篇介绍机器学习历史的文章,介绍很全面,从感知机.神经网络.决策树.SVM.Adaboost到随机森林.D ...
- How do I learn machine learning?
https://www.quora.com/How-do-I-learn-machine-learning-1?redirected_qid=6578644 How Can I Learn X? ...
- 学习笔记之Machine Learning Crash Course | Google Developers
Machine Learning Crash Course | Google Developers https://developers.google.com/machine-learning/c ...
- [C2P2] Andrew Ng - Machine Learning
##Linear Regression with One Variable Linear regression predicts a real-valued output based on an in ...
- Why The Golden Age Of Machine Learning is Just Beginning
Why The Golden Age Of Machine Learning is Just Beginning Even though the buzz around neural networks ...
- Machine Learning - 第3周(Logistic Regression、Regularization)
Logistic regression is a method for classifying data into discrete outcomes. For example, we might u ...
- How to use data analysis for machine learning (example, part 1)
In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite ...
随机推荐
- mysql---增删用户
Mysql创建.删除用户 MySql中添加用户,新建数据库,用户授权,删除用户,修改密码(注意每行后边都跟个 ; 表示一个命令语句结束): 1.新建用户 登录MYSQL: @>mysql - ...
- Python学习笔记(二)--变量和数据类型
python中的数据类型 python中什么是变量 python中定义字符串 raw字符串与Unicode字符串 python中的整数和浮点数 python中的bool类型 --- python中的数 ...
- Node.js系列——(1)安装配置与基本使用
1.安装 进入下载地址 小编下载的是msi文件,下一步下一步傻瓜式安装. 打印个hello看看: 2.REPL 全称Read Eval Print Loop,即交互式解释器,可以执行读取.执行.打印. ...
- 解决svn "cannot set LC_CTYPE locale"的问题
解决svn "cannot set LC_CTYPE locale"的问题 在ubuntu 8.10下安装的svn,在将Ubuntu的语言修改为英文之后,出现错误警告: $ svn ...
- Java中int与String间的类型转换
int -> String int i=12345;String s=""; 除了直接调用i.toString();还有以下两种方法第一种方法:s=i+"" ...
- Java JVM- jstat查看jvm的GC情况[转]
ava通过jvm自己管理内存,同时Java提供了一些命令行工具,用于查看内存使用情况.这里主要介绍一下jstat.jmap命令以及相关工具. 一.jstat查看 gc实时执行情况 jstat命令命令格 ...
- 【Python】极简单的方式序列化sqlalchemy结果集为JSON
继承 json.JSONEncoder 实现一个针对sqlalchemy返回类型的处理方式. sqlalchemy的返回类型有大都有两种,一种是Model对象,一种是Query集合(只查询部分字段). ...
- ZOJ2686_Cycle Gameu
题目的意思是给你一个多边形,每条边上有一个权值,你开始在第一个点.每次你必须经过一条有权值的边,并且把该边的权值减小到任意一个非负值,到达该边的另外一个点. 谁第一个无法操作就算输. 题意很简单,解法 ...
- BZOJ 2424 订货(贪心+单调队列)
怎么题解都是用费用流做的啊...用单调队列多优美啊. 题意:某公司估计市场在第i个月对某产品的需求量为Ui,已知在第i月该产品的订货单价为di,上个月月底未销完的单位产品要付存贮费用m,假定第一月月初 ...
- P4005 小 Y 和地铁
题目描述 小 Y 是一个爱好旅行的 OIer.一天,她来到了一个新的城市.由于不熟悉那里的交通系统,她选择了坐地铁. 她发现每条地铁线路可以看成平面上的一条曲线,不同线路的交点处一定会设有 换乘站 . ...