Bayesian optimisation for smart hyperparameter search
Fitting a single classifier does not take long; fitting hundreds takes a while. To find the best hyperparameters you need to fit a lot of classifiers. What to do?
This post explores the inner workings of an algorithm you can use to reduce the number of hyperparameter sets you need to try before finding the best set. The algorithm goes under the name of Bayesian optimisation. If you are looking for a production-ready implementation, check out MOE, the Metric Optimization Engine developed by Yelp.
Gaussian process regression is a useful tool in general and is used heavily here. Check out my post on Gaussian processes with george for a short introduction.
This post starts with an example where we know the true form of the scoring function. We then pit random grid search against Bayesian optimisation to find the best hyper-parameter for a real model.
As usual first some setup and importing:
%matplotlib inline
import random
import numpy as np
np.random.seed(9)
from scipy.stats import randint as sp_randint
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context("talk")
By George!
Bayesian optimisation uses Gaussian processes to fit a regression model to the previously evaluated points in hyper-parameter space. This model is then used to suggest the next (best) point in hyper-parameter space at which to evaluate the classifier.
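To make the regression step less of a black box, here is a minimal NumPy-only sketch of what a Gaussian process fit boils down to. It is not how george implements it: the function names, the zero prior mean, the unit prior variance and the length-scale parametrisation are all our own assumptions and are not meant to match the argument of george's ExpSquaredKernel.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / length_scale**2)

def gp_posterior(x_train, y_train, noise_std, x_test, length_scale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    # Covariance of the observed points, with measurement noise on the diagonal
    K = sq_exp_kernel(x_train, x_train, length_scale) + np.diag(noise_std**2)
    # Covariance between the test points and the observed points
    K_s = sq_exp_kernel(x_test, x_train, length_scale)
    K_ss = sq_exp_kernel(x_test, x_test, length_scale)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    # Clip tiny negative variances caused by round-off
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Two made-up observations (hypothetical values, for illustration only)
x_train = np.array([2.0, 7.0])
y_train = np.array([-1.8, -4.6])
noise_std = np.array([0.2, 0.2])

x_test = np.array([2.0, 4.5, 7.0])
mean, std = gp_posterior(x_train, y_train, noise_std, x_test)
print(mean)
print(std)
```

The predicted uncertainty is small at the two observed points and grows back towards the prior width in between. That growing uncertainty is what the acquisition criterion exploits.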
To choose the best point we need to define a criterion; in this case we use "expected improvement". As we only know the score to within a certain precision, we do not want to simply choose the point with the best predicted score. Instead we pick the point which promises the largest expected improvement. This allows us to incorporate the uncertainty about our estimate of the scoring function into the procedure. It leads to a mixture of exploitation and exploration of the parameter space.
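For a Gaussian prediction the expected improvement has a closed form. The sketch below uses only the standard library and the minimisation convention (smaller is better). The function name and the two candidate points are made up for illustration.

```python
import math

def expected_improvement_min(mu, sigma, best):
    """Closed-form EI when minimising, for a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty left: the improvement is known exactly
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf

best_so_far = -2.0

# Candidate A: predicted slightly better than the best sample, very certain
ei_safe = expected_improvement_min(mu=-2.1, sigma=0.05, best=best_so_far)
# Candidate B: predicted worse than the best sample, but very uncertain
ei_risky = expected_improvement_min(mu=-1.5, sigma=2.0, best=best_so_far)

print("EI (certain, slightly better): %.4f" % ei_safe)
print("EI (uncertain, predicted worse): %.4f" % ei_risky)
```

The uncertain candidate ends up with the larger expected improvement even though its predicted score is worse. This is the exploration half of the trade-off: a wide error band leaves room for a large gain.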
Below we set up a toy scoring function (−x sin x), sample two points from it, and fit our Gaussian process model to them.
import george
from george.kernels import ExpSquaredKernel

score_func = lambda x: -x*np.sin(x)
x = np.arange(0, 10, 0.1)
# Generate some fake, noisy data. These represent
# the points in hyper-parameter space for which
# we already trained our classifier and evaluated its score
xp = 10 * np.sort(np.random.rand(2))
yerr = 0.2 * np.ones_like(xp)
yp = score_func(xp) + yerr * np.random.randn(len(xp))
# Set up a Gaussian process
kernel = ExpSquaredKernel(1)
gp = george.GP(kernel)
gp.compute(xp, yerr)
mu, cov = gp.predict(yp, x)
std = np.sqrt(np.diag(cov))

def basic_plot():
    fig, ax = plt.subplots()
    ax.plot(x, mu, label="GP median")
    ax.fill_between(x, mu-std, mu+std, alpha=0.5)
    ax.plot(x, score_func(x), '--', label=" True score function (unknown)")
    # explicit zorder to draw points and errorbars on top of everything
    ax.errorbar(xp, yp, yerr=yerr, fmt='ok', zorder=3, label="samples")
    ax.set_ylim(-9,6)
    ax.set_ylabel("score")
    ax.set_xlabel('hyper-parameter X')
    ax.legend(loc='best')
    return fig,ax

basic_plot()
The dashed green line represents the true value of the scoring function as a function of our hypothetical hyper-parameter X. The black dots (and their errorbars) represent points at which we evaluated our classifier and calculated the score. The solid blue line is our regression model's prediction of the score function. The shaded area represents the uncertainty on that predicted (median) value.
Next let's calculate the expected improvement at every value of the hyper-parameter X. We also build a multistart optimisation routine (next_sample) which uses the expected improvement to suggest which point to sample next.
from scipy.optimize import minimize
from scipy import stats
def expected_improvement(points, gp, samples, bigger_better=False):
    # are we trying to maximise a score or minimise an error?
    if bigger_better:
        best_sample = samples[np.argmax(samples)]

        mu, cov = gp.predict(samples, points)
        sigma = np.sqrt(cov.diagonal())

        Z = (mu-best_sample)/sigma

        ei = ((mu-best_sample) * stats.norm.cdf(Z) + sigma*stats.norm.pdf(Z))

        # want to use this as objective function in a minimiser so multiply by -1
        return -ei

    else:
        best_sample = samples[np.argmin(samples)]

        mu, cov = gp.predict(samples, points)
        sigma = np.sqrt(cov.diagonal())

        Z = (best_sample-mu)/sigma

        ei = ((best_sample-mu) * stats.norm.cdf(Z) + sigma*stats.norm.pdf(Z))

        # want to use this as objective function in a minimiser so multiply by -1
        return -ei

def next_sample(gp, samples, bounds=(0,10), bigger_better=False):
    """Find point with largest expected improvement"""
    best_x = None
    best_ei = 0
    # EI is zero at most values -> often get trapped
    # in a local maximum -> multistarting to increase
    # our chances to find the global maximum
    for rand_x in np.random.uniform(bounds[0], bounds[1], size=30):
        res = minimize(expected_improvement, rand_x,
                       bounds=[bounds],
                       method='L-BFGS-B',
                       args=(gp, samples, bigger_better))
        if res.fun < best_ei:
            best_ei = res.fun
            best_x = res.x[0]

    return best_x

fig, ax = basic_plot()
# expected improvement would need its own y axis, so just multiply by ten
ax.plot(x, 10*np.abs(expected_improvement(x, gp, yp)),
        label='expected improvement')
ax.legend(loc='best')
print("The algorithm suggests sampling at X=%.4f" % next_sample(gp, yp))
The algorithm suggests sampling at X=1.5833
The red line shows the expected improvement. Comparing the solid blue line and the shaded area with where the expected improvement is largest, it makes sense that the optimisation suggests X=1.58 as the next point at which to evaluate our scoring function.
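The multistart trick used in next_sample can be seen in isolation on the toy scoring function itself, without any Gaussian process. This is a sketch under our own assumptions: the helper names, the seed and the number of starts are arbitrary, and a single local optimiser run would only find whichever minimum is nearest to its starting point.

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    # minimize passes a length-1 array; work with a plain float
    x0 = float(np.ravel(x)[0])
    return -x0 * np.sin(x0)

def multistart_minimise(func, bounds, n_starts=10, seed=0):
    """Run a bounded local optimiser from several random starts, keep the best."""
    rng = np.random.RandomState(seed)
    best_x, best_f = None, np.inf
    for start in rng.uniform(bounds[0], bounds[1], size=n_starts):
        res = minimize(func, x0=[start], bounds=[bounds], method="L-BFGS-B")
        if res.fun < best_f:
            best_x, best_f = float(res.x[0]), float(res.fun)
    return best_x, best_f

best_x, best_f = multistart_minimise(objective, (0.0, 10.0))
print("best x = %.3f, value = %.3f" % (best_x, best_f))
```

On the interval from 0 to 10 the function −x sin x has a shallow minimum near X=2 and a deeper one near X=8. Starts below roughly X=4.9 slide into the shallow one, which is why several starts are needed to find the deeper one reliably.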
This concludes the toy example part. Let's get moving with something real!
Random Grid Search as Benchmark
To make this more interesting than a complete toy example, let's use a regression problem (Friedman1) and a single DecisionTreeRegressor, even though it is fairly fast to fit lots of models on this dataset. Replace both with the setup for your actual problem.
To judge how much more quickly we find the best set of hyperparameters we will pit Bayesian optimisation against random grid search. Random grid search is already a big improvement over an exhaustive grid search. I have taken the particular regression problem from Gilles Louppe's PhD thesis: Understanding Random Forests: From Theory to Practice.
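Stripped of cross-validation, random search is a very small algorithm. The sketch below uses only the standard library. The score curve is a made-up stand-in for a cross-validated score (peaked at 20 purely for illustration), and the function names are ours.

```python
import random

def toy_score(min_samples_split):
    """Hypothetical score curve, peaked at 20; stands in for a CV score."""
    return -((min_samples_split - 20) / 40.0) ** 2

def random_search(score, low, high, n_iter, seed=0):
    """Evaluate n_iter uniformly drawn integer settings and return the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_iter):
        value = rng.randint(low, high)  # inclusive on both ends
        trials.append((value, score(value)))
    best = max(trials, key=lambda t: t[1])
    return best, trials

best, trials = random_search(toy_score, 1, 100, n_iter=8)
print("tried:", sorted(v for v, _ in trials))
print("best setting: %d (score %.3f)" % best)
```

Every draw ignores what the earlier draws found, so the budget is spread evenly over the whole range. An exhaustive grid over the same range would need 100 evaluations instead of eight. Note that random.randint includes the upper bound, whereas scipy.stats.randint excludes it, which is why the real search further down uses sp_randint(1, 101).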
# Note: sklearn.grid_search is the module name in scikit-learn before 0.18;
# later releases moved these classes to sklearn.model_selection
from sklearn.grid_search import GridSearchCV
from sklearn.grid_search import RandomizedSearchCV
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from operator import itemgetter

# Load the data
X, y = make_friedman1(n_samples=5000)

clf = DecisionTreeRegressor()
param_dist = {"min_samples_split": sp_randint(1, 101),
              }

# run randomized search
n_iterations = 8
random_grid = RandomizedSearchCV(clf,
                                 param_distributions=param_dist,
                                 n_iter=n_iterations,
                                 scoring='mean_squared_error')
random_grid = random_grid.fit(X, y)

from scipy.stats import sem

params_ = []
scores_ = []
yerr_ = []
for g in random_grid.grid_scores_:
    # only one hyper-parameter is searched, so take its value
    params_.append(list(g.parameters.values())[0])
    scores_.append(g.mean_validation_score)
    yerr_.append(sem(g.cv_validation_scores))

fig, ax = plt.subplots()
ax.errorbar(params_, scores_, yerr=yerr_, fmt='ok', label='samples')
ax.set_ylabel("score")
ax.set_xlabel('min samples split')
ax.legend(loc='best')
With eight evaluations we get a fairly good idea of what the score function looks like for this problem. The optimum might be as low as 1, and away from the peak the score falls off steeply. The best hyper-parameter setting found in this run is eight.
You can see that the search explores all values of min_samples_split with equal probability.
def top_parameters(random_grid_cv):
    # grid_scores_ entries are tuples; index 1 is the mean validation score
    top_score = sorted(random_grid_cv.grid_scores_,
                       key=itemgetter(1),
                       reverse=True)[0]
    print("Mean validation score: {0:.3f} +- {1:.3f}".format(
        top_score.mean_validation_score,
        np.std(top_score.cv_validation_scores)))
    print(random_grid_cv.best_params_)

top_parameters(random_grid)
Mean validation score: -4.322 +- 0.127
{'min_samples_split': 8}
The top-scoring parameter is around eight. Let's see what we can do with a Bayesian approach.
Bayesian optimisation
Do you have your priors ready? Let's get Bayesian! The question is: can we find at least as good a value for min_samples_split, or a better one, in eight or fewer attempts at training a model?
To get things started we evaluate the model at three values of the hyper-parameter. These are used for the first fit of our Gaussian process model. The next point at which to evaluate the model is then the point where the expected improvement is largest.
The two plots below show the state of the Bayesian optimisation after the first three points are tried, and then after the five points chosen according to the expected improvement.
from sklearn.cross_validation import cross_val_score

def plot_optimisation(gp, x, params, scores, yerr):
    mu, cov = gp.predict(scores, x)
    std = np.sqrt(np.diag(cov))

    fig, ax = plt.subplots()
    ax.plot(x, mu, label="GP median")
    ax.fill_between(x, mu-std, mu+std, alpha=0.5)

    ax_r = ax.twinx()
    ax_r.grid(False)
    ax_r.plot(x,
              np.abs(expected_improvement(x, gp, scores, bigger_better=True)),
              label='expected improvement',
              c=sns.color_palette()[2])
    ax_r.set_ylabel("expected improvement")

    # explicit zorder to draw points and errorbars on top of everything
    ax.errorbar(params, scores, yerr=yerr,
                fmt='ok', zorder=3, label='samples')
    ax.set_ylabel("score")
    ax.set_xlabel('min samples split')
    ax.legend(loc='best')
    return gp

def bayes_optimise(clf, X,y, parameter, n_iterations, bounds):
    x = range(bounds[0], bounds[1]+1)

    params = []
    scores = []
    yerr = []

    # three evenly spaced starting points for the first GP fit
    for param in np.linspace(bounds[0], bounds[1], 3, dtype=int):
        clf.set_params(**{parameter: param})
        cv_scores = cross_val_score(clf, X,y, scoring='mean_squared_error')
        params.append(param)
        scores.append(np.mean(cv_scores))
        yerr.append(sem(cv_scores))

    # Some cheating here, tuning the GP hyperparameters is something
    # we skip in this post
    kernel = ExpSquaredKernel(1000)
    gp = george.GP(kernel, mean=np.mean(scores))
    gp.compute(params, yerr)

    plot_optimisation(gp, x, params, scores, yerr)

    # remaining evaluations are chosen by expected improvement
    for n in range(n_iterations-3):
        gp.compute(params, yerr)
        param = next_sample(gp, scores, bounds=bounds, bigger_better=True)

        clf.set_params(**{parameter: param})
        cv_scores = cross_val_score(clf, X,y, scoring='mean_squared_error')
        params.append(param)
        scores.append(np.mean(cv_scores))
        yerr.append(sem(cv_scores))

    # refit so the GP includes the last sample before the final plot
    gp.compute(params, yerr)
    plot_optimisation(gp, x, params, scores, yerr)
    return params, scores, yerr, clf

params, scores, yerr, clf = bayes_optimise(DecisionTreeRegressor(),
                                           X,y,
                                           'min_samples_split',
                                           8, (1,100))
print("Best parameter:")
print("{0} scores {1}".format(params[np.argmax(scores)],
                              scores[np.argmax(scores)]))
Best parameter:
17.721702255 scores -4.29572348033
You can see that the points are all sampled close to the maximum. Whereas the random grid search samples points far away from the peak (40 and beyond), the Bayesian optimisation concentrates on the region close to the maximum (around 20). This vastly improves the efficiency of finding the true maximum. We could even have stopped before evaluating all of the next five points, as they are all pretty close to each other.
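For readers without george or an old scikit-learn at hand, the whole loop fits in a few dozen lines of NumPy and SciPy. This is a rough sketch and not the code used above: it minimises the noise-free toy function −x sin x, picks the next point from a fixed grid instead of running a multistart optimiser, and the kernel amplitude, length scale and jitter are values we chose by hand.

```python
import numpy as np
from scipy.stats import norm

AMP = 4.0  # assumed prior standard deviation of the score (hand-picked)

def sq_exp(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return AMP**2 * np.exp(-0.5 * d**2 / length_scale**2)

def gp_fit_predict(x_obs, y_obs, x_grid, jitter=1e-6):
    """GP posterior mean and std on x_grid; the prior mean is the sample mean."""
    offset = y_obs.mean()
    K = sq_exp(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = sq_exp(x_grid, x_obs)
    mu = offset + K_s @ np.linalg.solve(K, y_obs - offset)
    var = AMP**2 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def ei_minimise(mu, sigma, best):
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(func, low, high, n_init=3, n_total=8):
    grid = np.linspace(low, high, 201)
    xs = list(np.linspace(low, high, n_init))  # evenly spaced starting points
    ys = [func(x) for x in xs]
    for _ in range(n_total - n_init):
        mu, sigma = gp_fit_predict(np.array(xs), np.array(ys), grid)
        ei = ei_minimise(mu, sigma, min(ys))
        x_next = float(grid[np.argmax(ei)])  # most promising grid point
        xs.append(x_next)
        ys.append(func(x_next))
    return xs, ys

xs, ys = bayes_opt(lambda x: -x * np.sin(x), 0.0, 10.0)
best = int(np.argmin(ys))
print("sampled at:", np.round(xs, 2))
print("best x = %.2f, score = %.3f" % (xs[best], ys[best]))
```

With only five guided evaluations there is no guarantee that the loop lands on the global optimum, and the result depends on the hand-picked kernel settings, just like the "cheating" with ExpSquaredKernel(1000) above.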
The real deal --- MOE
While it is quite straightforward to build yourself a small Bayesian optimisation procedure, I would recommend you check out MOE. This is a production-quality setup for doing global, black-box optimisation. It is developed by the good guys at Yelp and is therefore much more robust than our home-made solution.
Conclusions
Bayesian optimisation is not scary. With the two examples here you should be convinced that using a smart approach like this is faster than a random grid search (especially in higher dimensions) and that there is nothing magic going on.
If you find a mistake or want to tell me something else, get in touch on twitter @betatim.
This post started life as an IPython notebook; download it or view it online.