分享stackexchange的一篇问答:https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation

Q: Is it always a good idea to train with the full dataset after cross-validation? Put it another way, is it ok to train with all the samples in my dataset and not being able to check if this particular fittingoverfits?

A: The way to think of cross-validation is as estimating the performance obtained using a method for building a model, rather than for estimating the performance of a model.

If you use cross-validation to estimate the hyperparameters of a model and then use those hyper-parameters to fit a model to the whole dataset, then that is fine, provided that you recognise that the cross-validation estimate of performance is likely to be (possibly substantially) optimistically biased. This is because part of the model (the hyper-parameters) have been selected to minimise the cross-validation performance, so if the cross-validation statistic has a non-zero variance (and it will) there is the possibility of over-fitting the model selection criterion.

If you want to choose the hyper-parameters and estimate the performance of the resulting model then you need to perform a nested cross-validation, where the outer cross-validation is used to assess the performance of the model, and in each fold cross-validation is used to determine the hyper-parameters separately in each fold. You build the final model by using cross-validation on the whole set to choose the hyper-parameters and then build the classifier on the whole dataset using the optimized hyper-parameters.

This is of course computationally expensive, but worth it as the bias introduced by improper performance estimation can be large. See my paper

G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, 2010. Research, vol. 11, pp. 2079-2107, July 2010. (wwwpdf)

However, it is still possible to have over-fitting in model selection (nested cross-validation just allows you to test for it). A method I have found useful is to add a regularisation term to the cross-validation error that penalises hyper-parameter values likely to result in overly-complex models, see

G. C. Cawley and N. L. C. Talbot, Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters, Journal of Machine Learning Research, volume 8, pages 841-861, April 2007. (www,pdf)

So the answers to your question are (i) yes, you should use the full dataset to produce your final model as the more data you use the more likely it is to generalise well but (ii) make sure you obtain an unbiased performance estimate via nested cross-validation and potentially consider penalising the cross-validation statistic to further avoid over-fitting in model selection.

交叉验证的目的可以理解为是为了估算建模方法的性能,而不是具体模型的性能。

如果使用交叉验证来选择超参数,那使用选取出的超参数在全部数据上拟合模型,这是对的。但是需要注意的是,使用这种交叉验证方法得出的性能估计是很有可能有偏差的。这是因为被选出模型的超参数是通过最小化交叉验证的性能而选出来的,这种情况下,交叉验证的性能用于衡量模型的泛化误差,不够准确。

如果既需要选择超参数,又需要估算选出模型的性能,可以选择Nested Cross-Validation。Nested Cross-Validation中的外层交叉验证用于估算模型性能,内层交叉验证用于选择超参数。最后,基于选出的超参数和全部数据集,产生最终的模型。

尽管这样,还是有可能在模型选择阶段存在过拟合(Nested Cross-Validation只是允许你可以对这种情况进行测试,如何测?)。一种解决方法是在cross-validation error中加入正则项,用于惩罚易产生过度复杂模型的超参数。

总结,(1)最终的模型应该使用全部数据集来建模,因为越多的数据,模型泛化能力越好;(2)需要确认性能估计得无偏的,Nested Cross-Validation和加惩罚项是解决性能估计出现偏差的方法。

Cross-Validation & Nested Cross-Validation的更多相关文章

  1. JSR303/JSR-349,hibernate validation,spring validation 之间的关系

    JSR303是一项标准,JSR-349是其的升级版本,添加了一些新特性,他们规定一些校验规范即校验注解,如@Null,@NotNull,@Pattern,他们位于javax.validation.co ...

  2. CROSS APPLY AND CROSS APPLY

    随着业务千奇百怪,DBA数据库设计各有不同,一对多关系存JSON或字符串逗号分隔... 今天小编给大家分享一下针对这个问题的解决办法 问题一.存储过程接受参数格式为XXX,XXX 解决办法:将字符转成 ...

  3. ssis error at other ssis.pipeline "ole db destination" failed validation and returned validation status

    我在修改一个ssis的包,发现这个destination的表被改过了.所以就重建了表.就导致了这个错误. 打开包重新检查下表结构的匹配就好了

  4. Cross Validation done wrong

    Cross Validation done wrong Cross validation is an essential tool in statistical learning 1 to estim ...

  5. [机器学习] 训练集(train set) 验证集(validation set) 测试集(test set)

    在有监督(supervise)的机器学习中,数据集常被分成2~3个即: 训练集(train set) 验证集(validation set) 测试集(test set) 一般需要将样本分成独立的三部分 ...

  6. AI---训练集(train set) 验证集(validation set) 测试集(test set)

    在有监督(supervise)的机器学习中,数据集常被分成2~3个即: 训练集(train set) 验证集(validation set) 测试集(test set) 一般需要将样本分成独立的三部分 ...

  7. Error creating bean with name 'org.springframework.validation.beanvalidation.LocalValidatorFactory

    Error creating bean with name ‘org.springframework.validation.beanvalidation.LocalValidatorFactoryBe ...

  8. <转>SQL Server CROSS APPLY and OUTER APPLY

    Problem SQL Server 2005 introduced the APPLY operator, which is like a join clause and it allows joi ...

  9. Caused by: java.lang.NoClassDefFoundError: javax/validation/ParameterNameProvider

    问题现象:今天部署代码的时候发现,在beta环境可以正常部署,但是到了test环境就一直不成功,我以为是环境问题,就重新部署,但是没效,看了看日志发现问题是:Caused by: java.lang. ...

  10. MVC学习系列12---验证系列之Fluent Validation

    前面两篇文章学习到了,服务端验证,和客户端的验证,但大家有没有发现,这两种验证各自都有弊端,服务器端的验证,验证的逻辑和代码的逻辑混合在一起了,如果代码量很大的话,以后维护扩展起来,就不是很方便.而客 ...

随机推荐

  1. python 科学计算及数据可视化

    第一步:利用python,画散点图. 第二步:需要用到的库有numpy,matplotlib的子库matplotlib.pyplot numpy(Numerical Python extensions ...

  2. 【记录】VMware解决网络找不到服务器的问题

    本想在虚拟机上的Linux上练习安装Mysql8.0版本的,网络连不上的问题卡了N天简直 1. 点击虚拟机右键设置,虚拟机默认设置为NAT模式,这里无需修改. 2. 点击编辑,虚拟网络设置,勾选主机连 ...

  3. DHCP的IP地址租约、释放

    转自:https://blog.csdn.net/wangdk789/article/details/27052505 当DHCP客户端获取到一个IP地址后,并不代表可以永久使用这个地址,而是有一个使 ...

  4. CentOS下使用tcpdump网络抓包

    tcpdump是Linux下的截获分析网络数据包的工具,对优化系统性能有很大参考价值. 安装 tcpdump不是默认安装的,在CentOS下安装: yum install tcpdump 在Ubunt ...

  5. LGOJ P3834 【模板】可持久化线段树 1(主席树)

    代码 #include <cstdio> #include <iostream> #include <algorithm> using namespace std; ...

  6. CF Round #551 (Div. 2) D

    CF Round #551 (Div. 2) D 链接 https://codeforces.com/contest/1153/problem/D 思路 不考虑赋值和贪心,考虑排名. 设\(dp_i\ ...

  7. 【android】安卓手机连接电脑了,但是adb devices找不到设备及找到设备但无权限的问题

    安卓手机连接电脑的时候,会遇到adb连接失败,adb devices为空,或者连接成功,但是显示unauthorized的情况.遇到这种情况,一般认为是手机驱动安装失败,会选择重新下载安装驱动,如果还 ...

  8. stm32库函数建工程和使用Keil自带库建工程有没有区别?发现了同样的程序在两种情况下keil自带库可以运行的情况,不知是什么原因

    我使用库函数建的工程(非Keil自带库),为了实现SPI对Si24r1芯片数据的读写,以验证stm32是否可以和si24r1能够正常通信,发现使用库函数建的工程程序不能通过,读出来的数据和写的数据不一 ...

  9. speech

    1.李开复:一个人的成功,15%靠专业知识,其余15%人际沟通,公众演讲,以及影响他人的能力 2.演讲是一门遗憾的艺术 3.没有准备就等于准备失败 4.追求完美,就是在追求完蛋 5.宁可千日无机会,不 ...

  10. vs2013编译obs源码

    obs源码下载 一种是在GitHub上下载最新的代码 git clone --recursive https://github.com/jp9000/obs-studio.git --recursiv ...