Kaggle Cross-Validation
The Cross-Validation Procedure
In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments. We divide the data into 5 pieces, each being 20% of the full dataset.

We run an experiment called experiment 1 which uses the first fold as a holdout set, and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we got from using the simple train-test split.
We then run a second experiment, where we hold out data from the second fold (using everything except the 2nd fold for training the model). This gives us a second estimate of model quality.
We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point.
Returning to our example above from train-test split, if we have 5000 rows of data, we end up with a measure of model quality based on 5000 rows of holdout (even if we don't use all 5000 rows simultaneously).
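To make the procedure concrete, here is a minimal sketch using scikit-learn's KFold on synthetic data (cross_val_score, used in the code further down, performs this splitting for you); each of the 5 folds takes one turn as the holdout set:
# A sketch of the 5-fold procedure itself, on synthetic data
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 rows of made-up features
y_demo = np.arange(10)                 # 10 made-up targets

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, holdout_idx) in enumerate(kf.split(X_demo), start=1):
    # Experiment i trains on 4 folds and measures quality on the remaining fold
    print('Experiment %d: train on rows %s, hold out rows %s'
          % (i, train_idx, holdout_idx))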
Trade-offs Between Cross-Validation and Train-Test Split
Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it takes longer to run, because it trains a separate model for each fold, so it is doing more total work.
Given these tradeoffs, when should you use each approach? On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation.
For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
There's no simple threshold for what constitutes a large vs. small dataset. If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment gives the same results, train-test split is probably sufficient.
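As a rough sketch of that check (using synthetic data and a simple model purely for illustration; the real pipeline for the Melbourne data is built in the next block), you can look at how much the per-fold scores differ relative to their mean:
# Sketch: if the per-fold scores barely differ, a single train-test split
# would likely have told you the same story (synthetic data for illustration)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_toy, y_toy = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
fold_mae = -cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                            scoring='neg_mean_absolute_error')
print('Per-fold MAE:', fold_mae.round(2))
print('Spread relative to the mean: %.1f%%'
      % (100 * (fold_mae.max() - fold_mae.min()) / fold_mae.mean()))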
# First we read the data
import pandas as pd

data = pd.read_csv('../input/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

# Then specify a pipeline of our modeling steps
# (It can be very difficult to do cross-validation properly if you aren't using pipelines)
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer  # see the note after this block for newer scikit-learn versions

my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())

# Finally get the cross-validation scores:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
print(scores)

# It is a little surprising that we specify negative mean absolute error in this case.
# Scikit-learn has a convention where all metrics are defined so a high number is better.
# Using negatives here allows them to be consistent with that convention, though
# negative MAE is almost unheard of elsewhere.
print('Mean Absolute Error %.2f' % (-1 * scores.mean()))
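One caveat on the imports above: newer scikit-learn releases removed Imputer from sklearn.preprocessing in favor of SimpleImputer in sklearn.impute. A sketch of the same steps on a current version (assuming the same X and y, with an explicit cv=5 to match the five experiments described earlier) would look like this:
# Sketch of the same pipeline on newer scikit-learn, where Imputer has been
# replaced by SimpleImputer (assumes X and y as defined above)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

new_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
new_scores = cross_val_score(new_pipeline, X, y, cv=5,
                             scoring='neg_mean_absolute_error')
print('Mean Absolute Error %.2f' % (-1 * new_scores.mean()))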