Kaggle: House Prices: Advanced Regression Techniques
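Notebook source: https://www.kaggle.com/neviadomski/how-to-get-to-top-25-with-simple-model-sklearn
Workflow:
1. Load the data; inspect its structure and its missing values
Key idiom for summarizing missing values: NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis = 1, keys = ['train', 'test'])
NAs[NAs.sum(axis=1) > 0]
2. Preprocess the data (drop uninformative features, convert feature types, fill missing values, build new features, standardize numeric features, convert to dummies)
Q: Which features need to be converted?
A: Integer columns that only encode a category, where the number itself carries no meaning, should be converted to dummies
Pay particular attention to the manual dummy encoding (slightly more involved here, because one feature can span two columns, e.g. Condition1 and Condition2)
3. Shuffle the data and split it into training and test sets
4. Build several individual models
5. Blend the models
Open questions:
1. How do we decide that a feature is uninformative?
2. How should models be blended? Why is the final prediction here (np.exp(GB_model.predict(test_features)) + np.exp(ENS_model.predict(test_features_std))) / 2?
3. Why does a skewed label distribution need to be transformed?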
#Kaggle: House Prices: Advanced Regression Techniques
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, linear_model, tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import shuffle
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('downloads/train.csv')
test = pd.read_csv('downloads/test.csv')
train.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
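#Check for missing values
NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis = 1, keys = ['train', 'test']) #sum() defaults to axis=0, i.e. it sums down the rows and gives one count per column
NAs[NAs.sum(axis=1) > 0] #show only the features that have missing values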
| | train | test |
|---|---|---|
| Alley | 1369 | 1352.0 |
| BsmtCond | 37 | 45.0 |
| BsmtExposure | 38 | 44.0 |
| BsmtFinSF1 | 0 | 1.0 |
| BsmtFinSF2 | 0 | 1.0 |
| BsmtFinType1 | 37 | 42.0 |
| BsmtFinType2 | 38 | 42.0 |
| BsmtFullBath | 0 | 2.0 |
| BsmtHalfBath | 0 | 2.0 |
| BsmtQual | 37 | 44.0 |
| BsmtUnfSF | 0 | 1.0 |
| Electrical | 1 | 0.0 |
| Exterior1st | 0 | 1.0 |
| Exterior2nd | 0 | 1.0 |
| Fence | 1179 | 1169.0 |
| FireplaceQu | 690 | 730.0 |
| Functional | 0 | 2.0 |
| GarageArea | 0 | 1.0 |
| GarageCars | 0 | 1.0 |
| GarageCond | 81 | 78.0 |
| GarageFinish | 81 | 78.0 |
| GarageQual | 81 | 78.0 |
| GarageType | 81 | 76.0 |
| GarageYrBlt | 81 | 78.0 |
| KitchenQual | 0 | 1.0 |
| LotFrontage | 259 | 227.0 |
| MSZoning | 0 | 4.0 |
| MasVnrArea | 8 | 15.0 |
| MasVnrType | 8 | 16.0 |
| MiscFeature | 1406 | 1408.0 |
| PoolQC | 1453 | 1456.0 |
| SaleType | 0 | 1.0 |
| TotalBsmtSF | 0 | 1.0 |
| Utilities | 0 | 2.0 |
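The missing-value summary above can be tried on a pair of toy frames. This is a minimal sketch: `toy_train`, `toy_test` and their columns are made up for illustration and are not the Kaggle data. It also suggests why the `test` column in the table is shown as float: a column that exists only in one frame (as `SalePrice` does in train) produces a NaN after alignment, which forces a float dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; column 'c' exists only in the training frame.
toy_train = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                          'b': ['x', 'y', 'z'],
                          'c': [np.nan, np.nan, 1.0],
                          'd': [1, 2, 3]})
toy_test = pd.DataFrame({'a': [1.0, 2.0],
                         'b': [None, 'y'],
                         'd': [4, 5]})

# isnull().sum() counts missing values per column;
# concat with axis=1 aligns the two Series on the column names.
toy_nas = pd.concat([toy_train.isnull().sum(), toy_test.isnull().sum()],
                    axis=1, keys=['train', 'test'])

# Keep only the columns with at least one missing value in either frame.
toy_nas = toy_nas[toy_nas.sum(axis=1) > 0]
print(toy_nas)
```

Column `d` has no missing values in either frame, so it is filtered out of the summary.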
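#Print the R2 and RMSE scores
#Note: r2_score expects (y_true, y_pred); the notebook passes the prediction first,
#which leaves RMSE unchanged but alters R2. The order is kept so that the outputs below still match.
def print_score (prediction, labels):
    print('R2: {}'.format(r2_score(prediction, labels)))
    print('RMSE: {}'.format(np.sqrt(mean_squared_error(prediction, labels))))
#Evaluate a fitted model, printing its scores on the training split and on the held-out split
def train_test_score(estimator, x_train, x_test, y_train, y_test):
    train_predictions = estimator.predict(x_train)
    print('------------train-----------')
    print_score(train_predictions, y_train)
    print('------------test------------')
    test_predictions = estimator.predict(x_test)
    print_score(test_predictions, y_test)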
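One detail in the scoring helper is worth checking: scikit-learn documents `r2_score(y_true, y_pred)`, while the helper passes the prediction first. The sketch below uses a hand-written R² and RMSE (the names `r2` and `rmse` and the numbers are illustrative, not part of the notebook) to show that swapping the arguments changes R² but not RMSE.

```python
import numpy as np

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot,
    # where SS_tot is measured around the mean of y_true.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    # Root mean squared error; symmetric in its two arguments.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up labels and predictions; the predictions are shrunk toward the mean.
labels = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([1.5, 2.0, 3.0, 3.5])

print(r2(labels, preds), r2(preds, labels))      # the two R2 values differ
print(rmse(labels, preds), rmse(preds, labels))  # the two RMSE values are equal
```

Because predictions usually have a smaller spread than the labels, the swapped call tends to report a lower R², so the R² figures printed by the helper are probably slightly pessimistic.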
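#Separate the label from the training set
train_label = train.pop('SalePrice')
#Concatenate the training and test features so that both are processed together
features = pd.concat([train, test], keys = ['train', 'test'])
#Drop features considered uninformative (the source notebook does not explain why these were chosen)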
features.drop(['Utilities', 'RoofMatl', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'LowQualFinSF',
'BsmtFullBath', 'BsmtHalfBath', 'Functional', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal'],
axis=1, inplace=True)
print(features.shape)
(2919, 56)
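The notebook does not say how the dropped columns were chosen. One common heuristic, offered here purely as an assumption and not as the original author's reasoning, is to look at the share of the most frequent value in each column: a column that is almost constant carries little information. The frame `toy`, its values, and the 0.95 threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one nearly constant column and one varied column.
toy = pd.DataFrame({'Utilities': ['AllPub'] * 99 + ['NoSeWa'],
                    'Neighborhood': ['A', 'B', 'C', 'D'] * 25})

# Share of the most frequent value per column (NaN counted as its own value).
# value_counts sorts in descending order, so the first entry is the largest share.
top_share = toy.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])

# Flag columns whose dominant value exceeds an arbitrary threshold.
nearly_constant = top_share[top_share > 0.95].index.tolist()
print(top_share)
print(nearly_constant)
```

A check like this only covers one reason for dropping a feature; redundancy with other columns or a high missing rate would need separate checks.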
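#Convert some columns to str
#Q: which columns should be converted to str?
#A: integer columns whose values are only class codes with no numeric meaning; once they are str, they are turned into dummies later on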
features['MSSubClass'] = features['MSSubClass'].astype(str)
#pandas offers two equivalent ways to access a column, .feature and ['feature']; the next line uses .feature
features.OverallCond = features.OverallCond.astype(str)
features['KitchenAbvGr'] = features['KitchenAbvGr'].astype(str)
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)
#Fill missing values with the mode
features['MSZoning'] = features['MSZoning'].fillna(features['MSZoning'].mode()[0])
features['MasVnrType'] = features['MasVnrType'].fillna(features['MasVnrType'].mode()[0])
features['Electrical'] = features['Electrical'].fillna(features['Electrical'].mode()[0])
features['KitchenQual'] = features['KitchenQual'].fillna(features['KitchenQual'].mode()[0])
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
#Fill missing values with a specific value (the column mean for LotFrontage, a placeholder category or 0 elsewhere)
features['LotFrontage'] = features['LotFrontage'].fillna(features['LotFrontage'].mean())
features['Alley'] = features['Alley'].fillna('NOACCESS')
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    features[col] = features[col].fillna('NoBSMT')
features['TotalBsmtSF'] = features['TotalBsmtSF'].fillna(0)
features['FireplaceQu'] = features['FireplaceQu'].fillna('NoFP')
for col in ('GarageType', 'GarageFinish', 'GarageQual'):
    features[col] = features[col].fillna('NoGRG')
features['GarageCars'] = features['GarageCars'].fillna(0.0)
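The three fill strategies used above (mode, mean, placeholder category) can be compared side by side on a toy frame. This is only a sketch: the frame `toy` and its values are invented, and the column names are borrowed from the data set purely for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value pattern per strategy.
toy = pd.DataFrame({'MSZoning': ['RL', 'RM', None, 'RL'],
                    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
                    'Alley': [None, 'Grvl', None, None]})

# Mode: mode() returns a Series (there can be ties), hence the [0].
toy['MSZoning'] = toy['MSZoning'].fillna(toy['MSZoning'].mode()[0])
# Mean: computed over the non-missing entries only.
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].mean())
# Placeholder: here a missing value means "no alley", so it becomes its own category.
toy['Alley'] = toy['Alley'].fillna('NOACCESS')

print(toy)
```

The placeholder strategy fits columns where a missing entry means the house lacks the item in question, while mode or mean filling suits columns where the value exists but was not recorded.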
#Build a new feature
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1, inplace=True)
print(features.shape)
(2919, 54)
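#Look at the distribution of sale prices
ax = sns.distplot(train_label)
#The distribution is right-skewed (most of the mass on the left, long tail toward high prices), so apply a log transform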
train_label = np.log(train_label)
ax = sns.distplot(train_label)
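The effect of the log transform on skewness can be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for `SalePrice` (the parameters 12.0 and 0.4 are arbitrary assumptions), so the numbers say nothing about the real data set; it only illustrates the mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for SalePrice: strictly positive with a long right tail.
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))
log_prices = np.log(prices)

# Sample skewness: clearly positive before the transform, close to zero after it.
print(prices.skew(), log_prices.skew())

# np.exp inverts the transform, which is how log-scale predictions
# are mapped back to prices.
recovered = np.exp(log_prices)
```

A further reason to model the log of the price is that an error on the log scale corresponds to a relative error on the price, so expensive and cheap houses contribute comparably to the loss.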
#Standardize the numeric features
num_features = features.loc[:,['LotFrontage', 'LotArea', 'GrLivArea', 'TotalSF']]
num_features_standarized = (num_features - num_features.mean()) / num_features.std()
num_features_standarized.head()
| | | LotFrontage | LotArea | GrLivArea | TotalSF |
|---|---|---|---|---|---|
| train | 0 | -0.202033 | -0.217841 | 0.413476 | 0.022999 |
| | 1 | 0.501785 | -0.072032 | -0.471810 | -0.029167 |
| | 2 | -0.061269 | 0.137173 | 0.563659 | 0.196886 |
| | 3 | -0.436639 | -0.078371 | 0.427309 | -0.092511 |
| | 4 | 0.689469 | 0.518814 | 1.377806 | 0.988072 |
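The standardization formula can be sanity-checked on a small frame: after subtracting the mean and dividing by the standard deviation, each column should have mean 0 and standard deviation 1. The frame `toy` and its numbers are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features.
toy = pd.DataFrame({'LotArea': [8000.0, 9500.0, 11000.0, 15000.0],
                    'TotalSF': [1800.0, 2500.0, 2600.0, 3300.0]})

# Same formula as in the notebook; pandas std() is the sample
# standard deviation (ddof=1).
toy_std = (toy - toy.mean()) / toy.std()

print(toy_std.mean())  # approximately 0 for every column
print(toy_std.std())   # approximately 1 for every column
```

Note that the notebook computes the mean and standard deviation over the concatenated train and test rows, so both sets are scaled with the same statistics.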
ax = sns.pairplot(num_features_standarized)
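#Key step
#convert categorical data to dummies
#Collect every distinct condition value from both columns in a set
conditions = set([x for x in features['Condition1']] + [x for x in features['Condition2']])
#Build the dummy frame by hand: one row per sample, one column per distinct condition
#(the set is converted to a list because recent pandas versions may reject a set for columns)
dummies = pd.DataFrame(data = np.zeros((len(features.index), len(conditions))), index = features.index, columns = list(conditions))
#Walk over all samples and mark the conditions each one has
for i, cond in enumerate(zip(features['Condition1'], features['Condition2'])):
    #cond holds the values of both Condition1 and Condition2, so up to two cells of row i are set at once;
    #the original used .ix, which was removed in pandas 1.0, so positional .iloc with the column positions is used instead
    dummies.iloc[i, dummies.columns.get_indexer(list(cond))] = 1
#Append the dummy columns to features, adding a prefix to the dummy column names
features = pd.concat([features, dummies.add_prefix('Cond_')], axis = 1)
#The original Condition columns can now be dropped
features.drop(['Condition1', 'Condition2'], axis = 1, inplace =True)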
print(features.shape)
(2919, 61)
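The row-by-row loop makes the logic explicit, but the same encoding can be obtained without a Python loop. The following is a sketch of one possible alternative, not the notebook's method: each column is one-hot encoded separately with `pd.get_dummies`, and the two results are added and clipped at 1. The frame `toy` is invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one feature spread over two columns.
toy = pd.DataFrame({'Condition1': ['Norm', 'Feedr', 'Artery'],
                    'Condition2': ['Norm', 'Norm', 'RRAn']})

# One-hot encode each column separately (no prefix yet, so that
# the same category gets the same column name in both frames).
d1 = pd.get_dummies(toy['Condition1'], dtype=int)
d2 = pd.get_dummies(toy['Condition2'], dtype=int)

# Add the two frames; fill_value=0 covers categories seen in only one column.
# A category present in both columns would sum to 2, so clip at 1.
cond_dummies = (d1.add(d2, fill_value=0)
                  .clip(upper=1)
                  .astype(int)
                  .add_prefix('Cond_'))
print(cond_dummies)
```

For the first row both columns say `Norm`, so only `Cond_Norm` is set; for the second row both `Cond_Feedr` and `Cond_Norm` are set, matching what the loop produces.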
features.head()
| | | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | ... | TotalSF | Cond_PosA | Cond_Artery | Cond_PosN | Cond_RRAn | Cond_RRAe | Cond_Feedr | Cond_Norm | Cond_RRNn | Cond_RRNe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train | 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NOACCESS | Reg | Lvl | Inside | ... | 2566.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| | 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NOACCESS | Reg | Lvl | FR2 | ... | 2524.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| | 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NOACCESS | IR1 | Lvl | Inside | ... | 2706.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| | 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NOACCESS | IR1 | Lvl | Corner | ... | 2473.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| | 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NOACCESS | IR1 | Lvl | FR2 | ... | 3343.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 61 columns
#convert Exterior to dummies
Exterior = set([x for x in features['Exterior1st']] + [x for x in features['Exterior2nd']])
dummies = pd.DataFrame(data = np.zeros([len(features.index), len(Exterior)]), index = features.index, columns = list(Exterior))
for i, ext in enumerate(zip(features['Exterior1st'], features['Exterior2nd'])):
    #the original used .ix, which was removed in pandas 1.0; set each cell by position instead
    for e in ext:
        #missing values are skipped, so the nan column stays all zeros until it is dropped below
        if pd.notna(e):
            dummies.iat[i, dummies.columns.get_loc(e)] = 1
features = pd.concat([features, dummies.add_prefix('Ext_')], axis = 1)
features.drop(['Exterior1st', 'Exterior2nd', 'Ext_nan'], axis = 1, inplace = True)
print(features.shape)
(2919, 78)
features.dtypes[features.dtypes == 'object'].index
Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'HouseStyle',
'OverallCond', 'RoofStyle', 'MasVnrType', 'ExterQual', 'ExterCond',
'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenAbvGr',
'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish',
'GarageQual', 'PavedDrive', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition'],
dtype='object')
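#Idiom for looping over the columns of a given dtype: for col in features.dtypes[features.dtypes == 'object'].index
#convert all other categorical vars to dummies
for col in features.dtypes[features.dtypes == 'object'].index:
    for_dummy = features.pop(col)
    #dtype=np.uint8 keeps the dummies numeric; newer pandas versions default to bool, which select_dtypes(include=[np.number]) further down would drop
    features = pd.concat([features, pd.get_dummies(for_dummy, prefix = col, dtype = np.uint8)], axis = 1)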
print(features.shape)
(2919, 263)
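Whether the dummy columns survive the `select_dtypes(include=[np.number])` step used later in the notebook depends on their dtype. To my knowledge, recent pandas releases return boolean dummies by default, and boolean columns are not matched by `np.number`. The sketch below makes the dtype explicit in both directions so that the behavior does not depend on the installed version; the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one id column and one categorical column.
toy = pd.DataFrame({'Id': [1, 2, 3], 'Street': ['Pave', 'Grvl', 'Pave']})

# Same categorical column encoded with two explicit dtypes.
as_uint8 = pd.get_dummies(toy['Street'], prefix='Street', dtype=np.uint8)
as_bool = pd.get_dummies(toy['Street'], prefix='Street', dtype=bool)

# np.number matches integer and float columns, but not bool.
kept_uint8 = pd.concat([toy[['Id']], as_uint8], axis=1).select_dtypes(include=[np.number])
kept_bool = pd.concat([toy[['Id']], as_bool], axis=1).select_dtypes(include=[np.number])

print(kept_uint8.columns.tolist())  # Id plus both dummy columns
print(kept_bool.columns.tolist())   # only Id; the boolean dummies are filtered out
```

If the shapes printed after the train/test split are much smaller than the 262 columns reported in the notebook, this dtype difference is a likely cause.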
#Overwrite the numeric columns in a copy of features with the standardized values computed earlier
features_standardized = features.copy()
features_standardized.update(num_features_standarized)
#Split back into training and test sets
#First the features without standardization
train_features = features.loc['train'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
test_features = features.loc['test'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
#Then the standardized features
train_features_std = features_standardized.loc['train'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
test_features_std = features_standardized.loc['test'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
print(train_features.shape)
print(train_features_std.shape)
(1460, 262)
(1460, 262)
#shuffle train dataset
train_features_std, train_features, train_label = shuffle(train_features_std, train_features, train_label, random_state = 5)
#split train and test data
x_train, x_test, y_train, y_test = train_test_split(train_features, train_label, test_size = 0.1, random_state = 200)
x_train_std, x_test_std, y_train_std, y_test_std = train_test_split(train_features_std, train_label, test_size = 0.1, random_state = 200)
#First model: ElasticNet
ENSTest = linear_model.ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10], l1_ratio=[.01, .1, .5, .9, .99], max_iter=5000).fit(x_train_std, y_train_std)
train_test_score(ENSTest, x_train_std, x_test_std, y_train_std, y_test_std)
------------train-----------
R2: 0.9009283127352861
RMSE: 0.11921419084690392
------------test------------
R2: 0.8967299522701895
RMSE: 0.11097042840114624
#Cross-validated score of the model (for a regressor, cross_val_score reports R2 by default)
score = cross_val_score(ENSTest, train_features_std, train_label, cv = 5)
print('Accuracy: %0.2f +/- %0.2f' % (score.mean(), score.std()*2))
Accuracy: 0.88 +/- 0.10
#Second model: GradientBoosting
GB = ensemble.GradientBoostingRegressor(n_estimators=3000, learning_rate = 0.05, max_depth = 3, max_features = 'sqrt',
min_samples_leaf = 15,
min_samples_split = 10, loss = 'huber').fit(x_train_std, y_train_std)
train_test_score(GB, x_train_std, x_test_std, y_train_std, y_test_std)
------------train-----------
R2: 0.9607778449577035
RMSE: 0.07698826081848897
------------test------------
R2: 0.9002871760789876
RMSE: 0.10793269100940146
#Cross-validated score of the second model (for a regressor, cross_val_score reports R2 by default)
score = cross_val_score(GB, train_features_std, train_label, cv = 5)
print('Accuracy: %0.2f +/- %0.2f' % (score.mean(), score.std()*2))
Accuracy: 0.90 +/- 0.04
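#Blend the models
GB_model = GB.fit(train_features, train_label)
ENS_model = ENSTest.fit(train_features_std, train_label)
#Why is the blend written this way? Both models were trained on log(SalePrice), so np.exp maps each prediction back to a price, and the two price predictions are then averaged
Final_score = (np.exp(GB_model.predict(test_features)) + np.exp(ENS_model.predict(test_features_std))) / 2
#Write the submission to a csv file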
pd.DataFrame({'Id':test.Id, 'SalePrice':Final_score}).to_csv('submit.csv', index=False)
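The blend above averages the two models in price space: each log-scale prediction is passed through `np.exp` first, and the results are averaged. A possible alternative, shown here only for comparison and not used in the notebook, is to average on the log scale and exponentiate afterwards, which amounts to a geometric mean of the prices. The prediction arrays below are invented.

```python
import numpy as np

# Hypothetical log-scale predictions from two models for three houses.
log_pred_a = np.log(np.array([200000.0, 150000.0, 320000.0]))
log_pred_b = np.log(np.array([210000.0, 140000.0, 300000.0]))

# Notebook approach: back-transform each prediction, then take the arithmetic mean.
blend_price_space = (np.exp(log_pred_a) + np.exp(log_pred_b)) / 2

# Alternative: average the log predictions, then back-transform.
# This equals the geometric mean of the two price predictions.
blend_log_space = np.exp((log_pred_a + log_pred_b) / 2)

print(blend_price_space)
print(blend_log_space)
```

The geometric mean never exceeds the arithmetic mean, so the two blends differ slightly whenever the models disagree; which one scores better on the leaderboard would have to be tested.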