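Kaggle Competitions (1): Titanic: Machine Learning from Disaster
Titanic survival prediction was the first introductory Kaggle competition I tried as a complete beginner, mainly drawing on two tutorials.
The model's best score on the Leaderboard was 0.79904, which placed it in the top 13%.
I did this competition quite a while ago and have forgotten many details of the analysis. It was also my first attempt, so the whole thing is fairly crude. I'm writing it up today on a whim, as a simple record (more of a running log than a tutorial).
Import the required packages: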
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
Read the training and test sets and concatenate them so they can be processed together:
train_raw = pd.read_csv('datasets/train.csv')
test_raw = pd.read_csv('datasets/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train_test = pd.concat([train_raw, test_raw], ignore_index=True, sort=False)
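The title inside a passenger's name says something about sex, age, status, and social standing, so it is a feature worth keeping. We first extract the title from the Name field with a regular expression, then group the titles:
- Mr and Don denote men
- Miss, Ms, and Mlle denote unmarried women
- Mrs, Mme, Lady, and Dona denote married women
- Countess and Jonkheer are both titles of nobility
- Capt, Col, Dr, Major, and Sir are rare titles and are lumped together as "Other"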
# Raw string so that \w and \. are passed to the regex engine unchanged
train_test['Title'] = train_test['Name'].apply(lambda x: re.search(r'(\w+)\.', x).group(1))
# Assign the result back instead of calling replace(..., inplace=True) on a column
train_test['Title'] = train_test['Title'].replace(['Don'], 'Mr')
train_test['Title'] = train_test['Title'].replace(['Mlle', 'Ms'], 'Miss')
train_test['Title'] = train_test['Title'].replace(['Mme', 'Lady', 'Dona'], 'Mrs')
train_test['Title'] = train_test['Title'].replace(['Countess', 'Jonkheer'], 'Noble')
train_test['Title'] = train_test['Title'].replace(['Capt', 'Col', 'Dr', 'Major', 'Sir'], 'Other')
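To see what the extraction and grouping do on their own, here is a minimal sketch on a handful of names written in the same "Surname, Title. Given names" layout as the Name column. The lookup table `TITLE_GROUPS` is my own hypothetical name for the grouping listed above, and `str.extract` is just a vectorized alternative to `re.search`, not what the original notebook used.

```python
import pandas as pd

# A few names in the layout of the Name column (illustrative sample only).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Aubart, Mme. Leontine Pauline",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Minahan, Dr. William Edward",
])

# Hypothetical lookup table mirroring the grouping described in the list above.
TITLE_GROUPS = {
    "Don": "Mr",
    "Mlle": "Miss", "Ms": "Miss",
    "Mme": "Mrs", "Lady": "Mrs", "Dona": "Mrs",
    "Countess": "Noble", "Jonkheer": "Noble",
    "Capt": "Other", "Col": "Other", "Dr": "Other", "Major": "Other", "Sir": "Other",
}

# The first run of word characters followed by a literal dot is the title.
titles = names.str.extract(r"(\w+)\.", expand=False)
# Titles that are not keys of the table (Mr, Miss, ...) pass through unchanged.
grouped = titles.replace(TITLE_GROUPS)
print(grouped.tolist())
```

Keeping the grouping in one dictionary makes it easier to spot titles that fall through untouched.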
One-hot encode the title categories:
title_onehot = pd.get_dummies(train_test['Title'], prefix='Title')
train_test = pd.concat([train_test, title_onehot], axis=1)
One-hot encode sex:
sex_onehot = pd.get_dummies(train_test['Sex'], prefix='Sex')
train_test = pd.concat([train_test, sex_onehot], axis=1)
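Combine SibSp and Parch into a single feature for family size (the +1 counts the passenger themselves), since the analysis showed that passengers travelling with relatives survived at a higher rate than those travelling alone.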
train_test['FamilySize'] = train_test['SibSp'] + train_test['Parch'] + 1
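A claim like this is easy to re-check with a groupby on the new column. The sketch below runs on a tiny made-up frame, so its numbers mean nothing; it only shows the shape of the check you would run against the real training rows.

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative only, not taken from train.csv.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 2, 0, 1],
    "Survived": [0, 1, 0, 0, 1, 1],
})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# The mean of a 0/1 column is the survival rate within each family size.
rate_by_size = toy.groupby("FamilySize")["Survived"].mean()
print(rate_by_size)
```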
Fill the missing Embarked values with the mode, then one-hot encode:
train_test['Embarked'] = train_test['Embarked'].fillna(train_test['Embarked'].mode()[0])
embarked_onehot = pd.get_dummies(train_test['Embarked'], prefix='Embarked')
train_test = pd.concat([train_test, embarked_onehot], axis=1)
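Cabin has too many missing values to use directly, so for now we only use whether a cabin is recorded at all as a feature:
train_test['Cabin'] = train_test['Cabin'].fillna('NO')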
train_test['Cabin'] = np.where(train_test['Cabin'] == 'NO', 'NO', 'YES')
cabin_onehot = pd.get_dummies(train_test['Cabin'], prefix='Cabin')
train_test = pd.concat([train_test, cabin_onehot], axis=1)
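The same presence flag can be built without the intermediate `'NO'` placeholder by testing for missing values directly. A minimal sketch on made-up cabin values, assuming only that missing cabins are NaN as they are after `read_csv`:

```python
import numpy as np
import pandas as pd

cabin = pd.Series(["C85", np.nan, "E46", np.nan])  # made-up sample values

# notna() is True wherever a cabin is recorded.
has_cabin = np.where(cabin.notna(), "YES", "NO")
dummies = pd.get_dummies(pd.Series(has_cabin), prefix="Cabin")
print(dummies.columns.tolist())
```

This also avoids misclassifying a passenger if a real cabin code ever happened to be the string `'NO'`.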
Fill the missing Fare values with the mean fare of the same passenger class:
train_test['Fare'] = train_test['Fare'].fillna(train_test.groupby('Pclass')['Fare'].transform('mean'))
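The trick here is that `transform('mean')` returns a Series aligned row by row with the original frame, so it can be handed straight to `fillna`. A small sketch on made-up fares shows the alignment:

```python
import numpy as np
import pandas as pd

# Made-up fares; one third-class fare is missing.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare":   [80.0, 60.0, 8.0, np.nan, 10.0],
})

# One value per row: the mean fare of that row's class (NaN is skipped).
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```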
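Some tickets are group tickets shared by several passengers, so we split the fare evenly across everyone on the same ticket: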
shares = train_test.groupby('Ticket')['Fare'].transform('count')
train_test['Fare'] = train_test['Fare'] / shares
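As a quick illustration of the split, the sketch below uses made-up ticket numbers. The column name `FarePerPerson` is hypothetical and only there to keep the original fare visible next to the result; the post overwrites `Fare` in place.

```python
import pandas as pd

# Made-up tickets: three passengers share ticket A1, one travels on B2.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2"],
    "Fare":   [30.0, 30.0, 30.0, 7.5],
})

# For every row, how many rows carry the same ticket number.
shares = toy.groupby("Ticket")["Fare"].transform("count")
toy["FarePerPerson"] = toy["Fare"] / shares
print(toy)
```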
Bin the per-person fare into levels:
train_test.loc[train_test['Fare'] < 5, 'Fare'] = 0
train_test.loc[(train_test['Fare'] >= 5) & (train_test['Fare'] < 10), 'Fare'] = 1
train_test.loc[(train_test['Fare'] >= 10) & (train_test['Fare'] < 15), 'Fare'] = 2
train_test.loc[(train_test['Fare'] >= 15) & (train_test['Fare'] < 30), 'Fare'] = 3
train_test.loc[(train_test['Fare'] >= 30) & (train_test['Fare'] < 60), 'Fare'] = 4
train_test.loc[(train_test['Fare'] >= 60) & (train_test['Fare'] < 100), 'Fare'] = 5
train_test.loc[train_test['Fare'] >= 100, 'Fare'] = 6
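The chain of `.loc` assignments can also be written as a single `pd.cut` call. This is a sketch of an equivalent binning on made-up fares, not the code behind the submitted score; `edges` is my own name for the cut points read off the lines above.

```python
import numpy as np
import pandas as pd

fare = pd.Series([0.0, 4.99, 5.0, 12.5, 29.9, 30.0, 75.0, 100.0, 512.0])  # made-up values

# Cut points taken from the thresholds above, with open ends on both sides.
edges = [-np.inf, 5, 10, 15, 30, 60, 100, np.inf]
# right=False makes each bin closed on the left: [5, 10), [10, 15), ...
fare_level = pd.cut(fare, bins=edges, labels=False, right=False)
print(fare_level.tolist())
```

Besides being shorter, this form never compares already-binned values against the thresholds, so the order of the bins cannot matter.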
Use shares to build one more feature that separates passengers on a group ticket from those who bought a ticket on their own:
train_test['GroupTicket'] = np.where(shares == 1, 'NO', 'YES')
group_ticket_onehot = pd.get_dummies(train_test['GroupTicket'], prefix='GroupTicket')
train_test = pd.concat([train_test, group_ticket_onehot], axis=1)
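Age has a lot of missing values, so filling it directly with the mean or median is not a good fit. Instead we use machine learning models to estimate age from the other features.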
missing_age_df = pd.DataFrame(train_test[['Age', 'Parch', 'Sex', 'SibSp', 'FamilySize', 'Title', 'Fare', 'Pclass', 'Embarked']])
missing_age_df = pd.get_dummies(missing_age_df,columns=['Title', 'FamilySize', 'Sex', 'Pclass' ,'Embarked'])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    # Model 1: gradient boosting regressor
    gbm_reg = GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.01, max_features=3, random_state=42)
    gbm_reg.fit(missing_age_X_train, missing_age_Y_train)
    age_gb = gbm_reg.predict(missing_age_X_test)
    # Model 2: linear regression (the normalize argument was removed in scikit-learn 1.2)
    lrf_reg = LinearRegression(fit_intercept=True)
    lrf_reg.fit(missing_age_X_train, missing_age_Y_train)
    age_lrf = lrf_reg.predict(missing_age_X_test)
    # Average the two predictions row by row; without axis=0, np.mean collapses them into one scalar
    age_pred = np.mean([age_gb, age_lrf], axis=0)
    # Return a Series indexed like the rows with missing Age so the assignment below aligns
    return pd.Series(age_pred, index=missing_age_test.index)
train_test.loc[train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)
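One detail worth checking when averaging the predictions of two models with `np.mean` is the `axis` argument: it decides whether you get one average per passenger or a single number for everyone. A minimal sketch with hypothetical predictions:

```python
import numpy as np

pred_a = np.array([20.0, 30.0, 40.0])  # hypothetical predictions from model 1
pred_b = np.array([24.0, 34.0, 44.0])  # hypothetical predictions from model 2

# No axis: the two arrays are stacked and averaged into one scalar.
overall = np.mean([pred_a, pred_b])
# axis=0: average across the models, keeping one value per passenger.
per_row = np.mean([pred_a, pred_b], axis=0)

print(overall)
print(per_row)
```

With the scalar version every passenger with a missing age would receive the same imputed value, which defeats the point of fitting a model.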
Bin age into bands of nine years:
train_test.loc[train_test['Age'] < 9, 'Age'] = 0
train_test.loc[(train_test['Age'] >= 9) & (train_test['Age'] < 18), 'Age'] = 1
train_test.loc[(train_test['Age'] >= 18) & (train_test['Age'] < 27), 'Age'] = 2
train_test.loc[(train_test['Age'] >= 27) & (train_test['Age'] < 36), 'Age'] = 3
train_test.loc[(train_test['Age'] >= 36) & (train_test['Age'] < 45), 'Age'] = 4
train_test.loc[(train_test['Age'] >= 45) & (train_test['Age'] < 54), 'Age'] = 5
train_test.loc[(train_test['Age'] >= 54) & (train_test['Age'] < 63), 'Age'] = 6
train_test.loc[(train_test['Age'] >= 63) & (train_test['Age'] < 72), 'Age'] = 7
train_test.loc[(train_test['Age'] >= 72) & (train_test['Age'] < 81), 'Age'] = 8
train_test.loc[train_test['Age'] >= 81, 'Age'] = 9
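Because every band is nine years wide, the same labels can be computed arithmetically. This is a sketch of an equivalent one-liner on made-up ages, assuming ages are non-negative:

```python
import pandas as pd

age = pd.Series([0.5, 8.9, 9.0, 26.0, 44.9, 62.0, 80.0, 81.0, 95.0])  # made-up ages

# Floor-divide by the band width, then cap at 9 so everything from 81 up shares the top band.
age_band = (age // 9).clip(0, 9).astype(int)
print(age_band.tolist())
```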
Save the PassengerId of the test rows for the submission file:
passengerId_test = train_test['PassengerId'][891:]
Drop the features that are no longer needed:
train_test.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Title', 'Sex', 'Embarked', 'Cabin', 'Ticket', 'GroupTicket'], axis=1, inplace=True)
Split the data back into the training set (the first 891 rows) and the test set:
train = train_test[:891]
test = train_test[891:]
X_train = train.drop(['Survived'], axis=1)
y_train = train['Survived']
X_test = test.drop(['Survived'], axis=1)
Train a random forest, extremely randomized trees, and gradient boosted trees, then combine them with VotingClassifier to build the final model.
rf = RandomForestClassifier(n_estimators=500, max_depth=5, min_samples_split=13)
et = ExtraTreesClassifier(n_estimators=500, max_depth=7, min_samples_split=8)
gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.0135)
voting = VotingClassifier(estimators=[('rf', rf), ('et', et), ('gbm', gbm)], voting='soft')
voting.fit(X_train, y_train)
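With `voting='soft'`, the ensemble averages the class probabilities of its members and picks the class with the highest average. The sketch below reproduces that by hand on synthetic data from `make_classification` (not the Titanic features), using two members with small hypothetical settings to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier

# Synthetic binary classification data, used only to illustrate the mechanics.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
]
voting = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)

# Soft voting by hand: average the members' probabilities, then take the argmax.
avg_proba = np.mean([est.predict_proba(X) for est in voting.estimators_], axis=0)
manual_pred = voting.classes_[np.argmax(avg_proba, axis=1)]

print(np.array_equal(manual_pred, voting.predict(X)))
```

Soft voting needs every member to implement `predict_proba`, which all three tree ensembles used in this post do.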
Predict on the test set and write the submission file:
y_predict = voting.predict(X_test)
submission = pd.DataFrame({'PassengerId': passengerId_test, 'Survived': y_predict.astype(np.int32)})
submission.to_csv('submission.csv', index=False)