1: Improving Our Features

In the last mission, we made our first submission to Titanic: Machine Learning from Disaster, a machine learning competition on Kaggle.

Our submission wasn't very high-scoring, though. There are three main ways we can improve it:

  • Use a better machine learning algorithm.
  • Generate better features.
  • Combine multiple machine learning algorithms.

In this mission, we'll do all three. First, we'll swap out logistic regression for a different algorithm -- random forests.

2: Random Forest Introduction

As we alluded to in the previous mission, decision trees can pick up nonlinear tendencies in the data. Here's an example:

Age    Sex    Survived
5      0      1
30     1      0
70     0      1
20     0      1

As you can see, there isn't a linear correlation between Age and Survived -- someone who was 30 didn't survive, but people who were 70 and 20 did survive.

We can instead make a decision tree to model the relationship between Age, Sex, and Survived. You've probably seen decision trees or flowcharts before, and the decision tree algorithm isn't any different conceptually. We start with all of our data rows at the root of the tree, then make splits until the rows in each leaf can be classified accurately. Here's an example:

[Diagram: the tree first splits on Age -- rows where Age is less than 29 go left, rows where Age is over 29 go right. The left group is a leaf with outcome Survived = 1; the right group splits again on Sex into two leaves.]

In the above diagram, we take our initial data, and:

  • Make an initial split. Any row where Age is over 29 goes to the right, and any row where Age is less than 29 goes to the left.
  • The left group all survived, so we make it a leaf node, and assign the Survived outcome 1.
  • The right group didn't all have the same outcome, so we split again, based on the Sex column.
  • We end up with two leaf nodes on the right side -- one where everyone survived, and one where everyone didn't.

We could use this decision tree to figure out the survival outcome of a new row:

Age    Sex    Survived
40     0      ?

Based on our tree, we would first split to the right, then split to the left. We would predict that the person in the above row survived (1).
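
To make this concrete, here's the same tree written as a pair of nested conditionals -- a hand-rolled sketch of the diagram above, not the algorithm sklearn actually uses:

def predict_survived(age, sex):
    # First split: Age under 29 goes left, everyone else goes right.
    if age < 29:
        # The left group all survived.
        return 1
    # Second split (right group only): split on the Sex column.
    if sex == 0:
        # e.g. the 70-year-old in the sample data
        return 1
    # e.g. the 30-year-old in the sample data
    return 0

print(predict_survived(40, 0))  # prints 1, matching the prediction above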

Decision trees have a major flaw: they overfit the training data. Because we build a very "deep" decision tree in terms of splits, we end up with a lot of rules that are specific to the quirks of the training data, and that don't generalize to new data sets.

This is where the random forest algorithm can help. With random forests, we build hundreds of trees with slightly randomized input data, and slightly randomized split points. Each tree in a random forest gets a random subset of the overall training data. Each split point in each tree is performed on a random subset of the potential columns to split on. By averaging the predictions of all the trees, we get a stronger overall prediction and minimize overfitting.

3: Implementing A Random Forest

Thankfully for us, sklearn has a nice random forest implementation already. We can use it to construct a random forest and generate cross validated predictions on our dataset.

Instructions

Make cross validated predictions for the titanic dataframe (which has already been loaded in) using 3 folds.

  • Use the random forest algorithm stored in alg to do the cross validation.
  • Use predictors to predict the Survived column. Assign the result to scores.
  • You can use the cross_validation.cross_val_score function to do this.
    • You'll need to initialize an instance of KFold like we did in the last mission, and pass it into the cv keyword argument of the cross_val_score function.

After making cross validated predictions, print out the mean of scores.

Hint

Make predictions with:

kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

Run the code:

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters.
# n_estimators is the number of trees we want to make.
# min_samples_split is the minimum number of rows we need to make a split.
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree).
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)

# Compute the accuracy score for all the cross-validation folds. (Much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold).
print(scores.mean())

4: Parameter Tuning

The first, and easiest, thing we can do to improve the accuracy of the random forest is to increase the number of trees we're using. Training more trees takes more time, but because we're averaging many predictions made on different subsets of the data, more trees increase accuracy greatly (up to a point).

We can also tweak the min_samples_split and min_samples_leaf parameters to reduce overfitting. Because of how a decision tree works, splits that go all the way down, or overly deep in the tree, can result in fitting to quirks in the dataset rather than true signal. Increasing min_samples_split and min_samples_leaf can therefore reduce overfitting and improve our score, since we're making predictions on unseen data: a model that is less overfit and generalizes better will perform better on unseen data, but worse on seen data.

Instructions

We've changed the parameters used when we initialize alg. We'll need to re-run our model now:

  • Make cross validated predictions for the titanic dataframe using 3 folds.
  • Use predictors to predict the Survived column and assign the result to scores.
  • After making cross validated predictions, print out the mean of scores.

Hint

The parameters given in alg are different.

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)

# Compute the accuracy score for all the cross-validation folds. (Much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold).
print(scores.mean())

5: Generating New Features

We can also generate new features. Here are some ideas:

  • The length of the name -- this could relate to how rich the person was, and therefore their standing on the Titanic.
  • The total number of people in a family (SibSp + Parch).

An easy way to generate features is to use the .apply method on pandas dataframes. This applies a function you pass in to each element in a dataframe or series. We can pass in a lambda function, which enables us to define a function inline.

To write a lambda function, you write lambda x: len(x). x will take on the value of the input that is passed in -- in this case, the passenger name. The function to the right of the colon is then applied to x, and the result returned. The .apply method takes all of these outputs and constructs a pandas series from them. We can assign this series to a dataframe column.

Instructions

This step is a demo. Play around with code or advance to the next step.

# Generating a FamilySize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series.
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

6: Using The Title

We can extract the passengers' titles from their names. Titles take forms like Master., Mr., and Mrs. A few titles are very common, and there's a "long tail" of one-off titles that only one or two passengers have.

We'll first extract the titles with a regular expression, and then map each unique title to an integer value.

We'll then have a numeric column that corresponds to the appropriate Title.

Instructions

This step is a demo. Play around with code or advance to the next step.

import re
import pandas

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.
    # Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer. Some titles are very rare,
# and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic["Title"] = titles

7: Family Groups

We can also generate a feature indicating which family people are in. Because survival was likely highly dependent on your family and the people around you, this has a good chance of being a useful feature.

To get this, we'll concatenate someone's last name with FamilySize to get a unique family id. We'll then be able to assign a code to each person based on their family id.

Instructions

This step is a demo. Play around with code or advance to the next step.

import operator

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
family_ids = titanic.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all of the families with fewer than three members into one code.
family_ids[titanic["FamilySize"] < 3] = -1

# Print the count of each unique id.
print(pandas.value_counts(family_ids))

titanic["FamilyId"] = family_ids

8: Finding The Best Features

Feature engineering is the most important part of any machine learning task, and there are lots more features we could calculate. But we also need a way to figure out which features are the best.

One way to do this is to use univariate feature selection. This essentially goes column by column, and figures out which columns correlate most closely with what we're trying to predict (Survived).

As usual, sklearn has a function that will help us with feature selection, SelectKBest. This selects the best features from the data, and allows us to specify how many it selects.

Instructions

We've updated predictors. Make cross validated predictions for the titanic dataframe using 3 folds.

  • Use predictors to predict the Survived column and assign the result to scores.

After making cross validated predictions, print out the mean of scores.

Hint

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

will make predictions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId", "NameLength"]

# Perform feature selection.
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores.
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)

# Compute the accuracy score for all the cross-validation folds. (Much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold).
print(scores.mean())

9: Gradient Boosting

Another method that builds on decision trees is a gradient boosting classifier. Boosting involves training decision trees one after another, and feeding the errors from one tree into the next tree. So each tree is building on all the other trees that came before it. This can lead to overfitting if we build too many trees, though. As you get above 100 trees or so, it's very easy to overfit and train to quirks in the dataset. As our dataset is extremely small, we'll limit the tree count to just 25.

Another way to limit overfitting is to limit the depth to which each tree in the gradient boosting process can be built. We'll limit the tree depth to 3 to avoid overfitting.

We'll try boosting instead of our random forest approach and see if we can improve our accuracy.
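
There isn't a separate code screen for this step, but a minimal sketch looks like the following (assuming the titanic dataframe, the predictors list, and the older sklearn cross_validation API used throughout this mission):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import cross_validation

# Limit the number of trees to 25 and each tree's depth to 3 to avoid overfitting.
alg = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)

kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())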

10: Ensembling

One thing we can do to improve the accuracy of our predictions is to ensemble different classifiers. Ensembling means that we generate predictions using information from a set of classifiers, instead of just one. In practice, this means that we average their predictions.

Generally, the more diverse the models we ensemble, the higher our accuracy will be. Diversity means that the models generate their results from different columns, or use a very different method to generate predictions. Ensembling a random forest classifier with a decision tree probably won't work extremely well, because they are very similar. On the other hand, ensembling a linear regression with a random forest can work very well.

One caveat with ensembling is that the classifiers we use have to be about the same in terms of accuracy. Ensembling one classifier that is much worse than another probably will make the final result worse.

In this case, we'll ensemble logistic regression trained on the most linear predictors (the ones that have a linear ordering, and some correlation to Survived), and a gradient boosted tree trained on all of the predictors.

We'll keep things simple when we ensemble -- we'll average the raw probabilities (from 0 to 1) that we get from our classifiers, and then assume that anything above .5 maps to one, and anything below or equal to .5 maps to 0.

Instructions

This step is a demo. Play around with code or advance to the next step.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross-validation folds.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold.
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and .5 or below is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy as the fraction of predictions that match the training data.
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)

11: Matching Our Changes On The Test Set

There are a lot of things we could do to make this analysis better that we'll talk about at the end, but for now, let's make a submission.

The first step is matching all our training set changes on the test set data, like we did in the last mission. We've read the test set into titanic_test. We'll have to match our changes:

  • Generate the NameLength column, which is how long the name is.
  • Generate the FamilySize column, showing how large a family is.
  • Add in the Title column, keeping the same mapping that we had before.
  • Add in a FamilyId column, keeping the ids consistent across the train and test sets.

Instructions

  • Add the NameLength column to titanic_test.
    • Do this the same way we did it with the titanic dataframe.

Hint

You should be able to use our code from an earlier screen, with minor modifications.

# First, we'll add titles to the test set.
titles = titanic_test["Name"].apply(get_title)

# We're adding the Dona title to the mapping, because it's in the test set, but not the training set.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles

# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))

# Now we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# Add the NameLength column, the same way we did for the titanic dataframe.
titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))

# Now we can add family ids.
# We'll use the same ids that we did earlier.
print(family_id_mapping)

family_ids = titanic_test.apply(get_family_id, axis=1)
family_ids[titanic_test["FamilySize"] < 3] = -1
titanic_test["FamilyId"] = family_ids

12: Predicting On The Test Set

We have some better predictions now, so let's create another submission.

Instructions

  • Convert the predictions to either 0 or 1: predictions less than or equal to .5 become 0, and predictions greater than .5 become 1.
  • Then, convert the predictions to integers using the .astype(int) method -- if you don't, Kaggle will give you a score of 0.
  • Finally, create a submission dataframe where the first column is PassengerId, and the second column is Survived (this will be the predictions).

Hint

Generate the submission dataframe with:

submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions
})

Run the code:

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

algorithms = [
[GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
[LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
] full_predictions = []
for alg, predictors in algorithms:
# Fit the algorithm using the full training data.
alg.fit(titanic[predictors], titanic["Survived"])
# Predict using the test dataset. We have to convert all the columns to floats to avoid an error.
predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
full_predictions.append(predictions) # The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions[predictions <= .5] = 0
predictions[predictions > .5] = 1
predictions = predictions.astype(int)
submission = pandas.DataFrame({
"PassengerId": titanic_test["PassengerId"],
"Survived": predictions
})

13: Final Thoughts

Now, we have a submission! It should get you a score of .799 on the leaderboard. You can generate a submission file with submission.to_csv("kaggle.csv", index=False).

There's still more work you can do in feature engineering:

  • Try using features related to the cabins.
  • See if any family size features might help -- does the number of women in a family make the whole family more likely to survive?
  • Does the national origin of the passenger's name have anything to do with survival?

There's also a lot more we can do on the algorithm side:

  • Try the random forest classifier in the ensemble.
  • A support vector machine might work well with this data.
  • We could try neural networks.
  • Boosting with a different base classifier might work better.

And with ensembling methods:

  • Could majority voting be a better ensembling method than averaging probabilities? (See the sketch below.)
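
As a starting point, here's a minimal majority-voting sketch (the majority_vote helper is hypothetical, and assumes each classifier in the list has already been fit):

import numpy as np

def majority_vote(classifiers, X):
    # Each fitted classifier casts a 0/1 vote for every row of X.
    votes = np.array([clf.predict(X) for clf in classifiers])
    # The majority wins; with an even number of voters, ties round up to 1 here.
    return (votes.mean(axis=0) >= 0.5).astype(int)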

This dataset is very easy to overfit because there isn't a lot of data, so you'll be grinding for small accuracy gains. You could also try a different Kaggle competition with more data and richer features to dig into.

Hope you enjoyed this tutorial, and good luck with the machine learning competitions!
