1: Improving Our Features

In the last mission, we made our first submission to Titanic: Machine Learning from Disaster, a machine learning competition on Kaggle.

Our submission wasn't very high-scoring, though. There are three main ways we can improve it:

  • Use a better machine learning algorithm.
  • Generate better features.
  • Combine multiple machine learning algorithms.

In this mission, we'll do all three. First, we'll swap out logistic regression for a different algorithm -- random forests.

2: Random Forest Introduction

As we alluded to in the previous mission, decision trees can pick up nonlinear tendencies in the data. Here's an example:

Age    Sex    Survived
5      0      1
30     1      0
70     0      1
20     0      1

As you can see, there isn't a linear correlation between Age and Survived -- someone who was 30 didn't survive, but people who were 70 and 20 did survive.

We can instead make a decision tree to model the relationship between Age, Sex, and Survived. You've probably seen decision trees or flowcharts before, and the decision tree algorithm isn't any different conceptually. We start with all of our data rows at the root of the tree, then make splits until the rows in each leaf can be classified accurately. Here's an example:

[Diagram: the tree first splits on Age -- rows where Age is less than 29 go left, rows where Age is over 29 go right. The left group is a leaf with outcome Survived = 1; the right group splits again on Sex into two leaves.]

In the above diagram, we take our initial data, and:

  • Make an initial split. Any row where Age is over 29 goes to the right, and any row where Age is less than 29 goes to the left.
  • The left group all survived, so we make it a leaf node, and assign the Survived outcome 1.
  • The right group didn't all have the same outcome, so we split again, based on the Sex column.
  • We end up with two leaf nodes on the right side -- one where everyone survived, and one where everyone didn't.

We could use this decision tree to figure out the survival outcome of a new row:

Age    Sex    Survived
40     0      ?

Based on our tree, we would first split to the right, then split to the left. We would predict that the person in the above row survived (1).
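
To make this concrete, here's the same tree written as a pair of nested conditionals -- a hand-rolled sketch of the diagram above, not the algorithm sklearn actually uses:

def predict_survived(age, sex):
    # First split: Age under 29 goes left, everyone else goes right.
    if age < 29:
        # The left group all survived.
        return 1
    # Second split (right group only): split on the Sex column.
    if sex == 0:
        # e.g. the 70-year-old in the sample data
        return 1
    # e.g. the 30-year-old in the sample data
    return 0

print(predict_survived(40, 0))  # prints 1, matching the prediction above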

Decision trees have a major flaw: they overfit the training data. Because we build a very "deep" decision tree in terms of splits, we end up with a lot of rules that are specific to the quirks of the training data, and that don't generalize to new data sets.

This is where the random forest algorithm can help. With random forests, we build hundreds of trees with slightly randomized input data, and slightly randomized split points. Each tree in a random forest gets a random subset of the overall training data. Each split point in each tree is performed on a random subset of the potential columns to split on. By averaging the predictions of all the trees, we get a stronger overall prediction and minimize overfitting.

3: Implementing A Random Forest

Thankfully for us, sklearn has a nice random forest implementation already. We can use it to construct a random forest and generate cross validated predictions on our dataset.

Instructions

Make cross validated predictions for the titanic dataframe (which has already been loaded in) using 3 folds.

  • Use the random forest algorithm stored in alg to do the cross validation.
  • Use predictors to predict the Survived column. Assign the result to scores.
  • You can use the cross_validation.cross_val_score function to do this.
    • You'll need to initialize an instance of KFold like we did in the last mission, and pass it into the cv keyword argument of the cross_val_score function.

After making cross validated predictions, print out the mean of scores.

Hint

Make predictions with:

kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

Run the code:

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters.
# n_estimators is the number of trees we want to make.
# min_samples_split is the minimum number of rows we need to make a split.
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree).
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)

# Compute the accuracy score for all the cross-validation folds. (Much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold).
print(scores.mean())

4: Parameter Tuning

The first, and easiest, thing we can do to improve the accuracy of the random forest is to increase the number of trees we're using. Training more trees takes more time, but because we're averaging many predictions made on different subsets of the data, more trees increase accuracy greatly (up to a point).

We can also tweak the min_samples_split and min_samples_leaf parameters to reduce overfitting. Because of how a decision tree works, splits that go all the way down, or overly deep in the tree, can result in fitting to quirks in the dataset rather than true signal. Increasing min_samples_split and min_samples_leaf can therefore reduce overfitting and improve our score, since we're making predictions on unseen data: a model that is less overfit and generalizes better will perform better on unseen data, but worse on seen data.

Instructions

We've changed the parameters used when we initialize alg. We'll need to re-run our model now:

  • Make cross validated predictions for the titanic dataframe using 3 folds.
  • Use predictors to predict the Survived column and assign the result to scores.
  • After making cross validated predictions, print out the mean of scores.

Hint

The parameters given in alg are different.

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)

# Compute the accuracy score for all the cross-validation folds. (Much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold).
print(scores.mean())

5: Generating New Features

We can also generate new features. Here are some ideas:

  • The length of the name -- this could relate to how rich the person was, and therefore their standing on the Titanic.
  • The total number of people in a family (SibSp + Parch).

An easy way to generate features is to use the .apply method on pandas dataframes. This applies a function you pass in to each element in a dataframe or series. We can pass in a lambda function, which enables us to define a function inline.

To write a lambda function, you write lambda x: len(x). x will take on the value of the input that is passed in -- in this case, the passenger name. The function to the right of the colon is then applied to x, and the result returned. The .apply method takes all of these outputs and constructs a pandas series from them. We can assign this series to a dataframe column.

Instructions

This step is a demo. Play around with code or advance to the next step.

# Generating a FamilySize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series.
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

6: Using The Title

We can extract the passengers' titles from their names. Titles take forms like Master., Mr., and Mrs. A few titles are very common, and there's a "long tail" of one-off titles that only one or two passengers have.

We'll first extract the titles with a regular expression, and then map each unique title to an integer value.

We'll then have a numeric column that corresponds to the appropriate Title.

Instructions

This step is a demo. Play around with code or advance to the next step.

import re
import pandas

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.
    # Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer. Some titles are very rare,
# and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic["Title"] = titles

7: Family Groups

We can also generate a feature indicating which family people are in. Because survival was likely highly dependent on your family and the people around you, this has a good chance of being a useful feature.

To get this, we'll concatenate someone's last name with FamilySize to get a unique family id. We'll then be able to assign a code to each person based on their family id.

Instructions

This step is a demo. Play around with code or advance to the next step.

import operator

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
family_ids = titanic.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all of the families with fewer than three members into one code.
family_ids[titanic["FamilySize"] < 3] = -1

# Print the count of each unique id.
print(pandas.value_counts(family_ids))

titanic["FamilyId"] = family_ids

8: Finding The Best Features

Feature engineering is the most important part of any machine learning task, and there are lots more features we could calculate. But we also need a way to figure out which features are the best.

One way to do this is to use univariate feature selection. This essentially goes column by column, and figures out which columns correlate most closely with what we're trying to predict (Survived).

As usual, sklearn has a function that will help us with feature selection, SelectKBest. This selects the best features from the data, and allows us to specify how many it selects.

Instructions

We've updated predictors. Make cross validated predictions for the titanic dataframe using 3 folds.

  • Use predictors to predict the Survived column and assign the result to scores.

After making cross validated predictions, print out the mean of scores.

Hint

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

will make predictions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId", "NameLength"]

# Perform feature selection.
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores.
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)

# Compute the accuracy score for all the cross-validation folds. (Much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold).
print(scores.mean())

9: Gradient Boosting

Another method that builds on decision trees is a gradient boosting classifier. Boosting involves training decision trees one after another, and feeding the errors from one tree into the next tree. So each tree is building on all the other trees that came before it. This can lead to overfitting if we build too many trees, though. As you get above 100 trees or so, it's very easy to overfit and train to quirks in the dataset. As our dataset is extremely small, we'll limit the tree count to just 25.

Another way to limit overfitting is to limit the depth to which each tree in the gradient boosting process can be built. We'll limit the tree depth to 3 to avoid overfitting.

We'll try boosting instead of our random forest approach and see if we can improve our accuracy.
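
There isn't a separate code screen for this step, but a minimal sketch looks like the following (assuming the titanic dataframe, the predictors list, and the older sklearn cross_validation API used throughout this mission):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import cross_validation

# Limit the number of trees to 25 and each tree's depth to 3 to avoid overfitting.
alg = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)

kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())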

10: Ensembling

One thing we can do to improve the accuracy of our predictions is to ensemble different classifiers. Ensembling means that we generate predictions using information from a set of classifiers, instead of just one. In practice, this means that we average their predictions.

Generally, the more diverse the models we ensemble, the higher our accuracy will be. Diversity means that the models generate their results from different columns, or use a very different method to generate predictions. Ensembling a random forest classifier with a decision tree probably won't work extremely well, because they are very similar. On the other hand, ensembling a linear regression with a random forest can work very well.

One caveat with ensembling is that the classifiers we use have to be about the same in terms of accuracy. Ensembling one classifier that is much worse than another probably will make the final result worse.

In this case, we'll ensemble logistic regression trained on the most linear predictors (the ones that have a linear ordering, and some correlation to Survived), and a gradient boosted tree trained on all of the predictors.

We'll keep things simple when we ensemble -- we'll average the raw probabilities (from 0 to 1) that we get from our classifiers, and then assume that anything above .5 maps to one, and anything below or equal to .5 maps to 0.

Instructions

This step is a demo. Play around with code or advance to the next step.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross-validation folds.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold.
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and .5 or below is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy as the fraction of predictions that match the training data.
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)

11: Matching Our Changes On The Test Set

There are a lot of things we could do to make this analysis better that we'll talk about at the end, but for now, let's make a submission.

The first step is matching all our training set changes on the test set data, like we did in the last mission. We've read the test set into titanic_test. We'll have to match our changes:

  • Generate the NameLength column, which is how long the name is.
  • Generate the FamilySize column, showing how large a family is.
  • Add in the Title column, keeping the same mapping that we had before.
  • Add in a FamilyId column, keeping the ids consistent across the train and test sets.

Instructions

  • Add the NameLength column to titanic_test.
    • Do this the same way we did it with the titanic dataframe.

Hint

You should be able to use our code from an earlier screen, with minor modifications.

# First, we'll add titles to the test set.
titles = titanic_test["Name"].apply(get_title)

# We're adding the Dona title to the mapping, because it's in the test set, but not the training set.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles

# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))

# Now we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# Add the NameLength column, the same way we did for the titanic dataframe.
titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))

# Now we can add family ids.
# We'll use the same ids that we did earlier.
print(family_id_mapping)

family_ids = titanic_test.apply(get_family_id, axis=1)
family_ids[titanic_test["FamilySize"] < 3] = -1
titanic_test["FamilyId"] = family_ids

12: Predicting On The Test Set

We have some better predictions now, so let's create another submission.

Instructions

  • Convert the predictions to either 0 or 1: predictions less than or equal to .5 become 0, and predictions greater than .5 become 1.
  • Then, convert the predictions to integers using the .astype(int) method -- if you don't, Kaggle will give you a score of 0.
  • Finally, create a submission dataframe where the first column is PassengerId, and the second column is Survived (this will be the predictions).

Hint

Generate the submission dataframe with:

submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions
})

Run the code:

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

algorithms = [
[GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
[LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
] full_predictions = []
for alg, predictors in algorithms:
# Fit the algorithm using the full training data.
alg.fit(titanic[predictors], titanic["Survived"])
# Predict using the test dataset. We have to convert all the columns to floats to avoid an error.
predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
full_predictions.append(predictions) # The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions[predictions <= .5] = 0
predictions[predictions > .5] = 1
predictions = predictions.astype(int)
submission = pandas.DataFrame({
"PassengerId": titanic_test["PassengerId"],
"Survived": predictions
})

13: Final Thoughts

Now, we have a submission! It should get you a score of .799 on the leaderboard. You can generate a submission file with submission.to_csv("kaggle.csv", index=False).

There's still more work you can do in feature engineering:

  • Try using features related to the cabins.
  • See if any family size features might help -- does the number of women in a family make the whole family more likely to survive?
  • Does the national origin of the passenger's name have anything to do with survival?

There's also a lot more we can do on the algorithm side:

  • Try the random forest classifier in the ensemble.
  • A support vector machine might work well with this data.
  • We could try neural networks.
  • Boosting with a different base classifier might work better.

And with ensembling methods:

  • Could majority voting be a better ensembling method than averaging probabilities? (See the sketch below.)
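
As a starting point, here's a minimal majority-voting sketch (the majority_vote helper is hypothetical, and assumes each classifier in the list has already been fit):

import numpy as np

def majority_vote(classifiers, X):
    # Each fitted classifier casts a 0/1 vote for every row of X.
    votes = np.array([clf.predict(X) for clf in classifiers])
    # The majority wins; with an even number of voters, ties round up to 1 here.
    return (votes.mean(axis=0) >= 0.5).astype(int)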

This dataset is very easy to overfit because there isn't a lot of data, so you'll be grinding for small accuracy gains. You could also try a different Kaggle competition with more data and richer features to dig into.

Hope you enjoyed this tutorial, and good luck with the machine learning competitions!
