Getting Started with Kaggle 2: Improving Features

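1: Improving our features

In the previous mission, we made our first submission to a Kaggle machine learning competition, Titanic: Machine Learning from Disaster.

Our score, however, was not very high. There are three main ways to improve it:

Use a better machine learning algorithm;
Generate better features;
Combine multiple machine learning algorithms.

In this mission, we will do all three. First, we will switch from logistic regression to a different algorithm: random forests.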

2: Introduction to random forests

As we mentioned briefly in the previous mission, decision trees can learn nonlinear trends in data. An example:

Age	Sex	Survived
5	0	1
30	1	0
70	0	1
20	0	1

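As you can see, Age and Survived are not linearly related: the 30-year-old did not survive, but the 70-year-old and the 20-year-old did.

Instead, we can build a decision tree model of the relationship between Age, Sex, and Survived. You may have seen decision trees or flowcharts before, and the decision tree algorithm is conceptually no different. We start with all of our rows at the root of the tree, then keep splitting until every row sits in a leaf node with a single outcome. An example:

In the tree for this example, we took our initial data and:

Made an initial split. All rows where Age is over 29 go to the right, and all rows where Age is under 29 go to the left.
The left group is all survivors, so we make it a leaf node and assign it the Survived outcome 1.
The right group does not all share the same outcome, so we split it again on the Sex column.
We end up with two leaf nodes on the right: one all survivors, the other all non-survivors.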

Given a new row, we can use this decision tree to work out the survival outcome.

Age	Sex	Survived
40	0	?

Following our tree, we first go right, then left. We predict that the person above would survive (Survived = 1).
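The tree described above is small enough to write out by hand. The following is a minimal sketch, not the output of any library: the function name predict_survived is made up for illustration, and the Age threshold of 29 is taken from the split described in the text.

```python
# Hand-written version of the example tree (illustrative only).
def predict_survived(age, sex):
    """Walk the example tree: split on Age at 29, then on Sex."""
    if age < 29:
        # Left branch: every training row here survived.
        return 1
    # Right branch: mixed outcomes, so split again on Sex.
    if sex == 0:
        return 1  # left leaf: all survivors
    return 0      # right leaf: no survivors

# The four training rows from the table: (Age, Sex, Survived)
training_rows = [(5, 0, 1), (30, 1, 0), (70, 0, 1), (20, 0, 1)]
for age, sex, survived in training_rows:
    assert predict_survived(age, sex) == survived

# The new row (Age 40, Sex 0) goes right, then left.
print(predict_survived(40, 0))
```

The tree reproduces all four training rows exactly, which is also a hint at how easily a tree can memorize its training data.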

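A major drawback of decision trees is that they overfit the training data. Because we build a deep tree by splitting repeatedly, we end up with a large number of rules that are based on quirks of the specific training set and do not generalize to new data.

Random forests can help. With random forests, we build many trees from slightly randomized input data, with slightly randomized split points. Each tree in a random forest is trained on a random subset of the full training data.

Each split point in each tree is chosen from a random subset of the candidate columns. By averaging the predictions of all the trees, we get a better overall prediction and less overfitting.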
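The two sources of randomness and the final averaging can be sketched in plain Python. This is a deliberately simplified illustration, not how sklearn implements random forests: each "tree" here is a single-split stump, rows are drawn with replacement (an assumption of this sketch), and the names fit_stump, fit_forest and predict_proba are made up for the example.

```python
import random

def fit_stump(rows, feature):
    """Fit a one-split tree on one feature.

    Returns (feature, threshold, left_label, right_label).
    The last element of each row is the label.
    """
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [row[-1] for row in rows if row[feature] <= threshold]
        right = [row[-1] for row in rows if row[feature] > threshold]
        # Each side predicts its majority label (ties and empty sides give 0).
        left_label = int(sum(left) * 2 > len(left))
        right_label = int(sum(right) * 2 > len(right))
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def fit_forest(rows, n_features, n_trees=25, seed=1):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(rows, k=len(rows))  # random rows for this tree
        feature = rng.randrange(n_features)      # random column for its split
        forest.append(fit_stump(sample, feature))
    return forest

def predict_proba(forest, row):
    """Average the 0/1 votes of all trees."""
    votes = [left if row[feature] <= threshold else right
             for feature, threshold, left, right in forest]
    return sum(votes) / len(votes)

# Rows from the table above: (Age, Sex, Survived)
rows = [(5, 0, 1), (30, 1, 0), (70, 0, 1), (20, 0, 1)]
forest = fit_forest(rows, n_features=2)
print(predict_proba(forest, (40, 0)))
```

No single stump sees all the data or all the columns, so individual trees disagree; the average of their votes is what the forest reports.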

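3: Implementing a random forest

Thankfully, sklearn already has a good implementation of random forests. We can use it to build a random forest on our data and generate cross-validated predictions.

Instructions

Make cross-validated predictions on the titanic dataframe (which has already been loaded). Use 3 folds. Use the random forest algorithm stored in alg for the cross validation. Use predictors to predict the Survived column. Assign the result to scores.

You can use cross_validation.cross_val_score to do this.

After making the cross-validated predictions, print the mean of scores.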

from sklearn import cross_validation

from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters

# n_estimators is the number of trees we want to make

# min_samples_split is the minimum number of rows we need to make a split

# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)

alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)

Answer:

from sklearn import cross_validation

from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters

# n_estimators is the number of trees we want to make

# min_samples_split is the minimum number of rows we need to make a split

# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)

alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)

# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)

print(scores.mean())
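What a 3-fold cross-validated score means can be pictured with a small standalone sketch. It is a simplified picture under stated assumptions: folds are contiguous and unshuffled, the helper names k_fold_indices and cross_val_mean are made up, and score_fold stands in for "fit on the training rows, score on the test rows". The library's own fold assignment may differ, for example by stratifying on the label.

```python
def k_fold_indices(n_rows, n_folds=3):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size

def cross_val_mean(score_fold, n_rows, n_folds=3):
    """Average a per-fold score over all folds."""
    scores = [score_fold(train, test)
              for train, test in k_fold_indices(n_rows, n_folds)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), len(test))
```

Every row lands in exactly one test fold, so each prediction is made by a model that never saw that row during training.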

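4: Parameter tuning

The first and simplest thing we can do to improve the accuracy of the random forest is to increase the number of trees. Training more trees takes more time, but because we are averaging many predictions made on different subsets of the data, more trees will improve accuracy considerably (up to a point).

We can also tweak the min_samples_split and min_samples_leaf variables to reduce overfitting.

Because of how a decision tree works (as explained in the video), splitting all the way down makes the tree very deep, so it fits quirks of the dataset rather than true signal. Increasing min_samples_split and min_samples_leaf can therefore reduce overfitting, which will actually improve our score when we make predictions on unseen data. A model that overfits less generalizes better: it does better on unseen data, but worse on data it has already seen.

Instructions

To improve accuracy, we have tweaked the parameters passed when initializing alg. Make cross-validated predictions on the titanic dataframe. Use predictors to predict the Survived column. Assign the result to scores.

After making the cross-validated predictions, print the mean of scores.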

alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)

Answer:

alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)

# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)

print(scores.mean())

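5: Generating new features

We can also generate new features. Some ideas:

The length of the name: this could reflect how wealthy the person was, and therefore their position on the Titanic.
The total number of people in a family (SibSp + Parch).

An easy way to generate features is the .apply method on pandas dataframes. It applies a function you pass in to each element of a dataframe or series. We can also pass in a lambda function, which lets us define a function anonymously.

The syntax of an anonymous function is lambda x: len(x). x takes the value passed in as input, in this case the passenger's name. The expression to the right of the colon is applied to x and its result is returned. The .apply method collects all of these outputs and constructs a pandas series from them. We can assign this series to a dataframe column.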

demo:

# Generating a familysize column

titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series

titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
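To see that a lambda is just an unnamed function, the same computation can be written both ways without pandas. This is a small sketch on two sample names; name_length is a made-up helper, and the built-in map plays the role that .apply plays on a series.

```python
names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]

# A named function...
def name_length(x):
    return len(x)

# ...and the equivalent anonymous function, applied element by element.
with_def = [name_length(n) for n in names]
with_lambda = list(map(lambda x: len(x), names))

print(with_def == with_lambda, with_lambda)
```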

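6: Using the title

We can extract each passenger's title from their name. Titles take the form Master., Mr., Mrs., and so on. There are a few very common titles, and a long tail of one-off titles that only one or two passengers have. We first extract the titles with a regular expression, then map each unique title to an integer value.

We will end up with a numeric column that corresponds to each passenger's title.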

import re

# A function to get the title from a name.

def get_title(name):

    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.

    title_search = re.search(r' ([A-Za-z]+)\.', name)

    # If the title exists, extract and return it.

    if title_search:

        return title_search.group(1)

    return ""

# Get all the titles and print how often each one occurs.

titles = titanic["Name"].apply(get_title)

print(pandas.value_counts(titles))

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}

for k,v in title_mapping.items():

    titles[titles == k] = v

# Verify that we converted everything.

print(pandas.value_counts(titles))

# Add in the title column.

titanic["Title"] = titles
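The extraction step can be tried on a few names without the dataframe. The sketch below uses the same regular expression; the shortened mapping, the UNKNOWN_TITLE fallback code and the title_code helper are assumptions added for illustration and are not part of the tutorial's mapping.

```python
import re

# A space, then letters, then a literal period: " Mr.", " Miss.", ...
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def get_title(name):
    """Return the title found in a name, or an empty string if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4}  # shortened for the example
UNKNOWN_TITLE = 0  # hypothetical fallback code for titles missing from the mapping

def title_code(name):
    return title_mapping.get(get_title(name), UNKNOWN_TITLE)

for name in ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "No title here"]:
    print(name, "->", repr(get_title(name)), title_code(name))
```

Using a fallback code is one way to keep the column numeric when a name carries a title the mapping has not seen.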

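7: Family groups

We can also generate a feature indicating which family people belong to. Because survival was likely highly dependent on your family and the people around you, this has a good chance of being a useful feature.

To do this, we concatenate each person's last name with FamilySize to get a unique family id. We can then assign each person a code based on their family id.

demo: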

import operator

# A dictionary mapping family name to id

family_id_mapping = {}

# A function to get the id given a row

def get_family_id(row):

    # Find the last name by splitting on a comma

    last_name = row["Name"].split(",")[0]

    # Create the family id

    family_id = "{0}{1}".format(last_name, row["FamilySize"])

    # Look up the id in the mapping

    if family_id not in family_id_mapping:

        if len(family_id_mapping) == 0:

            current_id = 1

        else:

            # Get the maximum id from the mapping and add one to it if we don't have an id

            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)

        family_id_mapping[family_id] = current_id

    return family_id_mapping[family_id]

# Get the family ids with the apply method

family_ids = titanic.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all of the families under 3 members into one code.

family_ids[titanic["FamilySize"] < 3] = -1

# Print the count of each unique id.

print(pandas.value_counts(family_ids))

titanic["FamilyId"] = family_ids
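The id assignment can be followed on a handful of rows. This is a sketch with made-up passengers; make_family_id_assigner is a hypothetical helper, and it uses len(mapping) + 1 for the next id, which gives the same result as taking the maximum existing id plus one because ids are handed out consecutively from 1.

```python
def make_family_id_assigner():
    """Return an assign(name, family_size) function and the mapping it fills in."""
    mapping = {}

    def assign(name, family_size):
        last_name = name.split(",")[0]
        key = "{0}{1}".format(last_name, family_size)
        if key not in mapping:
            mapping[key] = len(mapping) + 1  # next unused id
        return mapping[key]

    return assign, mapping

assign, mapping = make_family_id_assigner()

# Made-up rows: (Name, FamilySize)
passengers = [("Smith, Mr. John", 3), ("Smith, Mrs. Jane", 3), ("Jones, Miss. Amy", 0)]
ids = [assign(name, size) for name, size in passengers]

# Collapse families with FamilySize under 3 into one shared code, as in the demo.
ids = [fid if size >= 3 else -1 for fid, (_, size) in zip(ids, passengers)]
print(ids)
```

Because the mapping persists between calls, running the same function on another dataset later reuses the ids already assigned.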

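8: Finding the best features

Feature engineering is the most important part of any machine learning task, and there are many more features we could compute. But we also need a way to figure out which features are the best.

One way is univariate feature selection. This essentially goes column by column and works out which columns correlate most closely with what we are trying to predict (Survived).

As usual, sklearn has something that helps: SelectKBest, which selects the best features from the data and lets us specify how many to select.

Instructions

We have updated predictors. Make cross-validated predictions for the titanic dataframe using 3 folds. Use predictors to predict the Survived column. Assign the result to scores.

After making the cross-validated predictions, print the mean of scores.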

import numpy as np

import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

# Perform feature selection

selector = SelectKBest(f_classif, k=5)

selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores

scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?

plt.bar(range(len(predictors)), scores)

plt.xticks(range(len(predictors)), predictors, rotation='vertical')

plt.show()

# Pick only the four best features.

predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=8, min_samples_leaf=4)

Answer:

import numpy as np

import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

# Perform feature selection

selector = SelectKBest(f_classif, k=5)

selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores

scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?

plt.bar(range(len(predictors)), scores)

plt.xticks(range(len(predictors)), predictors, rotation='vertical')

plt.show()

# Pick only the four best features.

predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=8, min_samples_leaf=4)

# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)

print(scores.mean())
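The "one column at a time" scoring can be illustrated with a hand-written one-way ANOVA F statistic, the quantity that f_classif is documented as computing. This is a simplified sketch on made-up toy columns: anova_f is a hypothetical helper, it skips the p-value step, and it assumes the within-group variance is not zero.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature column against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the group means around the overall mean.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    # Variation of the values around their own group mean.
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))

survived = [0, 0, 0, 1, 1, 1]        # made-up labels
informative = [1, 2, 1, 8, 9, 8]     # differs clearly between the two classes
uninformative = [1, 5, 9, 2, 5, 8]   # same mean in both classes

f_informative = anova_f(informative, survived)
f_uninformative = anova_f(uninformative, survived)
print(f_informative, f_uninformative)
```

A column whose values separate the classes gets a large F statistic and therefore a small p-value, which is why taking -log10 of the p-values produces taller bars for the better features.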

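9: Gradient boosting

Another method builds on decision trees: the gradient boosting classifier. Boosting involves training decision trees one after another and feeding the errors from one tree into the next, so each tree builds on all the trees before it. This can lead to overfitting if we build too many trees, though. Once you get past 100 trees or so, it is very easy to overfit and train to quirks in the dataset. Because our dataset is extremely small, we will limit the tree count to 25.

Another way to limit overfitting is to limit the depth to which each tree in the gradient boosting process can be built. We will limit the tree depth to 3 to avoid overfitting.

We will try boosting in place of our random forest approach and see whether it improves our accuracy.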
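The idea of feeding one tree's errors into the next can be sketched in a few lines. This is a simplified illustration, not what GradientBoostingClassifier does internally: it uses one-split stumps, a single feature, and squared error on the raw 0/1 labels, and the names fit_stump, boost and mse as well as the learning rate of 0.5 are assumptions of the sketch.

```python
def fit_stump(xs, residuals):
    """Find the threshold on xs that best fits the residuals with two constants."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_trees=25, learning_rate=0.5):
    base = sum(ys) / len(ys)       # start from the mean outcome
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # The next tree is trained on the errors left by the trees so far.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

ages = [5, 30, 70, 20]
survived = [1, 0, 1, 1]
base, stumps, preds = boost(ages, survived)
print(mse(survived, [base] * len(ages)), mse(survived, preds))
```

The training error keeps shrinking as trees are added, which is exactly why the tree count and depth need to be capped on a small dataset.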

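10: Ensembling

One thing we can do to improve the accuracy of our predictions is to ensemble different classifiers. Ensembling means generating predictions from the information in a set of classifiers instead of just one. In practice, this means that we average their predictions.

Generally speaking, the more diverse the models we ensemble, the higher our accuracy will be. Diversity means that the models generate their results from different columns, or use a very different method to generate predictions. Ensembling a random forest with a decision tree probably won't work very well, because they are very similar. On the other hand, ensembling a linear regression with a random forest can work very well.

One caveat with ensembling is that the classifiers we use have to be about the same in terms of accuracy. Ensembling one classifier that is much less accurate than another will probably make the final result worse.

In this case, we will ensemble logistic regression trained on the most linear predictors (the ones that have a linear ordering and some correlation to Survived) and a gradient boosted tree trained on all of the predictors.

We will keep things simple when we ensemble: we will average the raw probabilities (from 0 to 1) that we get from our classifiers, then map anything above 0.5 to 1 and anything below or equal to 0.5 to 0.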

demo:

from sklearn.ensemble import GradientBoostingClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.cross_validation import KFold

import numpy as np

# The algorithms we want to ensemble.

# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.

algorithms = [

    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]],

    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]

]

# Initialize the cross validation folds

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []

for train, test in kf:

    train_target = titanic["Survived"].iloc[train]

    full_test_predictions = []

    # Make predictions for each algorithm on each fold

    for alg, predictors in algorithms:

        # Fit the algorithm on the training data.

        alg.fit(titanic[predictors].iloc[train,:], train_target)

        # Select and predict on the test fold.

        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.

        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]

        full_test_predictions.append(test_predictions)

    # Use a simple ensembling scheme -- just average the predictions to get the final classification.

    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2

    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.

    test_predictions[test_predictions <= .5] = 0

    test_predictions[test_predictions > .5] = 1

    predictions.append(test_predictions)

# Put all the predictions together into one array.

predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.

accuracy = sum(predictions == titanic["Survived"]) / len(predictions)

print(accuracy)
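The averaging and thresholding step, and the accuracy calculation, can be isolated from the cross-validation loop. The sketch below uses made-up probabilities and labels; average_ensemble and accuracy are hypothetical helper names.

```python
def average_ensemble(probability_lists):
    """Average per-row probabilities across classifiers, then threshold at 0.5."""
    averaged = [sum(row) / len(row) for row in zip(*probability_lists)]
    return [1 if p > 0.5 else 0 for p in averaged]

def accuracy(predictions, actual):
    """Fraction of rows where the prediction equals the true label."""
    matches = sum(1 for p, a in zip(predictions, actual) if p == a)
    return matches / len(actual)

boosting_probs = [0.9, 0.2, 0.5, 0.6]   # made-up probabilities from one classifier
logistic_probs = [0.7, 0.4, 0.5, 0.3]   # made-up probabilities from another
predictions = average_ensemble([boosting_probs, logistic_probs])

print(predictions, accuracy(predictions, [1, 0, 1, 0]))
```

Accuracy counts every matching row, including correctly predicted 0s, so a model that predicts "did not survive" correctly gets credit for it.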

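11: Applying our changes to the test set

There is a lot more we could do to improve this analysis, which we will discuss at the end, but for now let's make a submission.

The first step is to apply all the changes we made to the training dataset to the test dataset, as we did in the previous mission. We have read the test set into titanic_test. We need to apply our changes:

Generate NameLength, the length of the name.
Generate FamilySize, the size of the family.
Add the Title column, keeping the same mapping we used before.
Add the FamilyId column, keeping the ids consistent between the training and test sets.

Instructions

Add the NameLength column to titanic_test. Do it the same way we did with the titanic dataframe.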

# First, we'll add titles to the test set.

titles = titanic_test["Name"].apply(get_title)

# We're adding the Dona title to the mapping, because it's in the test set, but not the training set

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}

for k,v in title_mapping.items():

    titles[titles == k] = v

titanic_test["Title"] = titles

# Check the counts of each unique title.

print(pandas.value_counts(titanic_test["Title"]))

# Now, we add the family size column.

titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# Now we can add family ids.

# We'll use the same ids that we did earlier.

print(family_id_mapping)

family_ids = titanic_test.apply(get_family_id, axis=1)

family_ids[titanic_test["FamilySize"] < 3] = -1

titanic_test["FamilyId"] = family_ids

Answer:

# First, we'll add titles to the test set.

titles = titanic_test["Name"].apply(get_title)

# We're adding the Dona title to the mapping, because it's in the test set, but not the training set

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}

for k,v in title_mapping.items():

    titles[titles == k] = v

titanic_test["Title"] = titles

# Check the counts of each unique title.

print(pandas.value_counts(titanic_test["Title"]))

# Now, we add the family size column.

titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# Now we can add family ids.

# We'll use the same ids that we did earlier.

print(family_id_mapping)

family_ids = titanic_test.apply(get_family_id, axis=1)

family_ids[titanic_test["FamilySize"] < 3] = -1

titanic_test["FamilyId"] = family_ids

titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))

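12: Predicting on the test set

We now have some better predictions, so let's create another submission.

Instructions

Turn the predictions into either 0 or 1 by converting those less than or equal to 0.5 to 0 and those greater than 0.5 to 1.

Then convert the predictions to integers with the .astype(int) method. If you don't, Kaggle will give you a score of 0.

Finally, create a submission dataframe whose first column is PassengerId and whose second column is Survived (the predictions).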

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

algorithms = [

    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],

    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]

]

full_predictions = []

for alg, predictors in algorithms:

    # Fit the algorithm using the full training data.

    alg.fit(titanic[predictors], titanic["Survived"])

    # Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.

    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]

    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.

predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4

Answer:

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

algorithms = [

    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],

    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]

]

full_predictions = []

for alg, predictors in algorithms:

    # Fit the algorithm using the full training data.

    alg.fit(titanic[predictors], titanic["Survived"])

    # Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.

    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]

    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.

predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4

predictions[predictions <= .5] = 0

predictions[predictions > .5] = 1

predictions = predictions.astype(int)

submission = pandas.DataFrame({

        "PassengerId": titanic_test["PassengerId"],

        "Survived": predictions

    })
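The weighted blend and the shape of the submission file can be checked on a few rows without pandas. This is a sketch with made-up ids and probabilities; weighted_predictions is a hypothetical helper, and the standard csv module stands in for the dataframe's CSV export.

```python
import csv
import io

def weighted_predictions(prob_lists, weights):
    """Blend per-row probabilities with the given weights, then convert to 0/1 integers."""
    total = sum(weights)
    blended = [sum(w * p for w, p in zip(weights, row)) / total
               for row in zip(*prob_lists)]
    return [int(p > 0.5) for p in blended]

passenger_ids = [892, 893, 894]      # made-up sample ids
boosting_probs = [0.2, 0.7, 0.4]     # made-up probabilities
logistic_probs = [0.6, 0.1, 0.9]

# Weight the gradient boosting classifier three times as heavily as logistic regression.
survived = weighted_predictions([boosting_probs, logistic_probs], weights=[3, 1])

# Write the two-column submission layout to an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["PassengerId", "Survived"])
writer.writerows(zip(passenger_ids, survived))
print(buffer.getvalue())
```

With a 3:1 weighting, the second classifier can only overturn the first when the first is close to 0.5 and the second is confident.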

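13: Final thoughts

We now have a submission! It should get you a score of 0.799 on the leaderboard. You can generate a submission file with submission.to_csv("kaggle.csv", index=False).

There is still a lot more you can do with feature engineering:

Try using features related to the cabins.
See whether family size features help: does the number of women in a family make the whole family more likely to survive?
Does the passenger's national origin help their chances of survival?

There is also a lot more we can do on the algorithm side:

Try the random forest classifier in the ensemble.
A support vector machine might work well with this data.
We could try neural networks.
Boosting with a different base classifier might work better.

Ensembling methods:

Could majority voting be a better ensembling method than averaging probabilities?

This dataset is very easy to overfit, because there is not much data, so you will see only small gains in accuracy. You could also try a different Kaggle competition that has more data and more features to work with.

Hope you enjoyed this tutorial, and good luck in machine learning competitions!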