Cross-validation with sklearn

Model evaluation methods

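Suppose we have a labeled dataset D. How do we choose the best model? A model is judged by how well it performs on new data, that is, by its generalization error. Because new real-world data comes without labels, the generalization error cannot be measured directly. Instead, we set aside a small part of D to test the model's performance and use the error on that part as an approximation of the generalization error.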

There are three ways to do this split: hold-out, cross-validation, and bootstrapping.

Hold-out:

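The idea behind hold-out is simple: split the original data directly into two mutually exclusive subsets. One is used to train the model and the other to test it. The former is the training set and the latter is the test set.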

In sklearn, train_test_split splits data into a training set and a test set. Let's try it on the iris dataset:

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn import datasets

from sklearn import svm

iris = datasets.load_iris()

print(iris.data.shape, iris.target.shape)

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)

print(X_test.shape, y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

print( clf.score(X_test, y_test) )

The output is shown below. test_size accepts a float giving the proportion of the data to put in the test set:

(150, 4) (150,)

(90, 4) (90,)

(60, 4) (60,)

0.966666666667

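Hold-out is very simple, but it has some problems. For example, some models also need hyperparameter tuning, which calls for one more subset, the validation set. The data then ends up split three ways: training set, validation set, and test set. The training set is used to fit the model, the validation set to tune its hyperparameters, and the test set to judge how good the final model is.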
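The three-way split can be sketched with two successive calls to train_test_split. This is only an illustration: the 60/20/20 proportions and the names X_rest and X_val are assumptions made for this example, not something sklearn prescribes.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Step 1: carve off the test set (20% of the data; the ratio is an arbitrary choice).
X_rest, X_test, y_rest, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Step 2: split what remains into training and validation sets.
# 25% of the remaining 80% is 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

The model would then be fitted on X_train, tuned against X_val, and scored once on X_test at the very end.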

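This split still has problems, though. Setting aside a validation set and a test set makes the training set smaller. In addition, the resulting model changes depending on which samples happen to land in the training and validation sets. This is why we introduce cross-validation (CV for short).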

Cross-validation:

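The basic idea of cross-validation is this: split the dataset into k equal parts (folds). Each fold in turn serves as the test set while the remaining k-1 folds serve as the training set, so the model is trained and tested k times.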
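To make the idea concrete, here is a minimal hand-written sketch of the index bookkeeping using only NumPy. The helper k_fold_indices is hypothetical and written just for this illustration; it does no shuffling and is not how sklearn implements its splitters.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    # Cut the index range 0..n_samples-1 into k nearly equal consecutive folds.
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        # The training set is every fold except the i-th one.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(6, 3))
for train_idx, test_idx in splits:
    print('train', train_idx, 'test', test_idx)
```

Every sample ends up in a test set exactly once, which is what lets the k test scores be averaged into a single estimate.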

Splitting datasets with cross-validation

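The classes below implement different splitting strategies, which make it easy to split the data ourselves. They assume the data is independent and identically distributed (i.i.d.).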

KFold: k-fold cross-validation

Splitting the dataset into k equal parts as described above is called k-fold cross-validation. In sklearn it is implemented by KFold:

from sklearn.model_selection import KFold

import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):

    print('train_index', train_index, 'test_index', test_index)

    train_X, train_y = X[train_index], y[train_index]

    test_X, test_y = X[test_index], y[test_index]

The output is:

train_index [2 3] test_index [0 1]

train_index [0 1] test_index [2 3]

With KFold we can easily obtain the training and test sets we need.

RepeatedKFold: k-fold cross-validation repeated p times

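In practice, running k-fold cross-validation once is often not enough, and we repeat it several times with different random splits. The most typical setup is 10 times 10-fold cross-validation. RepeatedKFold controls the number of repetitions through n_repeats.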

from sklearn.model_selection import RepeatedKFold

import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

y = np.array([1, 2, 3, 4])

kf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)

for train_index, test_index in kf.split(X):

    print('train_index', train_index, 'test_index', test_index)

The output is:

train_index [0 1] test_index [2 3]

train_index [2 3] test_index [0 1]

train_index [1 3] test_index [0 2]

train_index [0 2] test_index [1 3]
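As a rough sketch of the 10 times 10-fold setup, the splitter can be handed to cross_val_score (covered later in this article) to collect one score per split. The linear SVC and the iris data are assumptions carried over from the hold-out example, chosen only to keep the snippet self-contained.

```python
from sklearn import datasets, svm
from sklearn.model_selection import RepeatedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# 10 repetitions of 10-fold CV produce 10 * 10 = 100 train/test splits.
rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=rkf)

print(len(scores))
print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation gives a sense of how much the estimate depends on the particular split.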

LeaveOneOut: leave-one-out

Leave-one-out is the special case of k-fold cross-validation where k = n (n is the number of samples in the dataset).

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]

loo = LeaveOneOut()

for train_index, test_index in loo.split(X):

    print('train_index', train_index, 'test_index', test_index)

The output is:

train_index [1 2 3] test_index [0]

train_index [0 2 3] test_index [1]

train_index [0 1 3] test_index [2]

train_index [0 1 2] test_index [3]

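The drawback of leave-one-out is that it becomes expensive when n is large: the model has to be trained n times, each time on a training set of size n-1. For k-fold cross-validation, k = 5 or k = 10 is the recommended choice.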
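The cost difference can be seen by simply counting splits, since each split means one model fit. The sketch below uses a made-up array of 50 samples and also checks that leave-one-out produces the same test folds as KFold with k = n.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(100).reshape(50, 2)  # 50 toy samples, 2 features each

# Number of model fits each strategy would require.
n_loo = LeaveOneOut().get_n_splits(X)
n_kf = KFold(n_splits=5).get_n_splits(X)
print(n_loo, n_kf)

# Leave-one-out is KFold with n_splits equal to the number of samples.
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X)]
kf_tests = [test.tolist() for _, test in KFold(n_splits=len(X)).split(X)]
same_splits = loo_tests == kf_tests
print(same_splits)
```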

LeavePOut: leave-p-out

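The principle is the same as leave-one-out, except that p samples are held out for testing each time. For n samples it produces C(n, p) train/test pairs, one for every way of choosing the p test samples.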

from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]

lpo = LeavePOut(p=2)

for train_index, test_index in lpo.split(X):

    print('train_index', train_index, 'test_index', test_index)

The output is:

train_index [2 3] test_index [0 1]

train_index [1 3] test_index [0 2]

train_index [1 2] test_index [0 3]

train_index [0 3] test_index [1 2]

train_index [0 2] test_index [1 3]

train_index [0 1] test_index [2 3]
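The six splits above are the C(4, 2) = 6 ways of picking 2 test samples out of 4. A small check of that count for a few values of p, on a made-up list of 10 samples (math.comb requires Python 3.8 or later):

```python
from math import comb
from sklearn.model_selection import LeavePOut

X = list(range(10))  # 10 toy samples

counts = {}
for p in (1, 2, 3):
    # get_n_splits reports how many train/test pairs split() would yield.
    counts[p] = LeavePOut(p=p).get_n_splits(X)
    print(p, counts[p], comb(len(X), p))
```

The count grows combinatorially with p, so leave-p-out is usually only practical for small datasets.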

ShuffleSplit: random splits

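ShuffleSplit shuffles the data randomly and then splits it into a training set and a test set. Another advantage is that the random_state seed makes the splits reproducible; if it is not set, the splits differ on every run.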

from sklearn.model_selection import ShuffleSplit

import numpy as np

X = np.arange(5)

ss = ShuffleSplit(n_splits=4, random_state=0, test_size=0.25)

for train_index, test_index in ss.split(X):

    print('train_index', train_index, 'test_index', test_index)

The output is shown below (because random_state is fixed, you should get the same result when you run this code):

train_index [1 3 4] test_index [2 0]

train_index [1 4 3] test_index [0 2]

train_index [4 0 2] test_index [1 3]

train_index [2 4 0] test_index [3 1]

Splitting methods for other special cases

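1: For classification data, the target classes may be imbalanced. In medical data, for example, far fewer people have cancer than do not. The splitters to use in this case are StratifiedKFold and StratifiedShuffleSplit.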

2: Grouped data needs different splitting methods. The main ones are GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, and GroupShuffleSplit.

3: For time-dependent data, there is TimeSeriesSplit.
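The three cases can be sketched on toy data, one splitter per case. The labels and group ids below are invented for illustration (think of the groups as four patients with three samples each); only the splitter classes themselves come from sklearn.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 toy samples

# 1. Imbalanced labels: 9 negatives and 3 positives.
#    StratifiedKFold keeps the class ratio roughly equal in every fold.
y = np.array([0] * 9 + [1] * 3)
skf = StratifiedKFold(n_splits=3)
strat_pos = [int(y[test].sum()) for _, test in skf.split(X, y)]
print('positives per test fold:', strat_pos)

# 2. Grouped data: GroupKFold never puts the same group in both train and test.
groups = np.repeat([0, 1, 2, 3], 3)
gkf = GroupKFold(n_splits=2)
group_overlap = [set(groups[tr]) & set(groups[te])
                 for tr, te in gkf.split(X, y, groups)]
print('groups shared by train and test:', group_overlap)

# 3. Time-ordered data: TimeSeriesSplit always trains on the past
#    and tests on samples that come later.
tss = TimeSeriesSplit(n_splits=3)
time_ok = [bool(tr.max() < te.min()) for tr, te in tss.split(X)]
print('train precedes test:', time_ok)
```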

Evaluating models with cross-validation

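So far we have covered how to split a dataset with cross-validation. To evaluate a model by combining cross-validation with a performance metric, we can use the cross-validation evaluation functions that sklearn provides directly. They are: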

cross_val_score

This function uses cross-validation to compute a model's scores. It is used like this:

from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)

The output is: [ 0.96666667 1. 0.96666667 0.96666667 1. ]

clf is the estimator (the algorithm) we are using.

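cv is the cross-validation generator or iterator; it determines how the data is split. When cv is an integer, (Stratified)KFold is used: StratifiedKFold if the estimator is a classifier and the target is binary or multiclass, and KFold otherwise.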
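A quick way to see the integer shortcut at work is to compare cv=5 with an explicit StratifiedKFold(n_splits=5) for a classifier. This is a sketch under the assumption that the estimator is deterministic, as the linear SVC used here is.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Integer cv: sklearn picks the splitter for us.
scores_int = cross_val_score(clf, iris.data, iris.target, cv=5)

# Explicit splitter: what the integer resolves to for a classifier.
scores_skf = cross_val_score(clf, iris.data, iris.target,
                             cv=StratifiedKFold(n_splits=5))

print(np.allclose(scores_int, scores_skf))
```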

You can also pass your own cv, as shown below:

from sklearn.model_selection import ShuffleSplit

my_cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

scores = cross_val_score(clf, iris.data, iris.target, cv=my_cv)

Another parameter, scoring, determines how the score is computed.

For example, if we use scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')

the result looks like this: [ 0.96658312 1. 0.96658312 0.96658312 1. ]

cross_validate

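cross_validate differs from cross_val_score in two ways. First, it accepts multiple evaluation metrics, which can be passed either as a list or as a dict. Second, it returns a dict rather than an array of scores. With a single metric, the keys are 'fit_time', 'score_time' and 'test_score'; 'train_score' is included as well when return_train_score=True (recent versions of sklearn default to False).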

Here is a demonstration. When scoring is given as a list:

from sklearn.model_selection import cross_validate

from sklearn.svm import SVC

from sklearn.datasets import load_iris

iris = load_iris()

scoring = ['precision_macro', 'recall_macro']

clf = SVC(kernel='linear', C=1, random_state=0)

scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,cv=5, return_train_score=False)

print(scores.keys())

print(scores['test_recall_macro'])

The result is:

dict_keys(['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro'])

[0.96666667 1.         0.96666667 0.96666667 1.        ]

When scoring is given as a dict:

from sklearn.model_selection import cross_validate

from sklearn.svm import SVC

from sklearn.metrics import make_scorer,recall_score

from sklearn.datasets import load_iris

iris = load_iris()

scoring = {'prec_macro': 'precision_macro', 'rec_macro': make_scorer(recall_score, average='macro')}

clf = SVC(kernel='linear', C=1, random_state=0)

scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=False)

print(scores.keys())

print(scores['test_rec_macro'])

The result is:

dict_keys(['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro'])

[0.96666667 1.         0.96666667 0.96666667 1.        ]

How does cross_validate work, and what does it return?

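Consider the case where only estimator, X and y are passed. With just these three arguments, cross_validate still uses cross-validation to measure the model's performance, scoring with the estimator's own score method. It returns a dict describing how the model performed, with keys 'fit_time', 'score_time' and 'test_score' (plus 'train_score' when return_train_score=True), that is, the training time, the scoring time, the test score and the training score for each split. Two timing measurements and two scores (accuracy, for a classifier) are enough for a simple comparison of model performance.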
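A minimal sketch of this single-metric case follows. It sets return_train_score=True explicitly so that all four keys appear regardless of the installed sklearn version; the linear SVC and cv=5 are assumptions that mirror the earlier examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

iris = load_iris()
clf = SVC(kernel='linear', C=1)

# No scoring argument: the estimator's default score (accuracy for SVC) is used.
scores = cross_validate(clf, iris.data, iris.target, cv=5,
                        return_train_score=True)

# Each value is an array with one entry per split; print the averages.
for key in sorted(scores):
    print(key, scores[key].mean())
```

A test score far below the train score is a hint that the model is overfitting.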

cross_val_predict

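cross_val_predict is called the same way as cross_val_score, but it returns the cross-validated predictions rather than scores. It works like this: for each split it predicts the samples in the test part, until every sample has a prediction. Suppose the data is split into folds [1,2,3,4,5]. It first trains the model on folds [1,2,3,4] and predicts the targets of fold 5, then trains on [1,2,3,5] and predicts fold 4, and so on until every fold has been predicted.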

from sklearn.svm import SVC

from sklearn.datasets import load_iris

from sklearn.model_selection import cross_val_predict

from sklearn import metrics

iris = load_iris()

clf = SVC(kernel='linear', C=1, random_state=0)

predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)

print(predicted)

print(metrics.accuracy_score(iris.target, predicted))

The result is:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1

 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2

 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

 2 2]

0.9733333333333334
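The procedure described for cross_val_predict can be written out as an explicit loop. This is a sketch of the idea rather than sklearn's actual implementation; the variable names manual and library are made up for the comparison, and the 5-fold StratifiedKFold is an arbitrary choice.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
clf = SVC(kernel='linear', C=1)
cv = StratifiedKFold(n_splits=5)

# Fill in one prediction per sample, fold by fold.
manual = np.empty_like(y)
for train_idx, test_idx in cv.split(X, y):
    # Train a fresh copy of the model on the other folds ...
    model = clone(clf).fit(X[train_idx], y[train_idx])
    # ... and predict only the held-out fold.
    manual[test_idx] = model.predict(X[test_idx])

library = cross_val_predict(clf, X, y, cv=cv)
print(np.array_equal(manual, library))
```

Each prediction comes from a model that never saw that sample during training, which is why the predictions can be used for diagnostics such as a confusion matrix.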

Bootstrapping

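Both hold-out, introduced at the beginning, and cross-validation can be used for model evaluation. There is also a third method, bootstrapping. Its basic idea is this: given a dataset D with m samples, we sample from it with replacement m times, obtaining a dataset D' that also contains m samples. D' contains duplicates, and we use it as the training data. The probability that a given sample is never drawn in m draws is (1 - 1/m)^m, which tends to 1/e ≈ 0.368 as m grows, so about 36.8% of the samples in D never appear in D'. These samples, D \ D', are used as the test set. See Section 2.2.3 of Zhou Zhihua's Machine Learning for the full derivation. Bootstrapping is useful when the dataset is very small, and it is also used in ensemble learning.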
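The 1/e figure is easy to check empirically. The sketch below draws one bootstrap sample of indices with NumPy and measures the fraction of samples that were never picked; the sample size m = 10000 and the seed are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
indices = np.arange(m)

# D': draw m indices with replacement (duplicates are expected).
boot = rng.integers(0, m, size=m)

# D \ D': the indices that were never drawn form the test set.
oob = np.setdiff1d(indices, boot)

oob_fraction = len(oob) / m
print(oob_fraction, np.exp(-1))
```

In a real evaluation, the model would be trained on the rows selected by boot and tested on the rows selected by oob.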

References:

Cross-validation: evaluating estimator performance

The roles of the training, validation and test sets in machine learning (机器学习中训练集、验证集和测试集的作用)

Zhou Zhihua, Machine Learning (周志华《机器学习》)