1. A code example: feature selection can reduce overfitting

 This example comes from Chapter 4 of Sebastian Raschka's Python Machine Learning.

# coding=utf-8
'''
We use KNN to show that feature selection may reduce overfitting.
'''
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class SBS():
    '''Sequential Backward Selection: greedily drop one feature at a time
    until only k_features remain, keeping the subset that scores best on
    an internal validation split.'''
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Hold out an internal validation set to score candidate subsets
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
        self.scores_ = [score]
        while dim > self.k_features:
            scores = []
            subsets = []
            # Try every subset obtained by removing exactly one feature
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, X_test, y_test, p)
                scores.append(score)
                subsets.append(p)
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
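Before applying the class to the wine data below, a quick smoke test on a synthetic dataset can verify that it runs; this check, and every name in it, is an illustrative addition rather than part of the original example:

# Hypothetical sanity check for the SBS class above (not from the book)
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)
sbs_demo = SBS(KNeighborsClassifier(n_neighbors=3), k_features=2)
sbs_demo.fit(X_demo, y_demo)
print(sbs_demo.subsets_)   # index tuples shrinking from 6 features down to 2
print(sbs_demo.k_score_)   # validation accuracy of the final 2-feature subset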
import pandas as pd

# Load the UCI Wine dataset (13 features, 3 classes)
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol',
                   'Malic acid',
                   'Ash',
                   'Alcalinity of ash',
                   'Magnesium',
                   'Total phenols',
                   'Flavanoids',
                   'Nonflavanoid phenols',
                   'Proanthocyanins',
                   'Color intensity',
                   'Hue',
                   'OD280/OD315 of diluted wines',
                   'Proline']
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the features: KNN is distance-based, so feature scales matter
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

knn = KNeighborsClassifier(n_neighbors=2)
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# Plot validation accuracy against the number of remaining features
k_feat = [len(k) for k in sbs.subsets_]
plt.figure(figsize=(8, 10))  # figsize must be a tuple
plt.subplot(2, 1, 1)
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
#plt.show()

# Let's see which five features yield such a good performance on the validation set.
# The ninth element of subsets_ is the subset that keeps five of the 13 features.
k5 = list(sbs.subsets_[8])
print(df_wine.columns[1:][k5])
'''
Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')
'''

# Let's evaluate the performance of the KNN classifier on the original test set
knn.fit(X_train_std, y_train)
print("Training Accuracy:", knn.score(X_train_std, y_train))
print("Test Accuracy:", knn.score(X_test_std, y_test))
'''
Training Accuracy: 0.9838709677419355
Test Accuracy: 0.9444444444444444
'''
# We find a slight degree of overfitting when all 13 features are used for training.
knn.fit(X_train_std[:, k5], y_train)
print("Training Accuracy:", knn.score(X_train_std[:, k5], y_train))
print("Test Accuracy:", knn.score(X_test_std[:, k5], y_test))
'''
Training Accuracy: 0.9596774193548387
Test Accuracy: 0.9629629629629629
'''
# We reduced overfitting and the prediction accuracy improved.
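For comparison (this is not in the original example), recent scikit-learn versions ship a built-in backward selector. A minimal sketch, assuming scikit-learn >= 0.24; note it scores subsets by cross-validation rather than SBS's single validation split, so its pick may differ:

# Built-in backward selection (assumes sklearn >= 0.24)
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=2),
                                n_features_to_select=5,
                                direction='backward', cv=5)
sfs.fit(X_train_std, y_train)
print(df_wine.columns[1:][sfs.get_support()])  # may differ from SBS's five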
# Use a random forest to rank feature importance
from sklearn.ensemble import RandomForestClassifier
feat_labels = df_wine.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

plt.subplot(2, 1, 2)
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices], color='lightblue', align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
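The importance scores can also drive selection directly. A minimal sketch using SelectFromModel on the forest fitted above; the 0.1 threshold is illustrative, not tuned:

# Keep only features whose importance exceeds the (illustrative) threshold
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_train_sel = sfm.transform(X_train)
print(X_train_sel.shape[1], 'features kept:', list(feat_labels[sfm.get_support()]))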
