1. A Code Example: Feature Selection Can Reduce Overfitting

 This example comes from Chapter 4 of 《机器学习实战》 (Machine Learning in Action).
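 The script below implements Sequential Backward Selection (SBS): starting from all d features, it greedily removes at each step the single feature whose removal costs the least validation accuracy, until only k_features remain. Recording the validation score at every subset size lets us see where accuracy peaks with fewer features.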

#coding=utf-8
'''
We use KNN to show that feature selection may reduce overfitting.
'''
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class SBS:
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)  # work on a fresh copy of the estimator
        self.k_features = k_features       # target number of features to keep
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Hold out an internal validation set to score candidate feature subsets.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
        self.scores_ = [score]
        # Greedily drop one feature at a time until only k_features remain.
        while dim > self.k_features:
            scores = []
            subsets = []
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, X_test, y_test, p)
                scores.append(score)
                subsets.append(p)
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
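# ---- A minimal sanity-check sketch (not part of the original example) ----
# Exercise the SBS API on scikit-learn's built-in Iris data before the Wine
# walkthrough below; load_iris and the n_neighbors=3 / k_features=2 settings
# are illustrative assumptions, not from the book.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
sbs_demo = SBS(KNeighborsClassifier(n_neighbors=3), k_features=2)
sbs_demo.fit(iris.data, iris.target)
print("Iris demo, selected columns:", sbs_demo.indices_)
print("Iris demo, reduced shape:", sbs_demo.transform(iris.data).shape)  # (150, 2)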
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol',
                   'Malic acid',
                   'Ash',
                   'Alcalinity of ash',
                   'Magnesium',
                   'Total phenols',
                   'Flavanoids',
                   'Nonflavanoid phenols',
                   'Proanthocyanins',
                   'Color intensity',
                   'Hue',
                   'OD280/OD315 of diluted wines',
                   'Proline']

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the features: KNN is distance-based, so feature scales matter.
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

knn = KNeighborsClassifier(n_neighbors=2)
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# Plot validation accuracy against the number of remaining features.
k_feat = [len(k) for k in sbs.subsets_]
plt.figure(figsize=(8, 10))  # figsize must be a tuple
plt.subplot(2,1,1)
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
# plt.show()

# Let's see which five features yield such good performance on the validation set.
# subsets_[8] is the ninth element of subsets_: the subset keeping 5 of the 13 features.
k5 = list(sbs.subsets_[8])
print(df_wine.columns[1:][k5])
'''
Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')
'''

# Let's evaluate the performance of the KNN classifier on the original test set,
# first with all 13 features.
knn.fit(X_train_std, y_train)
print("Training Accuracy:", knn.score(X_train_std, y_train))
print("Test Accuracy:", knn.score(X_test_std, y_test))
'''
Training Accuracy: 0.9838709677419355
Test Accuracy: 0.9444444444444444
'''

# We see a slight degree of overfitting when all 13 features are used for training.
# Retrain using only the five selected features.
knn.fit(X_train_std[:, k5], y_train)
print("Training Accuracy:", knn.score(X_train_std[:, k5], y_train))
print("Test Accuracy:", knn.score(X_test_std[:, k5], y_test))
'''
Training Accuracy: 0.9596774193548387
Test Accuracy: 0.9629629629629629
'''
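# ---- Hypothetical extension (not in the original) ----
# Rather than hard-coding subsets_[8], we can pick the smallest subset that
# reached the top validation score during the backward search:
best_score = max(sbs.scores_)
best_subset = min((s for s, sc in zip(sbs.subsets_, sbs.scores_) if sc == best_score), key=len)
print("Smallest best-scoring subset:", list(df_wine.columns[1:][list(best_subset)]))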
# We reduced overfitting, and the prediction accuracy improved.

# RF: show feature importance
from sklearn.ensemble import RandomForestClassifier
feat_labels = df_wine.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the features ranked by importance.
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

plt.subplot(2, 1, 2)
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices], color='lightblue', align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
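# ---- Hypothetical follow-up (not in the original) ----
# A sketch of turning the fitted forest into a feature selector with
# scikit-learn's SelectFromModel; the 0.1 importance threshold is an
# assumption chosen for illustration.
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_important = sfm.transform(X_train)
print("Features whose importance exceeds 0.1:")
for f in range(X_important.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))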
