Feature Selection Can Reduce Overfitting, and Random Forests (RF) Can Show Feature Importance

1. Code example: feature selection can reduce overfitting

This example comes from Chapter 4 of 机器学习实战 (Machine Learning in Action).

#coding=utf-8

'''

We use KNN to show that feature selection may reduce overfitting

'''

from sklearn.base import clone

from itertools import combinations

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

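#Sequential Backward Selection (SBS): starting from all features, repeatedly drop

#the one feature whose removal gives the best score on a held-out validation split

class SBS():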

          def __init__(self, estimator, k_features, scoring=accuracy_score, test_size=0.25, random_state=1):

                    self.scoring = scoring

                    self.estimator = clone(estimator)

                    self.k_features = k_features

                    self.test_size = test_size

                    self.random_state = random_state

          def fit(self, X, y):

                    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = self.test_size, random_state=self.random_state)

                    dim = X_train.shape[1]

                    self.indices_ = tuple(range(dim))

                    self.subsets_ = [self.indices_]

                    score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)

                    self.scores_ = [score]

                    while dim > self.k_features:

                              scores = []

                              subsets = []

                              for p in combinations(self.indices_, r=dim-1):

                                        score = self._calc_score(X_train, y_train, X_test, y_test, p)

                                        scores.append(score)

                                        subsets.append(p)

                              best = np.argmax(scores)

                              self.indices_ = subsets[best]

                              self.subsets_.append(self.indices_)

                              dim -= 1

                              self.scores_.append(scores[best])

                    self.k_score_ = self.scores_[-1]

                    return self

          def transform(self, X):

                    return X[:, self.indices_]

          def _calc_score(self, X_train, y_train, X_test, y_test, indices):

                    self.estimator.fit(X_train[:, indices], y_train)

                    y_pred = self.estimator.predict(X_test[:, indices])

                    score = self.scoring(y_test, y_pred)

                    return score
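The greedy loop in fit can be easier to follow when it is separated from the estimator. The block below is a minimal sketch, not part of the original example: backward_select and toy_score are hypothetical names, and the toy score stands in for the validation accuracy that _calc_score returns. It shows that each round tries every subset with one feature removed and keeps the best one.

```python
from itertools import combinations


def backward_select(n_features, k_features, score_fn):
    """Greedy sequential backward selection over feature indices.

    score_fn is a hypothetical callable that maps a tuple of feature
    indices to a number (higher is better). In SBS this role is played
    by the validation accuracy of the estimator.
    """
    indices = tuple(range(n_features))
    subsets = [indices]
    scores = [score_fn(indices)]
    while len(indices) > k_features:
        # Each candidate drops exactly one of the remaining features.
        candidates = list(combinations(indices, r=len(indices) - 1))
        candidate_scores = [score_fn(c) for c in candidates]
        # Like np.argmax, pick the first candidate with the highest score.
        best = max(range(len(candidates)), key=candidate_scores.__getitem__)
        indices = candidates[best]
        subsets.append(indices)
        scores.append(candidate_scores[best])
    return subsets, scores


# Toy assumption: features 0 and 2 are informative, the others only add noise.
USEFUL = {0, 2}


def toy_score(subset):
    hits = len(USEFUL & set(subset))
    noise = len(set(subset) - USEFUL)
    return hits - 0.1 * noise


subsets, scores = backward_select(n_features=4, k_features=1, score_fn=toy_score)
for subset, score in zip(subsets, scores):
    print(subset, round(score, 2))
```

In this toy run the score peaks at the two-feature subset and drops again when one more feature is removed, which is the same shape the accuracy curve of SBS can take: fewer features is not always worse, and the best subset is not necessarily the smallest one.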

import pandas as pd

from sklearn.model_selection import train_test_split

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol',

                   'Malic acid',

                   'Ash',

                   'Alcalinity of ash',

                   'Magnesium',

                   'Total phenols',

                   'Flavanoids',

                   'Nonflavanoid phenols',

                   'Proanthocyanins',

                   'Color intensity',

                   'Hue',

                   'OD280/OD315 of diluted wines',

                   'Proline']

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()

X_train_std = stdsc.fit_transform(X_train)

X_test_std = stdsc.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt

knn = KNeighborsClassifier(n_neighbors=2)

sbs = SBS(knn, k_features=1)

sbs.fit(X_train_std, y_train)

k_feat = [len(k) for k in sbs.subsets_]

plt.figure(figsize=(8,10))#must be a tuple

plt.subplot(2,1,1)

plt.plot(k_feat, sbs.scores_, marker='o')

plt.ylim([0.7, 1.1])

plt.ylabel('Accuracy')

plt.xlabel('Number of features')

plt.grid()

#plt.show()

#Let's see which five features give such good performance on the validation set

#The ninth element of subsets_ (index 8) is the subset of 5 features kept out of the original 13

k5 = list(sbs.subsets_[8])

print(df_wine.columns[1:][k5])

'''

Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')

'''

#Let's evaluate the performance of the KNN classifier on the original test set

knn.fit(X_train_std, y_train)

print("Training Accuracy:", knn.score(X_train_std, y_train))

print("Test Accuracy:", knn.score(X_test_std, y_test))

'''

Training Accuracy: 0.9838709677419355

Test Accuracy: 0.9444444444444444

'''

#We find a slight degree of overfitting when all 13 features are used for training

knn.fit(X_train_std[:, k5], y_train)

print("Training Accuracy:", knn.score(X_train_std[:, k5], y_train))

print("Test Accuracy:", knn.score(X_test_std[:, k5], y_test))

'''

Training Accuracy: 0.9596774193548387

Test Accuracy: 0.9629629629629629

'''

#With the five selected features the gap between training and test accuracy closes, and test accuracy rises from about 0.944 to 0.963, i.e. less overfitting
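For comparison, scikit-learn (0.24 and later) ships its own SequentialFeatureSelector. The sketch below is an alternative to the hand-written SBS class, not a reproduction of it: it scores candidate subsets with 5-fold cross-validation on the training data instead of a single hold-out split, so the selected features may differ from the ones printed above. It also assumes load_wine(), scikit-learn's bundled copy of the wine data, in place of the UCI download.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine() replaces the CSV download used in the post.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize so that KNN distances are comparable across features.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=2)

# Backward selection down to 5 features, scored by 5-fold cross-validation.
selector = SequentialFeatureSelector(
    knn, n_features_to_select=5, direction='backward', cv=5)
selector.fit(X_train_std, y_train)

mask = selector.get_support()  # boolean mask over the 13 features
knn.fit(X_train_std[:, mask], y_train)
print("Selected feature indices:", mask.nonzero()[0])
print("Training Accuracy:", knn.score(X_train_std[:, mask], y_train))
print("Test Accuracy:", knn.score(X_test_std[:, mask], y_test))
```

Because the test set is never used during selection, the test accuracy printed here is still an honest estimate for the chosen subset.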

#Use a random forest (RF) to show feature importance

from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

forest.fit(X_train, y_train)

importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):

          print("%2d) %-*s %f" % (f+1, 30,  feat_labels[indices[f]], importances[indices[f]]))

plt.subplot(2,1,2)

plt.title("Feature Importances")

plt.bar(range(X_train.shape[1]), importances[indices], color='lightblue', align='center')

plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)

plt.xlim([-1, X_train.shape[1]])

plt.tight_layout()

plt.show()
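The importances can also drive feature selection directly. The following is a minimal sketch under stated assumptions rather than part of the original example: it uses scikit-learn's SelectFromModel with threshold="mean" (keep the features whose importance is at least the average), a smaller forest of 200 trees to keep it fast, and load_wine() instead of the UCI download. The feature_importances_ values used here are the impurity-based importances of the forest, normalized to sum to 1.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: load_wine() replaces the CSV download used in the post.
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

# Trees do not depend on feature scale, so the raw features are used here.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Keep every feature whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean")
selector.fit(X_train, y_train)

X_train_sel = selector.transform(X_train)
kept = [name for name, keep
        in zip(data.feature_names, selector.get_support()) if keep]
print("Kept", X_train_sel.shape[1], "of", X_train.shape[1], "features:", kept)
```

The threshold is a design choice: a fixed value or "median" are also accepted by SelectFromModel, and a stricter threshold keeps fewer features.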
