Boruta Feature Selection

Official GitHub repository: https://github.com/scikit-learn-contrib/boruta_py?tab=readme-ov-file

Paper: https://www.jstatsoft.org/article/view/v036i11

Official example code:

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

from boruta import BorutaPy

# load X and y

# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute

X = pd.read_csv('examples/test_X.csv', index_col=0).values

y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values

y = y.ravel()

# define random forest classifier, with utilising all cores and

# sampling in proportion to y labels

rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# define Boruta feature selection method

feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

# find all relevant features - 5 features should be selected

feat_selector.fit(X, y)

# check selected features - first 5 features are selected

feat_selector.support_

# check ranking of features

feat_selector.ranking_

# call transform() on X to filter it down to selected features

X_filtered = feat_selector.transform(X)

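Running this locally failed with the error: AttributeError: module 'numpy' has no attribute 'int'. np.int was a deprecated alias for the builtin int. The np.int alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so recent NumPy versions no longer provide it. I tried downgrading NumPy, but the installation failed with a wheel build error. Judging by the issues on GitHub, many people have hit the same problem, and the workaround is to add three lines before calling boruta = BorutaPy(estimator=rf):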

import numpy as np

# restore the aliases removed from NumPy so that BorutaPy can still find them

np.int = np.int32

np.float = np.float64

np.bool = np.bool_

boruta = BorutaPy(estimator=rf)

boruta.fit(X, y)
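A slightly more defensive variant of the same workaround is sketched below. It is an illustration rather than part of the official fix: it only adds an alias when the installed NumPy does not already define it, and it maps each name to the Python builtin, which is what the old aliases pointed to. The loop and the choice of builtins are my own assumptions; the three-line version above works as well.

```python
import numpy as np

# Add np.int / np.float / np.bool only when the installed NumPy lacks them.
# Checking np.__dict__ avoids triggering NumPy's deprecation machinery.
# Assumption: mapping to the builtins, which is what the old aliases were.
for name, builtin in (("int", int), ("float", float), ("bool", bool)):
    if name not in np.__dict__:
        setattr(np, name, builtin)

# After the shim, code that still uses the old aliases can run again.
counts = np.zeros(3, dtype=np.int)
flags = np.zeros(3, dtype=np.bool)
print(counts.dtype, flags.dtype)
```

Run this once, before importing or calling BorutaPy, so the aliases are in place when the library looks them up.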

Below is the code after my changes, adapted to my own needs:

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

from boruta import BorutaPy

import numpy as np

file_names_to_add = ['xxx', 'xxxx']

file_path2 = '../xxxx'

for file_name in file_names_to_add:

    input_file_path = f"{file_path2}{file_name}.xlsx"

    print(input_file_path) 

    sheet_name_nor = 'xxx'

    y_tos = ['xxx', '...']

    for y_to in y_tos:

        sheet_name_uni = y_to

        print(sheet_name_uni)

        df = pd.read_excel(input_file_path, sheet_name=sheet_name_nor)

        cols_to_pre = ['xxxxxxx', 'xxxxxx','...']

        missing_cols = [col for col in cols_to_pre if col not in df.columns]

        if missing_cols:

            print(f"{missing_cols} not found in the data, skipping.")

            cols_to_pre = [col for col in cols_to_pre if col in df.columns]

        # load X and y

        # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute

        X = df[cols_to_pre].values

        y = df[y_to].values

        np.int = np.int32

        np.float = np.float64

        np.bool = np.bool_

        # define random forest classifier, with utilising all cores and

        # sampling in proportion to y labels

        rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

        # define Boruta feature selection method

        feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

        # find all relevant features

        feat_selector.fit(X, y)

        # # check selected features - first 5 features are selected

        # feat_selector.support_

        # # check ranking of features

        # feat_selector.ranking_

        # call transform() on X to filter it down to selected features

        # X_filtered = feat_selector.transform(X)

        selected_features = [cols_to_pre[i] for i, support in enumerate(feat_selector.support_) if support]

        print('Selected features: ', selected_features)

        print('Feature ranking: ', feat_selector.ranking_)

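Because feat_selector.support_ returns a boolean array, printing it directly only shows True/False values, not the names of the selected features. To get the names, use the boolean array as a mask over the column list: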

selected_features = [cols_to_pre[i] for i, support in enumerate(feat_selector.support_) if support]

The line above iterates over the cols_to_pre list and keeps only the columns whose entry in feat_selector.support_ is True.
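The same selection can also be written with NumPy boolean indexing, and the ranking can be paired with the column names so it is easier to read. The sketch below uses made-up stand-ins for cols_to_pre, feat_selector.support_ and feat_selector.ranking_ (the column names and values are hypothetical), so it runs without fitting a model. As far as I understand the BorutaPy documentation, rank 1 marks the selected features.

```python
import numpy as np

# Hypothetical stand-ins for cols_to_pre, feat_selector.support_
# and feat_selector.ranking_ (values invented for illustration).
cols_to_pre = ["age", "height", "weight", "noise"]
support = np.array([True, False, True, False])
ranking = np.array([1, 2, 1, 3])

# Boolean indexing: keep only the names where the mask is True.
selected_features = np.asarray(cols_to_pre)[support].tolist()
print("Selected features:", selected_features)

# Pair each column name with its rank and sort by rank.
ranked = sorted(zip(cols_to_pre, ranking.tolist()), key=lambda pair: pair[1])
for name, rank in ranked:
    print(f"{rank}\t{name}")
```

In the real script, replace support and ranking with feat_selector.support_ and feat_selector.ranking_ after calling fit.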
