吴裕雄--天生自然 Python Data Analysis: Wisconsin Breast Cancer (Diagnostic) Data Analysis (Continued, Part 1)

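# Continues from the previous part: x (feature DataFrame) and y (diagnosis labels) are assumed to be defined there

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

drop_list1 = ['perimeter_mean','radius_mean','compactness_mean','concave points_mean','radius_se','perimeter_se','radius_worst','perimeter_worst','compactness_worst','concave points_worst','compactness_se','concave points_se','texture_worst','area_worst']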

x_1 = x.drop(drop_list1,axis = 1 )        # do not modify x, we will use it later

x_1.head()

#correlation map

f,ax = plt.subplots(figsize=(14, 14))

sns.heatmap(x_1.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
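
The drop list above appears to remove features that are strongly correlated with features that are kept, which is normally judged by eye from a heatmap like this one. As a rough sketch, the same check can be done programmatically. The helper `correlated_pairs`, its 0.9 threshold, and the toy data are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for every column pair with |r| >= threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if r >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data (hypothetical): the perimeter is almost a linear function of the radius
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.0, size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    "texture_mean": rng.normal(19.0, 4.0, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, `correlated_pairs(x)` would list candidate pairs; which member of each pair to drop remains a judgment call.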

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import f1_score,confusion_matrix

from sklearn.metrics import accuracy_score

# split data train 70 % and test 30 %

x_train, x_test, y_train, y_test = train_test_split(x_1, y, test_size=0.3, random_state=42)

# random forest classifier with the default n_estimators (10 before scikit-learn 0.22, 100 since)

clf_rf = RandomForestClassifier(random_state=43)

clr_rf = clf_rf.fit(x_train,y_train)

ac = accuracy_score(y_test,clf_rf.predict(x_test))

print('Accuracy is: ',ac)

cm = confusion_matrix(y_test,clf_rf.predict(x_test))

sns.heatmap(cm,annot=True,fmt="d")
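
`f1_score` is imported above but never used. The following sketch shows how the cells of a binary confusion matrix relate to precision, recall and F1. The label lists are made up for illustration, and treating 'M' (malignant) as the positive class is an assumption about how `y` is encoded.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 'M' = malignant (positive class), 'B' = benign
y_true = ['M', 'B', 'B', 'M', 'B', 'M', 'B', 'B']
y_pred = ['M', 'B', 'M', 'M', 'B', 'B', 'B', 'B']

# With labels=['B', 'M'] the matrix is [[tn, fp], [fn, tp]]
cm_demo = confusion_matrix(y_true, y_pred, labels=['B', 'M'])
tn, fp, fn, tp = cm_demo.ravel()

precision = precision_score(y_true, y_pred, pos_label='M')  # tp / (tp + fp)
recall = recall_score(y_true, y_pred, pos_label='M')        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred, pos_label='M')                # harmonic mean of the two

print('tn, fp, fn, tp =', tn, fp, fn, tp)
print('precision, recall, f1 =', precision, recall, f1)
```

For a diagnostic task, recall on the malignant class (how many malignant cases are caught) is often at least as informative as overall accuracy.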

from sklearn.feature_selection import SelectKBest

from sklearn.feature_selection import chi2

# find best scored 5 features

select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)

print('Score list:', select_feature.scores_)

print('Feature list:', x_train.columns)

x_train_2 = select_feature.transform(x_train)

x_test_2 = select_feature.transform(x_test)
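
Printing the score list and the feature list separately makes it hard to see which features were actually kept. A minimal sketch of pairing scores with column names and reading the selection mask from `get_support()` is shown below. The synthetic count data and the column names are assumptions for illustration only. Note that `chi2` requires non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data (hypothetical): one column depends on the label
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
X_demo = pd.DataFrame({
    "informative": rng.poisson(lam=np.where(labels == 1, 20, 5)),
    "noise_a": rng.poisson(10, size=300),
    "noise_b": rng.poisson(10, size=300),
})

selector = SelectKBest(chi2, k=1).fit(X_demo, labels)

# Pair each chi2 score with its column name and sort from best to worst
ranked = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
# get_support() returns a boolean mask over the input columns
selected = list(X_demo.columns[selector.get_support()])

print(ranked)
print('Selected:', selected)
```

Applied to the code above, `x_train.columns[select_feature.get_support()]` would give the names of the 5 chosen features.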

# random forest classifier with the default n_estimators (10 before scikit-learn 0.22, 100 since)

clf_rf_2 = RandomForestClassifier()

clr_rf_2 = clf_rf_2.fit(x_train_2,y_train)

ac_2 = accuracy_score(y_test,clf_rf_2.predict(x_test_2))

print('Accuracy is: ',ac_2)

cm_2 = confusion_matrix(y_test,clf_rf_2.predict(x_test_2))

sns.heatmap(cm_2,annot=True,fmt="d")

from sklearn.feature_selection import RFE

# Create the RFE object and rank each feature

clf_rf_3 = RandomForestClassifier()

rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)

rfe = rfe.fit(x_train, y_train)

print('Chosen best 5 feature by rfe:',x_train.columns[rfe.support_])

from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications

clf_rf_4 = RandomForestClassifier()

rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation

rfecv = rfecv.fit(x_train, y_train)

print('Optimal number of features :', rfecv.n_features_)

print('Best features :', x_train.columns[rfecv.support_])

# Plot number of features VS. cross-validation scores

import matplotlib.pyplot as plt

plt.figure()

plt.xlabel("Number of features selected")

plt.ylabel("Cross validation score of number of selected features")

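# grid_scores_ was removed in scikit-learn 1.2; cv_results_['mean_test_score'] holds the mean CV score per feature count

mean_scores = rfecv.cv_results_['mean_test_score']

plt.plot(range(1, len(mean_scores) + 1), mean_scores)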

plt.show()

clf_rf_5 = RandomForestClassifier()

clr_rf_5 = clf_rf_5.fit(x_train,y_train)

importances = clr_rf_5.feature_importances_

# use the trees of the same forest whose importances are plotted (clf_rf_5, not clf_rf)

std = np.std([tree.feature_importances_ for tree in clf_rf_5.estimators_], axis=0)

indices = np.argsort(importances)[::-1]

# Print the feature ranking

print("Feature ranking:")

for f in range(x_train.shape[1]):

    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
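
The loop above prints column indices, which then have to be looked up by hand. As a small sketch, the importances can be wrapped in a pandas Series indexed by feature name so the ranking is readable directly. The synthetic data and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): only the "signal" column determines the label
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
y_demo = (X_demo["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Index the importances by column name and sort from most to least important
ranking = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values(ascending=False)

for rank, (name, score) in enumerate(ranking.items(), start=1):
    print(f"{rank}. {name} ({score:.4f})")
```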

# Plot the feature importances of the forest

plt.figure(1, figsize=(14, 13))

plt.title("Feature importances")

plt.bar(range(x_train.shape[1]), importances[indices],

       color="g", yerr=std[indices], align="center")

plt.xticks(range(x_train.shape[1]), x_train.columns[indices],rotation=90)

plt.xlim([-1, x_train.shape[1]])

plt.show()

# split data train 70 % and test 30 %

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

#normalization

x_train_N = (x_train-x_train.mean())/(x_train.max()-x_train.min())

# scale the test set with the training-set statistics so both splits share one scale

x_test_N = (x_test-x_train.mean())/(x_train.max()-x_train.min())

from sklearn.decomposition import PCA

pca = PCA()

pca.fit(x_train_N)

plt.figure(1, figsize=(14, 13))

plt.clf()

plt.axes([.2, .2, .7, .7])

plt.plot(pca.explained_variance_ratio_, linewidth=2)

plt.axis('tight')

plt.xlabel('n_components')

plt.ylabel('explained_variance_ratio_')
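
The plot above is read by eye to find the elbow. One possible alternative, sketched here, is to pick the smallest number of components whose cumulative explained variance reaches a target. The 0.95 target and the synthetic data are assumptions for illustration, not values taken from the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): 6 observed features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# First position where the cumulative ratio reaches the target, converted to a component count
target = 0.95
n_components = int(np.searchsorted(cumulative, target) + 1)

print('Cumulative explained variance:', cumulative)
print('Components needed for', target, ':', n_components)
```

scikit-learn also accepts a float, as in `PCA(n_components=0.95)`, which selects the component count by the same cumulative-variance criterion.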
