Examples of Scikit-learn Usages
Examples of Machine Learning Toolkit Usage
Scikit-learn
KFold: K-fold cross-validation
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
Reference : http://scikit-learn.org/stable/modules/cross_validation.html#k-fold
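The example above splits a short list, where the order of the rows does not matter. As a minimal sketch (the Iris dataset and the `shuffle=True, random_state=0` settings are assumptions chosen for illustration), the following shows why shuffling can matter when the rows are sorted by label:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

iris = load_iris()

# Without shuffling, each test fold is one contiguous block of rows.
# The Iris rows are sorted by class, so every test fold holds a single class.
plain = KFold(n_splits=3)
plain_classes = [set(iris.target[test].tolist())
                 for _, test in plain.split(iris.data)]

# With shuffling, the rows are permuted before being cut into folds.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_classes = [set(iris.target[test].tolist())
                    for _, test in shuffled.split(iris.data)]

print(plain_classes)
print(shuffled_classes)
```

For classification problems, `StratifiedKFold` is another option that keeps the class proportions roughly equal in every fold.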
Decision Tree Classification
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
Reference : http://scikit-learn.org/stable/modules/tree.html#classification
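KNN: k-nearest neighbors
A Chinese idiom helps build intuition for this algorithm: "Stay near vermilion and you turn red; stay near ink and you turn black." In other words, a sample is likely to share the label of its nearest neighbors.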
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
knc.fit(X_train, y_train)
y_pred = knc.predict(X_test)
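Logistic Regression
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
>>> # multi_class is deprecated in newer scikit-learn releases (1.5+); drop the argument there.
>>> model = LogisticRegression(penalty='l2', random_state=0, solver='newton-cg', multi_class='multinomial')
>>> model = model.fit(x_train, y_train)
>>> y_pred = model.predict(x_test)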
Leave One Out
>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Reference : http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo
train_test_split: random splitting
Randomly splits arrays or matrices into a training set and a test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
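Parameter test_size
If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
If int, it represents the absolute number of test samples.
If None, the value is set to the complement of the train size.
Parameter random_state
If int, random_state is the seed used by the random number generator;
If a RandomState instance, random_state is the random number generator itself;
If None, the random number generator is the RandomState instance used by np.random.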
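A short sketch of how these two parameters behave, using the same Iris split as above. The value `test_size=30` and the throwaway `_` variables are assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()  # 150 samples

# Float test_size: a fraction of the dataset goes to the test split.
_, x_test_frac, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

# Int test_size: an absolute number of test samples.
_, x_test_abs, _, _ = train_test_split(
    iris.data, iris.target, test_size=30, random_state=33)

# Same integer seed: the split is reproduced exactly.
_, x_test_again, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

print(len(x_test_frac), len(x_test_abs))
print((x_test_frac == x_test_again).all())
```

Fixing `random_state` is what makes the accuracy figures in the later sections repeatable from run to run.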
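StandardScaler: feature standardization
Standardizes the features so that each dimension has zero mean and unit variance. This keeps the prediction from being dominated by features with much larger magnitudes.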
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
Reference: 《Python机器学习及实践》 https://book.douban.com/subject/26886337
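To make the zero-mean, unit-variance claim concrete, here is a minimal sketch on hypothetical toy data (the arrays below are made up for illustration). It also shows that the test set is transformed with the statistics learned from the training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second column is on a much larger scale.
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test = np.array([[2.0, 4000.0]])

ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_std = ss.transform(X_test)        # reuses the training statistics

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

Calling `transform` rather than `fit_transform` on the test set keeps test-set information out of the preprocessing step.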
In practice
StandardScaler does not perform well on the Iris dataset. Without StandardScaler preprocessing, the results are:
accuracy 0.947368
avg precision 0.96
avg recall 0.95
f1-score 0.95
The code is as follows:
# -*- encoding=utf8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
if __name__ == '__main__':
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    knc = KNeighborsClassifier()
    knc.fit(X_train, y_train)
    y_pred = knc.predict(X_test)
    print("accuracy is %f" % (knc.score(X_test, y_test)))
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
After applying StandardScaler, all four metrics actually drop, as shown below:
accuracy 0.894737
avg precision 0.92
avg recall 0.89
f1-score 0.90
The code with StandardScaler is as follows:
# -*- encoding=utf8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
if __name__ == '__main__':
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    # Standardize the features so that each dimension has zero mean and unit variance.
    # This keeps the prediction from being dominated by features with much larger magnitudes.
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    knc = KNeighborsClassifier()
    knc.fit(X_train, y_train)
    y_pred = knc.predict(X_test)
    print("accuracy is %f" % (knc.score(X_test, y_test)))
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
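This is a puzzling result that needs further investigation. Note that the test set has only 38 samples, so the gap amounts to two additional misclassified samples (34 correct instead of 36) on a single split.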
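One possible way to look into this, offered as a suggestion rather than as part of the original experiment: a single 38-sample test split is noisy, so the two setups can be compared with cross-validation instead. The sketch below assumes 5 folds and wraps the scaler in a pipeline so that it is fitted only on the training part of each fold:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Plain KNN versus KNN with standardization fitted inside each fold.
raw_knn = KNeighborsClassifier()
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_scores = cross_val_score(raw_knn, iris.data, iris.target, cv=5)
scaled_scores = cross_val_score(scaled_knn, iris.data, iris.target, cv=5)

print("raw:    mean %.3f, std %.3f" % (raw_scores.mean(), raw_scores.std()))
print("scaled: mean %.3f, std %.3f" % (scaled_scores.mean(), scaled_scores.std()))
```

If the two means differ by less than the spread across folds, the drop seen on the single split is probably not meaningful.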
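shuffle: random shuffling
This function shuffles several arrays with the same random permutation, so corresponding elements (for example, samples and their labels) stay aligned.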
from sklearn.utils import shuffle
x = [1,2,3,4]
y = [1,2,3,4]
x,y = shuffle(x,y)
Out (one possible result, since the order is random):
x : [1,4,3,2]
y : [1,4,3,2]
Reference : http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
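In the example above `x` and `y` are identical, which makes the alignment hard to see. A small sketch with hypothetical data, where each label is ten times its sample, makes it visible. The `random_state=0` argument is an assumption added for reproducibility:

```python
import numpy as np
from sklearn.utils import shuffle

x = np.arange(10)   # hypothetical samples 0..9
y = x * 10          # label i * 10 belongs to sample i

# random_state makes the permutation reproducible.
x_s, y_s = shuffle(x, y, random_state=0)

print(x_s)
print(y_s)
```

After shuffling, every entry of `y_s` is still ten times the entry of `x_s` at the same position.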
Classification Report
Precision, recall and F1-score.
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred, target_names=iris.target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.79      1.00      0.88        11
   virginica       1.00      0.84      0.91        19

    accuracy                           0.92        38
   macro avg       0.93      0.95      0.93        38
weighted avg       0.94      0.92      0.92        38
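The columns of the report follow the usual definitions. As a small worked sketch on hypothetical binary labels (made up for illustration, not taken from the Iris run), the numbers for one class can be computed directly:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical binary labels: for class 1, TP = 2, FP = 1, FN = 1.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[1])

# precision = TP / (TP + FP), recall = TP / (TP + FN),
# f1 = 2 * precision * recall / (precision + recall)
print(precision[0], recall[0], f1[0], support[0])
```

The `support` column is simply the number of true samples of each class, which is why the supports in the report sum to the size of the test set.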
XGBoost
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
if __name__ == '__main__':
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)
    xgb = XGBClassifier()
    xgb.fit(x_train, y_train)
    y_pred = xgb.predict(x_test)
    print(classification_report(y_test, y_pred))
Experimental results
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        14
          1       0.93      1.00      0.97        14
          2       1.00      0.90      0.95        10

avg / total       0.98      0.97      0.97        38