Examples of Machine Learning Toolkit Usage

Scikit-learn

KFold: K-fold cross-validation

>>> import numpy as np

>>> from sklearn.model_selection import KFold

>>> X = ["a", "b", "c", "d"]

>>> kf = KFold(n_splits=2)

>>> for train, test in kf.split(X):

...     print("%s %s" % (train, test))

[2 3] [0 1]

[0 1] [2 3]

Reference : http://scikit-learn.org/stable/modules/cross_validation.html#k-fold
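
By default KFold splits the samples in their original order. A minimal sketch, assuming the samples should be shuffled before splitting; the seed value 0 is an arbitrary choice made for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(["a", "b", "c", "d"])

# shuffle=True permutes the indices before they are divided into folds;
# random_state makes that permutation reproducible.
kf = KFold(n_splits=2, shuffle=True, random_state=0)

folds = [(train, test) for train, test in kf.split(X)]
for train, test in folds:
    # Each index appears in exactly one test fold.
    print("train:", X[train], "test:", X[test])
```

Whatever the seed, every sample is used for testing exactly once across the folds.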

Decision Tree Classification

>>> from sklearn import tree

>>> X = [[0, 0], [1, 1]]

>>> Y = [0, 1]

>>> clf = tree.DecisionTreeClassifier()

>>> clf = clf.fit(X, Y)

>>> clf.predict([[2., 2.]])

array([1])

Reference : http://scikit-learn.org/stable/modules/tree.html#classification
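
Besides the predicted label, a fitted tree can report class probabilities and a text view of its rules. A minimal sketch on the same two-point data set; the feature names "x0" and "x1" and the fixed random_state are assumptions added for illustration:

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]

# random_state fixes which feature is chosen when several splits tie.
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, Y)

# One row per sample, one column per class in clf.classes_.
proba = clf.predict_proba([[2.0, 2.0]])
print(clf.classes_)
print(proba)

# Plain-text rendering of the learned decision rules.
rules = tree.export_text(clf, feature_names=["x0", "x1"])
print(rules)
```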

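KNN (k-nearest neighbors)

A Chinese idiom helps build intuition for this algorithm: "one who stays near vermilion turns red, one who stays near ink turns black". A sample is assigned the label that is most common among its nearest neighbors.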

from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier()

knc.fit(X_train, y_train)

y_pred = knc.predict(X_test)
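
The snippet above assumes X_train, y_train and X_test already exist. A self-contained sketch on a made-up toy data set (the points and n_neighbors=3 are assumptions chosen for illustration) shows the neighbor vote directly:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters: class 0 near the origin, class 1 near (5, 5).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

# Each query point takes the majority label of its 3 nearest training points.
knc = KNeighborsClassifier(n_neighbors=3)
knc.fit(X_train, y_train)

y_pred = knc.predict([[0.5, 0.5], [5.5, 5.5]])
print(y_pred)
```

The first query point sits inside the class 0 cluster and the second inside the class 1 cluster, so each inherits the label of its surroundings.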

Logistic Regression

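>>> from sklearn.datasets import load_iris

>>> from sklearn.linear_model import LogisticRegression

>>> from sklearn.model_selection import train_test_split

>>> iris = load_iris()

>>> x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

>>> # multi_class is omitted: it is deprecated in recent releases, and multinomial is already the default for this solver with 3 classes

>>> model = LogisticRegression(penalty='l2', random_state=0, solver='newton-cg')

>>> model = model.fit(x_train, y_train)

>>> y_pred = model.predict(x_test)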

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

Leave One Out

>>> from sklearn.model_selection import LeaveOneOut

>>> X = [1, 2, 3, 4]

>>> loo = LeaveOneOut()

>>> for train, test in loo.split(X):

...     print("%s %s" % (train, test))

[1 2 3] [0]

[0 2 3] [1]

[0 1 3] [2]

[0 1 2] [3]

Reference : http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo
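
A LeaveOneOut splitter can be passed wherever scikit-learn accepts a cv argument. A minimal sketch, assuming the Iris data and a default KNeighborsClassifier as the estimator (both are choices made for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# One fit per sample: train on all samples but one, test on the one held out.
scores = cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv=LeaveOneOut())

# Each score is 0.0 or 1.0, since every test fold holds a single sample.
print(len(scores), scores.mean())
```

Because the model is refitted once per sample, this approach gets expensive on large data sets.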

train_test_split: random split

Randomly splits arrays or matrices into a training set and a test set.

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

iris = load_iris()

x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

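Parameter test_size

If float, it should be between 0 and 1 and represents the proportion of the dataset to include in the test split.

If int, it represents the absolute number of test samples.

If None, the value is set to the complement of the train size (0.25 when train_size is also None).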
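
The float and int forms of test_size can be compared on the Iris data. A minimal sketch; the int value 30 is an arbitrary choice made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()  # 150 samples

# float: fraction of the samples placed in the test set (rounded up).
_, x_test_frac, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

# int: absolute number of samples placed in the test set.
_, x_test_abs, _, _ = train_test_split(
    iris.data, iris.target, test_size=30, random_state=33)

print(len(x_test_frac), len(x_test_abs))  # 38 30
```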

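Parameter random_state

If int, random_state is the seed used by the random number generator.

If a RandomState instance, random_state is the random number generator itself.

If None, the random number generator is the RandomState instance used by np.random.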
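
Passing the same integer seed twice yields the same split, which is what makes experiments repeatable. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Two calls with the same integer seed.
split_a = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
split_b = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

# Compare x_train, x_test, y_train and y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```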

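StandardScaler: feature standardization

Standardizes the features so that each dimension has zero mean and unit variance. This keeps the predictions from being dominated by features whose values are much larger than the others.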

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)

X_test = ss.transform(X_test)

Reference: 《Python机器学习及实践》 https://book.douban.com/subject/26886337
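
The effect of the scaler can be checked on a small array. A minimal sketch with made-up numbers (the arrays below are assumptions chosen so the two columns differ in scale by a factor of 100):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

ss = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from X_train.
X_train_scaled = ss.fit_transform(X_train)
# transform reuses the training statistics; the test set is never used for fitting.
X_test_scaled = ss.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
print(X_train_scaled.std(axis=0))   # approximately [1, 1]
print(X_test_scaled)
```

Test values are not guaranteed to have zero mean or unit variance, because they are scaled with the statistics of the training set.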

In practice

StandardScaler does not help on the Iris dataset. Without StandardScaler preprocessing, the results are:

accuracy 0.947368

avg precision 0.96

avg recall 0.95

f1-score 0.95

The code is as follows:

# -*- encoding=utf8 -*-

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report

if __name__ == '__main__':

    iris = load_iris()

    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

    knc = KNeighborsClassifier()

    knc.fit(X_train, y_train)

    y_pred = knc.predict(X_test)

    print("accuracy is %f" % (knc.score(X_test, y_test)))

    print(classification_report(y_test, y_pred, target_names=iris.target_names))

With StandardScaler applied, all four metrics actually drop:

accuracy 0.894737

avg precision 0.92

avg recall 0.89

f1-score 0.90

The code that uses StandardScaler is as follows:

# -*- encoding=utf8 -*-

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report

from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':

    iris = load_iris()

    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

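    # Standardize the features so that each dimension has zero mean and unit variance.

    # This keeps predictions from being dominated by features with much larger values.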

    ss = StandardScaler()

    X_train = ss.fit_transform(X_train)

    X_test = ss.transform(X_test)

    knc = KNeighborsClassifier()

    knc.fit(X_train, y_train)

    y_pred = knc.predict(X_test)

    print("accuracy is %f" % (knc.score(X_test, y_test)))

    print(classification_report(y_test, y_pred, target_names=iris.target_names))

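This is a curious result that needs further investigation. Note that the test set holds only 38 samples, so the gap between 0.947368 (36/38) and 0.894737 (34/38) comes down to two predictions on a single split.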
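
One way to look into this further is to compare the two setups over several splits instead of one. A minimal sketch, assuming 5-fold stratified cross-validation with an arbitrary seed; the pipeline and the StratifiedKFold setup are illustrations, not part of the original experiment:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)

plain = KNeighborsClassifier()
# The pipeline refits the scaler on the training part of every fold.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

scores_plain = cross_val_score(plain, iris.data, iris.target, cv=cv)
scores_scaled = cross_val_score(scaled, iris.data, iris.target, cv=cv)

print("without scaling: %.3f +/- %.3f" % (scores_plain.mean(), scores_plain.std()))
print("with scaling:    %.3f +/- %.3f" % (scores_scaled.mean(), scores_scaled.std()))
```

If the two means lie within each other's spread, the difference seen on the single split is likely not significant.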

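shuffle: random shuffling

This function randomly shuffles several arrays with the same permutation, so corresponding elements (for example, samples and their labels) stay aligned.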

from sklearn.utils import shuffle

x = [1,2,3,4]

y = [1,2,3,4]

x,y = shuffle(x,y)

Out (one possible result):

x : [1,4,3,2]

y : [1,4,3,2]

Reference : http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
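
When x and y hold identical values, the alignment is hard to see. A minimal sketch with distinct made-up values; the random_state seed is an assumption added so the run is reproducible:

```python
from sklearn.utils import shuffle

x = [10, 20, 30, 40]
y = ["a", "b", "c", "d"]

# Both lists are permuted with the same random order.
x_s, y_s = shuffle(x, y, random_state=0)
print(x_s, y_s)

# Sorting the shuffled pairs recovers the original pairing.
pairs = sorted(zip(x_s, y_s))
print(pairs)
```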

Classification Report

Precision, recall and F1-score.

>>> from sklearn.metrics import classification_report

>>> print(classification_report(y_test, y_pred, target_names=iris.target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8

  versicolor       0.79      1.00      0.88        11

   virginica       1.00      0.84      0.91        19

    accuracy                           0.92        38

   macro avg       0.93      0.95      0.93        38

weighted avg       0.94      0.92      0.92        38

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
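
The versicolor row of the report above can be reproduced by hand. A minimal sketch; the counts are inferred from the table rather than stated in it (11 true versicolor samples all found, and 3 other samples wrongly labeled versicolor, which matches the 3 virginica misses implied by a recall of 0.84 on 19 samples):

```python
# Counts for the versicolor class, inferred from the report (an assumption).
tp = 11  # versicolor samples predicted as versicolor
fp = 3   # other samples predicted as versicolor
fn = 0   # versicolor samples predicted as another class

precision = tp / (tp + fp)  # share of predicted versicolor that is correct
recall = tp / (tp + fn)     # share of true versicolor that was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.79 1.0 0.88
```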

XGBoost

from xgboost import XGBClassifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report

if __name__ == '__main__':

    iris = load_iris()

    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)

    xgb = XGBClassifier()

    xgb.fit(x_train, y_train)

    y_pred = xgb.predict(x_test)

    print(classification_report(y_test, y_pred))

Results

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        14

          1       0.93      1.00      0.97        14

          2       1.00      0.90      0.95        10

avg / total       0.98      0.97      0.97        38
