decision_function: scores, predict, and more

Evaluating machine learning models

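Use the PR curve when the positive class is rare, or when you care more about false positives than about false negatives; otherwise use the ROC curve. Take the demo that detects the handwritten digit 5: because only a small fraction of the images are 5s, the ROC curve makes the classifier look good, while the PR curve shows that it actually performs poorly.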
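To make the ROC versus PR contrast concrete, here is a minimal sketch on made-up data. The scores are drawn from two normal distributions (an assumption for illustration only, not the MNIST demo), with positives making up about 1% of the samples.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical scores: 5000 negatives and only 50 positives (about 1% positive).
n_neg, n_pos = 5000, 50
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
y_scores = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=n_neg),  # negatives score lower on average
    rng.normal(loc=1.5, scale=1.0, size=n_pos),  # positives score higher on average
])

roc_auc = roc_auc_score(y_true, y_scores)                   # area under the ROC curve
avg_precision = average_precision_score(y_true, y_scores)   # summary of the PR curve

print(f"ROC AUC:           {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

The same scores look strong under ROC AUC and much weaker under average precision, because the many negatives keep the false positive rate small even when most of the flagged instances are wrong.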

y_scores = sgd_clf.decision_function([some_digit])

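decision_function returns a score for the instance: its signed distance to the separating hyperplane of each class, up to a scaling by the norm of the weight vector. It is not tied to gradient descent as such; linear models such as SGDClassifier and SVMs expose it, while RandomForestClassifier has no decision_function and offers predict_proba instead. The usual way to use this score is to compare it against a threshold of your choice: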

>>> y_scores = sgd_clf.decision_function([some_digit])

>>> y_scores

array([ 161855.74572176])

>>> threshold = 0

>>> y_some_digit_pred = (y_scores > threshold)

>>> y_some_digit_pred

array([ True], dtype=bool)

>>> threshold = 200000

>>> y_some_digit_pred = (y_scores > threshold)

>>> y_some_digit_pred

array([False], dtype=bool)
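The relationship between the score, the hyperplane, and the threshold can be checked directly. The sketch below uses a small synthetic binary dataset in place of the digit data (an assumption; X_train and some_digit from the session above are not needed), and the stricter threshold value is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small synthetic binary problem standing in for the "is it a 5?" task.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SGDClassifier(random_state=0).fit(X, y)

scores = clf.decision_function(X)

# For a linear model the score is w . x + b ...
manual_scores = X @ clf.coef_.ravel() + clf.intercept_[0]
# ... and dividing by ||w|| gives the signed geometric distance to the hyperplane.
signed_distances = scores / np.linalg.norm(clf.coef_)

# predict() is the same as thresholding the score at 0.
default_pred = clf.predict(X)
threshold_zero_pred = (scores > 0).astype(int)

# A higher (arbitrary) threshold accepts fewer instances as positive.
threshold = scores.std()
strict_pred = scores > threshold
print(default_pred.sum(), strict_pred.sum())
```

Raising the threshold trades recall for precision, which is exactly the knob that the PR and ROC curves sweep over.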

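As the example shows, decision_function gives you a score whose geometric meaning is the signed distance from the point to the hyperplane (up to scaling). You can then compare this score with a threshold: instances scoring above the threshold are classified as positive, the rest as negative. Here is another example: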

>>> sgd_clf.fit(X_train, y_train) # y_train, not y_train_5

>>> sgd_clf.predict([some_digit])

array([ 5.])

>>> some_digit_scores = sgd_clf.decision_function([some_digit])

>>> some_digit_scores

array([[-311402.62954431, -363517.28355739, -446449.5306454 ,
        -183226.61023518, -414337.15339485,  161855.74572176,
        -452576.39616343, -471957.14962573, -518542.33997148,
        -536774.63961222]])

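Here sgd_clf.fit(X_train, y_train) trains on all the handwritten training samples together with their 0-9 class labels. Calling decision_function on this model returns one score per class for [some_digit], that is, its signed distance to each class's one-vs-rest hyperplane. predict returns the class with the highest score; here only the score for 5 is positive, so predict returns 5. If we train a binary classifier (True/False) instead, the result is different: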

>>> y_train_5 = (y_train == 5)

>>> sgd_clf.fit(X_train, y_train_5) # y_train_5, not y_train

>>> sgd_clf.predict([some_digit])

array([ True])

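Because the label set y_train_5 contains only True and False, the trained model predicts only True or False. Whether you get a binary or a multiclass classifier therefore depends entirely on the labels used for training.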
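The difference between the two label sets shows up in the shape of the scores. This sketch uses scikit-learn's bundled 8x8 digits dataset instead of MNIST (an assumption, so that it runs without a download); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)  # 8x8 digit images, labels 0-9

# Multiclass labels: one score per class, predict picks the highest one.
multi_clf = SGDClassifier(random_state=42).fit(X, y)
multi_scores = multi_clf.decision_function(X[:5])
multi_pred = multi_clf.predict(X[:5])
print(multi_scores.shape)   # (5, 10)

# Boolean labels: a single score per instance, predict thresholds it at 0.
y_5 = (y == 5)
binary_clf = SGDClassifier(random_state=42).fit(X, y_5)
binary_scores = binary_clf.decision_function(X[:5])
binary_pred = binary_clf.predict(X[:5])
print(binary_scores.shape)  # (5,)
print(binary_pred.dtype)    # bool
```

The estimator class is the same in both cases; only the labels passed to fit decide whether it behaves as a ten-class or a two-class model.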

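predict: returns the predicted class from a trained classification model;

fit: trains the model on the data, also called "fitting" (for example, fitting a linear regression);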

y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

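cross_val_predict returns cross-validated predictions. Note that the method parameter is set to "predict_proba" here, so the return value holds the predicted probability of each class rather than the class labels.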

Reference:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

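One question I did not fully understand at first: what exactly is the relationship between scores and predict, and how does cross_val_score differ from cross_val_predict? The code in the text is as follows: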

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

forest_clf = RandomForestClassifier(random_state=42)

y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

But to plot a ROC curve, you need scores, not probabilities. A simple solution is to use the positive class's probability as the score:

from sklearn.metrics import roc_curve

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class

fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

Still, printing the intermediate values gives some clues:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)

y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

print(y_probas_forest)

[[0.9 0.1] [1. 0. ] [1. 0. ] ... [1. 0. ] [0.9 0.1] [1. 0. ]]

y_scores_forest = y_probas_forest[:, 1]

print(y_scores_forest)

[0.1 0. 0. ... 0. 0.1 0. ]

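As you can see, scores is the second column of the two-dimensional probas array. So what does this array returned by cross_val_predict contain? Each row corresponds to one sample and each column to one class, and the values are the class probabilities. For binary classification the two columns are the probability of the negative class and the probability of the positive class, and scores is simply the positive-class column. With N classes the array is still two-dimensional, but it has N columns.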
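The layout of the probability array can be inspected on a small synthetic dataset (an assumption; the forest settings and variable names here are illustrative, not those of the MNIST demo).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Binary problem: two columns, ordered like forest.classes_.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

probas = forest.predict_proba(X[:4])
print(forest.classes_)   # column order of probas
print(probas.shape)      # (4, 2): 4 samples, 2 classes

# The "score" used for the ROC curve is the positive-class column.
positive_col = list(forest.classes_).index(1)
scores = probas[:, positive_col]

# Three-class problem: still a 2-D array, now with three columns.
X3, y3 = make_classification(n_samples=300, n_features=8, n_informative=3,
                             n_classes=3, random_state=0)
forest3 = RandomForestClassifier(n_estimators=20, random_state=42).fit(X3, y3)
probas3 = forest3.predict_proba(X3[:4])
print(probas3.shape)     # (4, 3)
```

Each row sums to 1, and predict simply returns the class whose column holds the largest value.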

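Also note the method parameter. It belongs to cross_val_predict, not to the forest classifier, and names the estimator method to call. In the example its value is "predict_proba", which returns the probability of each class. Other options include: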

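predict: returns the predicted class, that is, the class with the highest probability;

predict_log_proba: computes the same thing as predict_proba but returns the logarithm of each probability. Working in log space turns products of probabilities into sums, which avoids the extremely small values, and the underflow and rounding errors that come with them, produced by multiplying many probabilities. Some quantities used in machine learning (such as KL divergence) are defined in terms of logarithms anyway, and naive Bayes classifiers rely on log probabilities for numerical stability;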
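The numerical motivation for log probabilities is easy to demonstrate in plain Python. The probabilities below are made up for illustration: 100 independent factors of 1e-5 each, as might arise from per-feature likelihoods in a naive Bayes model.

```python
import math

probs = [1e-5] * 100  # 100 hypothetical small probabilities

# Multiplying them directly underflows: the true value, 1e-500,
# is smaller than any positive double-precision float.
product = 1.0
for p in probs:
    product *= p

# Summing the logarithms keeps the same information in a safe range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -1151.29
```

Comparing classes by their summed log probabilities gives the same ranking as comparing the products would, without ever forming the vanishing product.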

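cv=3 means 3-fold cross-validation: the data is split into three folds (stratified for classifiers, so that each fold keeps roughly the same class proportions), and each fold in turn serves as the test set while the other two form the training set. cross_val_predict does not average anything: every sample belongs to exactly one test fold, so it gets exactly one prediction (here, its per-class probabilities) from the model that was trained without it. cross_val_score, by contrast, returns one evaluation score per fold.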
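The difference between cross_val_score and cross_val_predict can be sketched by reproducing both by hand. This assumes a small synthetic dataset and a LogisticRegression model (chosen here only because it is deterministic), with StratifiedKFold as the splitter that cv=3 selects for a classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

fold_scores = cross_val_score(clf, X, y, cv=3)   # one accuracy value per fold
oof_pred = cross_val_predict(clf, X, y, cv=3)    # one prediction per sample

# Reproduce both by hand: train on two folds, predict and score the third.
manual_pred = np.empty_like(y)
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_pred[test_idx] = model.predict(X[test_idx])      # each sample filled once
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores.shape)  # (3,)
print(oof_pred.shape)     # (300,)
```

So cross_val_score summarizes each fold with a single number, while cross_val_predict keeps the raw out-of-fold predictions, which is what you need to build a confusion matrix, a PR curve, or a ROC curve.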

References

https://stackoverflow.com/questions/20335944/why-use-log-probability-estimates-in-gaussiannb-scikit-learn

https://www.reddit.com/r/MLQuestions/comments/5lzv9o/sklearn_why_predict_log_proba/

https://baike.baidu.com/item/%E5%AF%B9%E6%95%B0%E5%85%AC%E5%BC%8F

https://stats.stackexchange.com/questions/329857/what-is-the-difference-between-decision-function-predict-proba-and-predict-fun

https://stackoverflow.com/questions/36543137/whats-the-difference-between-predict-proba-and-decision-function-in-scikit-lear
