Spark ML Evaluation Metrics by Example

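This article walks through the model evaluation metrics available in the Spark ML library. All code was run in a Jupyter Notebook against Spark 2.4.5. The evaluators live in the package org.apache.spark.ml.evaluation.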

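Here, model evaluation metrics means metrics computed on the test set, not on the training set. The training-set figures printed from the model summaries in the examples are shown only for comparison.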

1. Regression evaluation metrics

RegressionEvaluator

Evaluator for regression, which expects two input columns: prediction and label.

The following metrics are supported:

val metricName: Param[String]

"rmse" (default): root mean squared error
"mse": mean squared error
"r2": R2 metric
"mae": mean absolute error

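For reference, the four metric names correspond to the following standard textbook definitions, where y_i is the label, \hat{y}_i the prediction, \bar{y} the mean label and n the number of rows. This is a sketch of the usual formulas only; it does not cover Spark-specific details such as how the evaluator treats edge cases.

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2, &
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}},\\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i-\hat{y}_i\right\rvert, &
R^2 &= 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}.
\end{aligned}
```

For MSE, RMSE and MAE smaller is better; for R² values closer to 1 are better.
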
Examples

// Import dependencies
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Load the data
val data = spark.read.format("libsvm")
  .load("/data1/software/spark/data/mllib/sample_linear_regression_data.txt")

// Split into training and test sets. The original listing omits this step,
// so the 80/20 ratio is an assumption borrowed from the classification example in section 2.1.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2))

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"Train MSE: ${trainingSummary.meanSquaredError}")
println(s"Train RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"Train MAE: ${trainingSummary.meanAbsoluteError}")
println(s"Train r2: ${trainingSummary.r2}")

// Predict on the test set
val predictions = lrModel.transform(test)

// Evaluate the predictions on the test set (MSE here, not accuracy)
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("mse")
val testMse = evaluator.evaluate(predictions)
println(s"Test MSE: ${testMse}")

Output:

Train MSE: 101.57870147367461
Train RMSE: 10.078625971513905
Train MAE: 8.108865602095849
Train r2: 0.039467152584195975
Test MSE: 114.28454406581636
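
To make the numbers above easier to read (for instance, the Train RMSE is simply the square root of the Train MSE), here is a minimal plain-Scala sketch of the four regression metrics. It needs no Spark. The object name `RegressionMetricsSketch`, its method names and the toy data are all made up for illustration; this is not how Spark implements `RegressionEvaluator`.

```scala
// Plain-Scala sketch of the four regression metrics (no Spark required).
// RegressionMetricsSketch and its methods are illustrative names, not Spark API.
object RegressionMetricsSketch {
  // Each pair is (prediction, label), mirroring the evaluator's two input columns.
  type Pairs = Seq[(Double, Double)]

  // Mean squared error: average of squared residuals.
  def mse(pairs: Pairs): Double =
    pairs.map { case (p, y) => (y - p) * (y - p) }.sum / pairs.size

  // Root mean squared error: square root of the MSE.
  def rmse(pairs: Pairs): Double = math.sqrt(mse(pairs))

  // Mean absolute error: average of absolute residuals.
  def mae(pairs: Pairs): Double =
    pairs.map { case (p, y) => math.abs(y - p) }.sum / pairs.size

  // R squared: 1 - (residual sum of squares / total sum of squares).
  def r2(pairs: Pairs): Double = {
    val meanLabel = pairs.map(_._2).sum / pairs.size
    val ssRes = pairs.map { case (p, y) => (y - p) * (y - p) }.sum
    val ssTot = pairs.map { case (_, y) => (y - meanLabel) * (y - meanLabel) }.sum
    1.0 - ssRes / ssTot
  }
}

// Made-up toy data: (prediction, label).
val toy = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))
println(RegressionMetricsSketch.mse(toy))  // 0.375
println(RegressionMetricsSketch.mae(toy))  // 0.5
println(RegressionMetricsSketch.rmse(toy))
println(RegressionMetricsSketch.r2(toy))
```

The snippet is written in notebook style (top-level `val` and `println`), matching the rest of the article; in a compiled project the last lines would go inside a `main` method.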

2. Classification evaluation metrics

2.1 BinaryClassificationEvaluator

Evaluator for binary classification, which expects two input columns: rawPrediction and label. The rawPrediction column can be of type double (binary 0/1 prediction, or probability of label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).

The following metrics are supported:

val metricName: Param[String]

param for metric name in evaluation (supports "areaUnderROC" (default), "areaUnderPR")
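
As an aid to interpreting "areaUnderROC": the area under the ROC curve equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as one half. The sketch below computes it that way in plain Scala. The object name `AucSketch` and the toy scores are made up for illustration, and this pairwise formulation is not how Spark computes the metric; it is quadratic in the number of rows and only meant for small examples.

```scala
// Plain-Scala sketch: AUC as the fraction of (positive, negative) pairs
// that are ranked correctly. AucSketch is an illustrative name, not Spark API.
object AucSketch {
  // scored: (score for label 1, label), with labels 0.0 or 1.0
  def areaUnderROC(scored: Seq[(Double, Double)]): Double = {
    val pos = scored.filter(_._2 == 1.0).map(_._1)
    val neg = scored.filter(_._2 == 0.0).map(_._1)
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative row")
    // 1 for a correctly ordered pair, 0.5 for a tie, 0 otherwise.
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}

// Made-up toy data: (score, label).
val scoredToy = Seq((0.1, 0.0), (0.4, 0.0), (0.35, 1.0), (0.8, 1.0))
println(AucSketch.areaUnderROC(scoredToy))  // 0.75: three of the four pairs are ordered correctly
```

A value of 1.0 therefore means every positive row is scored above every negative row, regardless of where the 0.5 decision threshold falls.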

Examples

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the data and split it into training and test sets
val data = spark.read.format("libsvm").load("/data1/software/spark/data/mllib/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(train)

// Summarize the model over the training set and print out some metrics
val trainSummary = lrModel.summary
println(s"Train accuracy: ${trainSummary.accuracy}")
println(s"Train weightedPrecision: ${trainSummary.weightedPrecision}")
println(s"Train weightedRecall: ${trainSummary.weightedRecall}")
println(s"Train weightedFMeasure: ${trainSummary.weightedFMeasure}")

// Predict on the test set
val predictions = lrModel.transform(test)
predictions.show(5)

// Evaluate on the test set: area under the ROC curve
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
val auc = evaluator.evaluate(predictions)
println(s"Test AUC: ${auc}")

// Weighted precision comes from the multiclass evaluator, which reads the prediction column
val mulEvaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("weightedPrecision")
// Call mulEvaluator here; evaluator.evaluate would only return the AUC again
val precision = mulEvaluator.evaluate(predictions)
println(s"Test weightedPrecision: ${precision}")

Output:

Train accuracy: 0.9873417721518988
Train weightedPrecision: 0.9876110961486668
Train weightedRecall: 0.9873417721518987
Train weightedFMeasure: 0.9873124561568825
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[122,123,148...|[0.29746771419036...|[0.57382336211209...|       0.0|
|  0.0|(692,[125,126,127...|[0.42262389447949...|[0.60411095396791...|       0.0|
|  0.0|(692,[126,127,128...|[0.74220898710237...|[0.67747871191347...|       0.0|
|  0.0|(692,[126,127,128...|[0.77729372618481...|[0.68509655708828...|       0.0|
|  0.0|(692,[127,128,129...|[0.70928896866149...|[0.67024402884354...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
Test AUC: 1.0
Test weightedPrecision: 1.0

Note: the run that produced this output computed the last value with evaluator.evaluate rather than mulEvaluator.evaluate, so the 1.0 shown for weightedPrecision is the AUC printed a second time. The weighted precision returned by mulEvaluator may differ.

2.2 MulticlassClassificationEvaluator

Evaluator for multiclass classification, which expects two input columns: prediction and label.

Note: because this evaluator handles multiclass problems, it also applies to the binary case above.

The following metrics are supported:

val metricName: Param[String]

param for metric name in evaluation (supports "f1" (default), "weightedPrecision", "weightedRecall", "accuracy")
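
The "weighted" metrics average the per-class values, weighting each class by its share of the true labels; "f1" is the weighted average of the per-class F1 scores. The plain-Scala sketch below illustrates this on a made-up three-class example. The names `MulticlassMetricsSketch`, `Summary` and `summarize` are hypothetical, and treating an undefined ratio (a class that is never predicted) as 0 is an assumption of this sketch.

```scala
// Plain-Scala sketch of accuracy and the class-weighted metrics (no Spark required).
// MulticlassMetricsSketch, Summary and summarize are illustrative names, not Spark API.
object MulticlassMetricsSketch {
  final case class Summary(accuracy: Double, weightedPrecision: Double,
                           weightedRecall: Double, f1: Double)

  // pairs: (prediction, label), mirroring the evaluator's two input columns.
  def summarize(pairs: Seq[(Double, Double)]): Summary = {
    val n = pairs.size.toDouble
    val labels = pairs.map(_._2).distinct
    // Assumption of this sketch: an undefined ratio (0/0) counts as 0.
    def ratio(num: Double, den: Double): Double = if (den == 0.0) 0.0 else num / den

    // For each class: (weight, precision, recall, F1).
    val perClass = labels.map { c =>
      val tp = pairs.count { case (p, y) => p == c && y == c }.toDouble
      val predicted = pairs.count(_._1 == c).toDouble
      val actual = pairs.count(_._2 == c).toDouble
      val precision = ratio(tp, predicted)
      val recall = ratio(tp, actual)
      val f1 = ratio(2 * precision * recall, precision + recall)
      (actual / n, precision, recall, f1)
    }

    Summary(
      accuracy = pairs.count { case (p, y) => p == y } / n,
      weightedPrecision = perClass.map { case (w, p, _, _) => w * p }.sum,
      weightedRecall = perClass.map { case (w, _, r, _) => w * r }.sum,
      f1 = perClass.map { case (w, _, _, f) => w * f }.sum
    )
  }
}

// Made-up toy data: (prediction, label) for three classes 0.0, 1.0, 2.0.
val labelled = Seq((0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0),
                   (1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (2.0, 2.0))
val summary = MulticlassMetricsSketch.summarize(labelled)
println(summary.accuracy)  // 0.75: six of the eight rows are classified correctly
println(summary)
```

With these weights the weighted recall always works out to the same value as the accuracy, which is why the two training figures in section 2.1 agree to about fifteen decimal places.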

Examples

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("/data1/software/spark/data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

Output:

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           0.0|  0.0|(692,[95,96,97,12...|
|           0.0|  0.0|(692,[122,123,124...|
|           0.0|  0.0|(692,[122,123,148...|
|           0.0|  0.0|(692,[126,127,128...|
|           0.0|  0.0|(692,[126,127,128...|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0.040000000000000036
