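Spark Machine Learning

The machine learning algorithms available in Mahout on Spark and those supported by MLlib are summarized below:

This summary focuses on MLlib.

Classification and Regression

Classification and regression are both forms of supervised learning.

In supervised learning, a model is trained on labeled data (LabeledPoint) and then used to predict results for test data. Labeled data is feature data whose outcome is already known.

Classification and regression differ in the type of variable they predict:

  Classification predicts a discrete variable (for example, classifying email as spam or not spam). For binary classification the labels are 0 and 1; for multiclass classification the labels range from 0 to C-1, where C is the number of classes.

  Regression predicts a continuous variable (for example, predicting height from age and weight).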

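Linear Regression

  Linear regression is one of the most commonly used regression methods; it predicts the output value from a linear combination of the features.

  The classes available for linear regression are:

    LinearRegressionWithSGD
    RidgeRegressionWithSGD
    LassoWithSGD

    Ridge regression uses L2 regularization;
    Lasso uses L1 regularization.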
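To make the difference between the two regularizers concrete, here is a minimal plain-Scala sketch of the penalty terms. The object and function names are made up for illustration, and lambda stands for the regularization parameter; this is not MLlib's internal code, only the usual textbook form of the two penalties.

```scala
object RegularizationSketch {
  // L2 penalty (ridge): (lambda / 2) * sum of squared weights
  def l2Penalty(weights: Array[Double], lambda: Double): Double =
    0.5 * lambda * weights.map(w => w * w).sum

  // L1 penalty (lasso): lambda * sum of absolute weights
  def l1Penalty(weights: Array[Double], lambda: Double): Double =
    lambda * weights.map(w => math.abs(w)).sum
}
```

The L1 penalty tends to push some weights to exactly zero, while the L2 penalty shrinks all weights smoothly.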

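  Parameters:

    stepSize: the step size (learning rate) of gradient descent, not the number of steps

    numIterations: the number of iterations

    miniBatchFraction: the fraction of the data sampled in each iteration

    setIntercept: whether to fit an intercept (bias) term, which is equivalent to adding a feature whose value is always 1; defaults to false

  Default values: {stepSize: 1.0, numIterations: 100, miniBatchFraction: 1.0}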
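The roles of the step size, the iteration count and the intercept can be seen in a small plain-Scala sketch of gradient descent for linear regression. This is a simplification under stated assumptions: it uses the full batch in every iteration and a constant step, and the names GradientDescentSketch, fit and useIntercept are hypothetical, not part of MLlib.

```scala
object GradientDescentSketch {
  // Full-batch gradient descent on the squared error of y ≈ w · x + b.
  // data holds (label, features) pairs; returns (weights, intercept).
  def fit(
      data: Seq[(Double, Array[Double])],
      stepSize: Double,
      numIterations: Int,
      useIntercept: Boolean
  ): (Array[Double], Double) = {
    val n = data.head._2.length
    val w = Array.fill(n)(0.0)
    var b = 0.0
    for (_ <- 1 to numIterations) {
      val gradW = Array.fill(n)(0.0)
      var gradB = 0.0
      for ((label, x) <- data) {
        // Prediction error for this example
        val err = x.indices.map(i => w(i) * x(i)).sum + b - label
        for (i <- x.indices) gradW(i) += err * x(i)
        gradB += err
      }
      // Move against the averaged gradient, scaled by the step size
      for (i <- w.indices) w(i) -= stepSize * gradW(i) / data.size
      if (useIntercept) b -= stepSize * gradB / data.size
    }
    (w, b)
  }
}
```

With useIntercept set to false the intercept stays at 0.0, which matches the intercept:0.0 seen when a model is trained without setIntercept(true).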

  Using the model:

    1. Predict on data with model.predict()

    2. Get the feature weights with model.weights

  Evaluating the model:

    Mean squared error (MSE)

Example:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

/**
  * Created by Edward on 2016/9/21.
  */
object LinearRegression {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("LinearRegression").setMaster("local")
    val sc = new SparkContext(conf)

    // Load and parse the data: each line is "label,f1 f2 ... fn"
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Build the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
    // Alternative: also fit an intercept
    // val lr = new LinearRegressionWithSGD().setIntercept(true)
    // val model = lr.run(parsedData)

    // Print the feature weights and the intercept
    println("weights:%s, intercept:%s".format(model.weights, model.intercept))

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }

    // Compute the mean squared error
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("training Mean Squared Error = " + MSE)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = LinearRegressionModel.load(sc, "myModelPath")

    sc.stop()
  }
}

Data:

-0.4307829,-1.63735562648104 -2.00621178480549 -1.86242597251066 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

-0.1625189,-1.98898046126935 -0.722008756122123 -0.787896192088153 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

-0.1625189,-1.57881887548545 -2.1887840293994 1.36116336875686 -1.02470580167082 -0.522940888712441 -0.863171185425945 0.342627053981254 -0.155348103855541

-0.1625189,-2.16691708463163 -0.807993896938655 -0.787896192088153 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

0.3715636,-0.507874475300631 -0.458834049396776 -0.250631301876899 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

0.7654678,-2.03612849966376 -0.933954647105133 -1.86242597251066 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
...

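The first column of the data is the label, that is, the known result; the remaining columns are the features.

Prediction means supplying a new set of feature values and predicting the result for them.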
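The line layout (label, then a comma, then space-separated features) can be parsed without Spark, which is handy for checking a file before training. A minimal sketch, where LpsaLineParser is a hypothetical helper name:

```scala
object LpsaLineParser {
  // Splits "label,f1 f2 ... fn" into (label, features).
  def parse(line: String): (Double, Array[Double]) = {
    val parts = line.split(',')
    val label = parts(0).toDouble
    val features = parts(1).trim.split(' ').map(_.toDouble)
    (label, features)
  }
}
```

Each sample line shown above yields one label and eight feature values.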

Result:

weights:[0.5808575763272221,0.18930001482946976,0.2803086929991066,0.1110834181777876,0.4010473965597895,-0.5603061626684255,-0.5804740464000981,0.8742741176970946], intercept:0.0
training Mean Squared Error = 6.207597210613579

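Logistic Regression

Logistic regression is a binary classification method that can also be used for multiclass classification.

  The classes available for logistic regression are:

  LogisticRegressionWithLBFGS (recommended)

  LogisticRegressionWithSGD

  Parameters:

  Similar to those of linear regression

  Using the model:

  1. Predict on data with model.predict()
  2. Get the feature weights with model.weights

  Evaluating the model:

  Binary classification: AUC (Area Under the ROC Curve)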
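One way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The plain-Scala sketch below computes it by comparing all pairs. It is only meant to illustrate the metric on small inputs; AucSketch is a hypothetical name, and this is not how BinaryClassificationMetrics is implemented.

```scala
object AucSketch {
  // scoreAndLabels: (score, label) pairs, label 1.0 = positive, 0.0 = negative.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative")
    // 1 point if the positive outranks the negative, 0.5 for a tie, 0 otherwise
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

A value near 1 means positives are almost always ranked above negatives; 0.5 corresponds to random ranking.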

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

/**
  * Created by Edward on 2016/9/21.
  */
object LogisticRegression {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("LogisticRegression").setMaster("local")
    val sc: SparkContext = new SparkContext(conf)

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model.
    // The sample data has two classes (labels 0 and 1), so numClasses is 2.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    // Clear the threshold so that predict() returns raw scores instead of 0/1 labels;
    // the ROC and PR curves below are built from scores.
    // For hard 0/1 predictions, set a threshold instead, e.g. model.setThreshold(0.8).
    model.clearThreshold()

    // Compute raw scores on the test set.
    val scoreAndLabels = test.map { case LabeledPoint(label, features) =>
      val score = model.predict(features)
      (score, label)
    }

    // Multiclass metrics (apply to predicted class labels rather than raw scores)
    // Get evaluation metrics.
    //val metrics = new MulticlassMetrics(predictionAndLabels)
    //val precision = metrics.precision
    //println("Precision = " + precision)

    // Binary classification metrics
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)

    // Evaluate the model with the area under the ROC (receiver operating characteristic) curve;
    // the closer to 1, the better
    val auROC: Double = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)

    // Evaluate the model with the area under the PR (precision-recall) curve;
    // the closer to 1, the better
    val underPR: Double = metrics.areaUnderPR()
    println("Area under PR = " + underPR)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

    sc.stop()
  }
}

Support Vector Machines (SVMs)

  A binary classification algorithm.

  Usage is similar to binary classification with logistic regression.

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

/**
  * Created by Edward on 2016/9/21.
  */
object SVMs {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SVM").setMaster("local")
    val sc: SparkContext = new SparkContext(conf)

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the default threshold.
    model.clearThreshold()

    // Compute raw scores on the test set.
    val scoreAndLabels = test.map { point =>
      println("feature=" + point.features)
      val score = model.predict(point.features)
      (score, point.label)
    }
    scoreAndLabels.foreach(println(_))

    // Get evaluation metrics.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println("metrics=" + metrics)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = SVMModel.load(sc, "myModelPath")

    sc.stop()
  }
}

Data:

0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224

1 159:124 160:253 161:255 162:63 186:96 187:244 188:251 189:253 190:62 214:127 215:251 216:251 217:253 218:62 
...
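Each line of this LIBSVM-format file is a label followed by index:value pairs for the non-zero features, with indices starting at 1. As a rough plain-Scala sketch of what reading one line involves (LibSvmLineParser is a hypothetical helper; in the examples the actual loading is done by MLUtils.loadLibSVMFile):

```scala
object LibSvmLineParser {
  // Parses "label i1:v1 i2:v2 ..." into (label, zero-based indices, values).
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { t =>
      val kv = t.split(':')
      // LIBSVM indices are one-based; shift them to zero-based
      (kv(0).toInt - 1, kv(1).toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }
}
```

Only non-zero entries are listed on a line, so the format maps naturally onto a sparse vector.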

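Collaborative Filtering

  In Spark, collaborative filtering is implemented mainly with alternating least squares (ALS).

  Parameters:

  numBlocks: the number of blocks, used to control the degree of parallelism

  rank: the number of latent factors (the size of each feature vector)

  iterations: the number of iterations

  lambda: the regularization parameter

  alpha: a constant used to compute confidence in implicit-feedback ALS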
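To give rank and alpha some intuition: a factorization model represents every user and every item as a vector of length rank and predicts a rating as the dot product of the two. For implicit feedback, a common formulation turns an observation r into a confidence of 1 + alpha * r. The plain-Scala sketch below illustrates both under those assumptions; AlsSketch and its functions are hypothetical names, and the confidence formula is the commonly cited form rather than something taken from this post.

```scala
object AlsSketch {
  // Predicted rating: dot product of a user's and an item's latent factor vectors.
  def predict(userFactors: Array[Double], itemFactors: Array[Double]): Double = {
    require(userFactors.length == itemFactors.length, "both vectors must have length rank")
    userFactors.zip(itemFactors).map { case (u, v) => u * v }.sum
  }

  // Confidence for an implicit observation r, assuming the form c = 1 + alpha * r.
  def confidence(alpha: Double, r: Double): Double = 1.0 + alpha * r
}
```

A larger rank lets the model capture more structure at the cost of more computation and a higher risk of overfitting, which is where lambda comes in.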

  Method:

  ALS.train

  Evaluating the model:

  Mean squared error (MSE)

Example:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

/**
  * Created by Edward on 2016/9/22.
  */
object CollaborativeALS {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("CollaborativeALS").setMaster("local")
    val sc: SparkContext = new SparkContext(conf)

    // Load and parse the data: each line is "user,item,rating"
    val data = sc.textFile("data/mllib/als/test.data")
    val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
      Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // Build the recommendation model using ALS (the last argument is lambda)
    val rank = 10
    val numIterations = 10
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // Evaluate the model on rating data
    val usersProducts = ratings.map { case Rating(user, product, rate) =>
      (user, product)
    }
    val predictions =
      model.predict(usersProducts).map { case Rating(user, product, rate) =>
        ((user, product), rate)
      }
    val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
      ((user, product), rate)
    }.join(predictions)

    // Mean squared error between actual and predicted ratings
    val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
      val err = r1 - r2
      err * err
    }.mean()
    println("Mean Squared Error = " + MSE)

    // Save and load model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

    sc.stop()
  }
}

To be continued...
