Reading the Classification Code in Spark 0.9.0's MLlib Machine Learning Package
This chapter walks through the classification algorithms implemented in the MLlib package. Three are currently implemented: LogisticRegression, SVM, and NaiveBayes. The first two plug their respective objective functions and regularization terms into the stochastic gradient descent routine in the Optimization module, so their parallelization lies mainly in computing the stochastic gradient; NaiveBayes instead parallelizes the computation of the class prior probabilities and the per-feature conditional probabilities. Details follow.
LogisticRegression.scala
/**
 * Classification model trained using Logistic Regression.
 *
 * @param weights Weights computed for every feature.
 * @param intercept Intercept computed for this model.
 */
class LogisticRegressionModel(
    override val weights: Array[Double],
    override val intercept: Double)
  extends GeneralizedLinearModel(weights, intercept)
  with ClassificationModel with Serializable {

  override def predictPoint(dataMatrix: DoubleMatrix, weightMatrix: DoubleMatrix,
      intercept: Double) = {
    val margin = dataMatrix.mmul(weightMatrix).get(0) + intercept
    round(1.0 / (1.0 + math.exp(margin * -1)))
  }
}
Logistic regression's predictPoint function takes the sample to predict, the weight vector weights, and the intercept term. The discriminant function of logistic regression is f = 1/(1 + exp(-w·x)); the code computes margin = w·x + intercept and returns 1/(1 + exp(-margin)) rounded to the nearest integer, i.e. the predicted label.
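To make the rule concrete, here is a minimal standalone sketch of the same prediction, using plain Scala arrays in place of the jblas matrices (the object name and sample values are made up for illustration):

object LogisticPredictSketch {
  def predict(features: Array[Double], weights: Array[Double], intercept: Double): Double = {
    // margin = w . x + intercept
    val margin = features.zip(weights).map { case (x, w) => x * w }.sum + intercept
    // sigmoid, then round to the nearest label (0.0 or 1.0)
    math.round(1.0 / (1.0 + math.exp(-margin))).toDouble
  }

  def main(args: Array[String]): Unit = {
    // margin = 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1; sigmoid(0.1) ~ 0.525 -> label 1.0
    println(predict(Array(1.0, 2.0), Array(0.5, -0.25), 0.1))
  }
}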
class LogisticRegressionWithSGD private (
    var stepSize: Double,
    var numIterations: Int,
    var regParam: Double,
    var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[LogisticRegressionModel]
  with Serializable {

  val gradient = new LogisticGradient()
  val updater = new SimpleUpdater()
  override val optimizer = new GradientDescent(gradient, updater)
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setRegParam(regParam)
    .setMiniBatchFraction(miniBatchFraction)
  override val validators = List(DataValidators.classificationLabels)

  /**
   * Construct a LogisticRegression object with default parameters
   */
  def this() = this(1.0, 100, 0.0, 1.0)

  def createModel(weights: Array[Double], intercept: Double) = {
    new LogisticRegressionModel(weights, intercept)
  }
}
The source first instantiates gradient and updater (both defined under the optimization package): the loss function is log-loss, and SimpleUpdater applies no regularization. It then overrides the optimizer with a GradientDescent instance configured from the class fields, and the auxiliary constructor supplies the defaults stepSize = 1.0, numIterations = 100, regParam = 0.0, miniBatchFraction = 1.0.
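For intuition about what LogisticGradient computes, here is a simplified per-example sketch (an illustration under stated assumptions, not the MLlib source, which works on jblas matrices): for a label y in {0, 1}, the gradient of the log-loss with respect to w is (sigmoid(w·x) − y)·x.

object LogisticGradientSketch {
  // Returns (gradient, loss) for one example; label is assumed 0.0 or 1.0.
  def compute(features: Array[Double], label: Double, weights: Array[Double])
      : (Array[Double], Double) = {
    val margin = features.zip(weights).map { case (x, w) => x * w }.sum
    val p = 1.0 / (1.0 + math.exp(-margin))       // P(y = 1 | x)
    val gradient = features.map(_ * (p - label))  // (sigmoid(w.x) - y) * x
    val loss = if (label > 0.5) -math.log(p) else -math.log(1.0 - p)
    (gradient, loss)
  }
}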
object LogisticRegressionWithSGD {

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      miniBatchFraction: Double,
      initialWeights: Array[Double]): LogisticRegressionModel = {
    new LogisticRegressionWithSGD(stepSize, numIterations, 0.0, miniBatchFraction).run(
      input, initialWeights)
  }

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      miniBatchFraction: Double): LogisticRegressionModel = {
    new LogisticRegressionWithSGD(stepSize, numIterations, 0.0, miniBatchFraction).run(input)
  }

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double): LogisticRegressionModel = {
    train(input, numIterations, stepSize, 1.0)
  }

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int): LogisticRegressionModel = {
    train(input, numIterations, 1.0, 1.0)
  }

  def main(args: Array[String]) {
    if (args.length != 4) {
      println("Usage: LogisticRegression <master> <input_dir> <step_size> <niters>")
      System.exit(1)
    }
    val sc = new SparkContext(args(0), "LogisticRegression")
    val data = MLUtils.loadLabeledData(sc, args(1))
    val model = LogisticRegressionWithSGD.train(data, args(3).toInt, args(2).toDouble)
    println("Weights: " + model.weights.mkString("[", ", ", "]"))
    println("Intercept: " + model.intercept)
    sc.stop()
  }
}
Four train overloads are defined for the different argument combinations. In main, MLUtils.loadLabeledData(sc, args(1)) converts an input file whose lines have the form <label>,<feature1> <feature2> ... into an RDD[LabeledPoint]. LR training is then invoked, and finally the weights and intercept are printed.
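A hypothetical end-to-end usage sketch follows (the master URL and input path are placeholders; each input line is assumed to look like 1.0,0.5 1.2 -0.3 — the label, a comma, then space-separated features):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

object LRTrainSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "LRTrainSketch")
    // hypothetical path; each line: "<label>,<f1> <f2> ..."
    val data = MLUtils.loadLabeledData(sc, "/tmp/lr_data.txt")
    // 50 iterations, step size 1.0; miniBatchFraction defaults to 1.0
    val model = LogisticRegressionWithSGD.train(data, 50, 1.0)
    println("Weights: " + model.weights.mkString("[", ", ", "]"))
    sc.stop()
  }
}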
class SVMModel(
    override val weights: Array[Double],
    override val intercept: Double)
  extends GeneralizedLinearModel(weights, intercept)
  with ClassificationModel with Serializable {

  override def predictPoint(dataMatrix: DoubleMatrix, weightMatrix: DoubleMatrix,
      intercept: Double) = {
    val margin = dataMatrix.dot(weightMatrix) + intercept
    if (margin < 0) 0.0 else 1.0
  }
}
class SVMWithSGD private (
    var stepSize: Double,
    var numIterations: Int,
    var regParam: Double,
    var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[SVMModel] with Serializable {

  val gradient = new HingeGradient()
  val updater = new SquaredL2Updater()
  override val optimizer = new GradientDescent(gradient, updater)
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setRegParam(regParam)
    .setMiniBatchFraction(miniBatchFraction)
  override val validators = List(DataValidators.classificationLabels)

  def this() = this(1.0, 100, 1.0, 1.0)

  def createModel(weights: Array[Double], intercept: Double) = {
    new SVMModel(weights, intercept)
  }
}
This mirrors LR: the gradient is swapped for the gradient of the hinge loss (HingeGradient) and the updater for an L2-regularized one (SquaredL2Updater). Note also that SVMModel.predictPoint simply thresholds the margin w·x + intercept at zero.
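As a rough sketch, the hinge-loss subgradient looks like the following simplified version (this is an illustration, not the MLlib source; labels are assumed to be 0/1 and are mapped to −1/+1 first, as MLlib does):

object HingeGradientSketch {
  // Subgradient of max(0, 1 - y * w.x) for one example; label is 0.0 or 1.0.
  def compute(features: Array[Double], label: Double, weights: Array[Double]): Array[Double] = {
    val y = 2.0 * label - 1.0                          // map {0,1} -> {-1,+1}
    val margin = features.zip(weights).map { case (x, w) => x * w }.sum
    if (y * margin < 1.0) features.map(x => -y * x)    // violates the margin: -y * x
    else Array.fill(features.length)(0.0)              // outside the margin: zero gradient
  }
}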
object SVMWithSGD {

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      regParam: Double,
      miniBatchFraction: Double,
      initialWeights: Array[Double]): SVMModel = {
    new SVMWithSGD(stepSize, numIterations, regParam, miniBatchFraction).run(input,
      initialWeights)
  }

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      regParam: Double,
      miniBatchFraction: Double): SVMModel = {
    new SVMWithSGD(stepSize, numIterations, regParam, miniBatchFraction).run(input)
  }

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      regParam: Double): SVMModel = {
    train(input, numIterations, stepSize, regParam, 1.0)
  }

  def train(
      input: RDD[LabeledPoint],
      numIterations: Int): SVMModel = {
    train(input, numIterations, 1.0, 1.0, 1.0)
  }

  def main(args: Array[String]) {
    if (args.length != 5) {
      println("Usage: SVM <master> <input_dir> <step_size> <regularization_parameter> <niters>")
      System.exit(1)
    }
    val sc = new SparkContext(args(0), "SVM")
    val data = MLUtils.loadLabeledData(sc, args(1))
    val model = SVMWithSGD.train(data, args(4).toInt, args(2).toDouble, args(3).toDouble)
    println("Weights: " + model.weights.mkString("[", ", ", "]"))
    println("Intercept: " + model.intercept)
    sc.stop()
  }
}
class NaiveBayesModel(val pi: Array[Double], val theta: Array[Array[Double]])
  extends ClassificationModel with Serializable {

  // Create a column vector that can be used for predictions
  private val _pi = new DoubleMatrix(pi.length, 1, pi: _*)
  private val _theta = new DoubleMatrix(theta)

  def predict(testData: RDD[Array[Double]]): RDD[Double] = testData.map(predict)

  def predict(testData: Array[Double]): Double = {
    val dataMatrix = new DoubleMatrix(testData.length, 1, testData: _*)
    val result = _pi.add(_theta.mmul(dataMatrix))
    result.argmax()
  }
}
The Naive Bayes classifier. NaiveBayesModel takes the outputs of training: the class prior probabilities pi (P(y=0), P(y=1), ...) and the conditional probabilities theta of each feature given a class (P(x|y)), both stored in log space. With term-frequency (TF or TF-IDF) features it serves as a text classifier; with 0-1 encoded features it behaves like a Bernoulli model. The first predict takes a test RDD; the second takes a single test sample. Bayes' rule gives P(y|x) ∝ P(x|y)P(y); the implementation takes logarithms of both sides, turning the product into a sum, which avoids floating-point underflow and is cheap to compute. Finally, result.argmax() returns the class with the largest posterior score.
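Here is a minimal plain-Scala sketch of the same scoring rule that _pi.add(_theta.mmul(dataMatrix)) implements with jblas (an illustration with made-up names, not the MLlib source):

object NaiveBayesPredictSketch {
  // pi(c) and theta(c)(i) are log probabilities, as produced by training;
  // score(c) = log pi(c) + sum_i theta(c)(i) * x_i, predict the argmax class.
  def predict(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Int = {
    val scores = pi.indices.map { c =>
      pi(c) + theta(c).zip(x).map { case (logP, xi) => logP * xi }.sum
    }
    scores.indexOf(scores.max)
  }
}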
class NaiveBayes private (var lambda: Double) extends Serializable with Logging {

  def this() = this(1.0)

  /** Set the smoothing parameter. Default: 1.0. */
  def setLambda(lambda: Double): NaiveBayes = {
    this.lambda = lambda
    this
  }

  def run(data: RDD[LabeledPoint]) = {
    val zeroCombiner = mutable.Map.empty[Int, (Int, DoubleMatrix)]
    val aggregated = data.aggregate(zeroCombiner)({ (combiner, point) =>
      point match {
        case LabeledPoint(label, features) =>
          val (count, featuresSum) = combiner.getOrElse(label.toInt, (0, DoubleMatrix.zeros(1)))
          val fs = new DoubleMatrix(features.length, 1, features: _*)
          combiner += label.toInt -> (count + 1, featuresSum.addi(fs))
      }
    }, { (lhs, rhs) =>
      for ((label, (c, fs)) <- rhs) {
        val (count, featuresSum) = lhs.getOrElse(label, (0, DoubleMatrix.zeros(1)))
        lhs(label) = (count + c, featuresSum.addi(fs))
      }
      lhs
    })

    // Kinds of label
    val C = aggregated.size
    // Total sample count
    val N = aggregated.values.map(_._1).sum

    val pi = new Array[Double](C)
    val theta = new Array[Array[Double]](C)
    val piLogDenom = math.log(N + C * lambda)

    for ((label, (count, fs)) <- aggregated) {
      val thetaLogDenom = math.log(fs.sum() + fs.length * lambda)
      pi(label) = math.log(count + lambda) - piLogDenom
      theta(label) = fs.toArray.map(f => math.log(f + lambda) - thetaLogDenom)
    }

    new NaiveBayesModel(pi, theta)
  }
}
This class implements the training algorithm. The lambda parameter avoids the awkward case of P(x|y) = 0 (known in the literature as Laplace smoothing). The core is data.aggregate: zeroCombiner is a mutable Map whose key is the class label and whose value is an (Int, DoubleMatrix) pair — the Int counts the samples of that class (for the prior), and the DoubleMatrix accumulates the per-feature sums for that class, which are later normalized in log space into the conditional probabilities theta.
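To make the seqOp/combOp shape of data.aggregate concrete, here is a toy local version of the same accumulation (a Seq of (label, features) pairs stands in for the RDD; the names are made up):

import scala.collection.mutable

object AggregateSketch {
  // label -> (sample count, per-feature sums), mirroring the seqOp above.
  def aggregateLocal(points: Seq[(Int, Array[Double])]): mutable.Map[Int, (Int, Array[Double])] = {
    val acc = mutable.Map.empty[Int, (Int, Array[Double])]
    for ((label, features) <- points) {
      val (count, sum) = acc.getOrElse(label, (0, Array.fill(features.length)(0.0)))
      acc += label -> (count + 1, sum.zip(features).map { case (a, b) => a + b })
    }
    acc
  }
}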
object NaiveBayes {

  def train(input: RDD[LabeledPoint]): NaiveBayesModel = {
    new NaiveBayes().run(input)
  }

  def train(input: RDD[LabeledPoint], lambda: Double): NaiveBayesModel = {
    new NaiveBayes(lambda).run(input)
  }

  def main(args: Array[String]) {
    if (args.length != 2 && args.length != 3) {
      println("Usage: NaiveBayes <master> <input_dir> [<lambda>]")
      System.exit(1)
    }
    val sc = new SparkContext(args(0), "NaiveBayes")
    val data = MLUtils.loadLabeledData(sc, args(1))
    val model = if (args.length == 2) {
      NaiveBayes.train(data)
    } else {
      NaiveBayes.train(data, args(2).toDouble)
    }
    println("Pi: " + model.pi.mkString("[", ", ", "]"))
    println("Theta:\n" + model.theta.map(_.mkString("[", ", ", "]")).mkString("[", "\n ", "]"))
    sc.stop()
  }
}
NaiveBayes training comes in two flavors, with and without an explicit lambda. main creates a SparkContext, converts the dataset into RDD[LabeledPoint], trains, and prints pi and theta. A final bit of trivia: this algorithm was written by an engineer at Intel who goes by 灵魂机器 on Weibo; you can follow him on GitHub at https://github.com/soulmachine