Spark (1.1) MLlib source code analysis

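Spark MLlib 1.1 added the stat package, which provides a number of statistics-related functions. This article looks at the principles behind the chi-squared test and how it is implemented.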

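I. Principles

The stat package implements Pearson's chi-squared test, which covers the following two kinds of test:

(1) Goodness of fit test: tests whether the frequency distribution of a set of observations differs from a theoretical distribution.

(2) Independence test: tests whether paired observations drawn from two variables are independent of each other (for example, each time sample one person from country A and one from country B, and check whether their responses are unrelated to nationality).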

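Formula:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed value and E_i is the expected value for category i.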

For the underlying theory, see: http://zh.wikipedia.org/wiki/%E7%9A%AE%E7%88%BE%E6%A3%AE%E5%8D%A1%E6%96%B9%E6%AA%A2%E5%AE%9A
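As a worked illustration of the formula, the sketch below sums (O - E)² / E over all categories in plain Scala, without Spark. The object name `ChiSquaredFormula`, the method `pearson`, and the die-roll counts are made up for this example and are not part of MLlib.

```scala
object ChiSquaredFormula {
  // Pearson's statistic: the sum over all categories of (O - E)^2 / E.
  // Assumes every expected value is strictly positive.
  def pearson(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.size == expected.size, "observed and expected must have the same size")
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: a die rolled 60 times; a fair die expects 10 rolls per face.
    val observed = Seq(5.0, 8.0, 9.0, 8.0, 10.0, 20.0)
    val expected = Seq.fill(6)(10.0)
    // (25 + 4 + 1 + 4 + 0 + 100) / 10, i.e. about 13.4, with 6 - 1 = 5 degrees of freedom
    println(pearson(observed, expected))
  }
}
```

The larger the statistic relative to the degrees of freedom, the less plausible it is that the observations follow the expected distribution.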

II. Java API usage example

　　https://github.com/tovin-xu/mllib_example/blob/master/src/main/java/com/mllib/example/stat/ChiSquaredSuite.java

III. Source code analysis

1. External API

The Statistics class exposes four external methods:

// Goodness of fit test
def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult = {
  ChiSqTest.chiSquared(observed, expected)
}

// Goodness of fit test (uniform expected distribution)
def chiSqTest(observed: Vector): ChiSqTestResult = ChiSqTest.chiSquared(observed)

// Independence test on a contingency matrix
def chiSqTest(observed: Matrix): ChiSqTestResult = ChiSqTest.chiSquaredMatrix(observed)

// Independence test of every feature against the label
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
  ChiSqTest.chiSquaredFeatures(data)
}

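2. Goodness of fit test implementation

This one is fairly simple. The key step is computing the chi-squared statistic by summing (observed - expected)² / expected over all categories.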

  /*
   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
   * Uniform distribution is assumed when `expected` is not passed in.
   */
  def chiSquared(observed: Vector,
      expected: Vector = Vectors.dense(Array[Double]()),
      methodName: String = PEARSON.name): ChiSqTestResult = {

    // Validate input arguments
    val method = methodFromString(methodName)
    if (expected.size != 0 && observed.size != expected.size) {
      throw new IllegalArgumentException("observed and expected must be of the same size.")
    }
    val size = observed.size
    if (size > 1000) {
      logWarning("Chi-squared approximation may not be accurate due to low expected frequencies "
        + s" as a result of a large number of categories: $size.")
    }
    val obsArr = observed.toArray
    // If expected is not set, default every entry to 1.0 / size (uniform distribution)
    val expArr = if (expected.size == 0) Array.tabulate(size)(_ => 1.0 / size) else expected.toArray
    // Observed and expected values must all be non-negative
    if (!obsArr.forall(_ >= 0.0)) {
      throw new IllegalArgumentException("Negative entries disallowed in the observed vector.")
    }
    if (expected.size != 0 && ! expArr.forall(_ >= 0.0)) {
      throw new IllegalArgumentException("Negative entries disallowed in the expected vector.")
    }

    // Determine the scaling factor for expected
    val obsSum = obsArr.sum
    val expSum = if (expected.size == 0.0) 1.0 else expArr.sum
    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum

    // compute chi-squared statistic
    val statistic = obsArr.zip(expArr).foldLeft(0.0) { case (stat, (obs, exp)) =>
      if (exp == 0.0) {
        if (obs == 0.0) {
          throw new IllegalArgumentException("Chi-squared statistic undefined for input vectors due"
            + " to 0.0 values in both observed and expected.")
        } else {
          return new ChiSqTestResult(0.0, size - 1, Double.PositiveInfinity, PEARSON.name,
            NullHypothesis.goodnessOfFit.toString)
        }
      }
      // Accumulate (observed - expected)² / expected, rescaling expected if needed
      if (scale == 1.0) {
        stat + method.chiSqFunc(obs, exp)
      } else {
        stat + method.chiSqFunc(obs, exp * scale)
      }
    }
    val df = size - 1
    val pValue = chiSquareComplemented(df, statistic)
    new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
  }
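The scaling step in the code above is easy to overlook: when the expected vector is given as relative frequencies (or sums to anything other than the observed total), each expected entry is multiplied by obsSum / expSum before the statistic is computed. The following is a minimal standalone sketch of that step, not the MLlib code itself. The names `ExpectedScaling`, `scaledExpected` and `statistic` are made up for illustration, and the sketch assumes all expected entries are positive, so it skips the zero handling shown above.

```scala
object ExpectedScaling {
  // Rescale expected so that it sums to the same total as observed,
  // using the same 1e-7 tolerance as the code above.
  def scaledExpected(observed: Array[Double], expected: Array[Double]): Array[Double] = {
    val obsSum = observed.sum
    val expSum = expected.sum
    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum
    expected.map(_ * scale)
  }

  // Pearson statistic against the rescaled expected values.
  // Assumes every expected entry is strictly positive.
  def statistic(observed: Array[Double], expected: Array[Double]): Double =
    observed.zip(scaledExpected(observed, expected)).map { case (o, e) =>
      val d = o - e
      d * d / e
    }.sum

  def main(args: Array[String]): Unit = {
    // Hypothetical data: 100 observations, expected given as proportions.
    val observed = Array(20.0, 30.0, 50.0)
    val expected = Array(0.25, 0.25, 0.5)
    println(scaledExpected(observed, expected).mkString(", ")) // 25.0, 25.0, 50.0
    println(statistic(observed, expected))                     // 25/25 + 25/25 + 0 = 2.0
  }
}
```

Because of this rescaling, callers can pass either raw expected counts or proportions and get the same statistic.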

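3. Independence test implementation

First compute the expected values with the formula below, for a contingency matrix with r rows, c columns and grand total N:

$$E_{i,j} = \frac{\left(\sum_{k=1}^{c} O_{i,k}\right)\left(\sum_{k=1}^{r} O_{k,j}\right)}{N}$$

Then compute the chi-squared statistic by summing (observed - expected)² / expected over all cells.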

  /*
   * Pearson's independence test on the input contingency matrix.
   * TODO: optimize for SparseMatrix when it becomes supported.
   */
  def chiSquaredMatrix(counts: Matrix, methodName: String = PEARSON.name): ChiSqTestResult = {
    val method = methodFromString(methodName)
    val numRows = counts.numRows
    val numCols = counts.numCols

    // get row and column sums
    val colSums = new Array[Double](numCols)
    val rowSums = new Array[Double](numRows)
    val colMajorArr = counts.toArray
    var i = 0
    while (i < colMajorArr.size) {
      val elem = colMajorArr(i)
      if (elem < 0.0) {
        throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
      }
      colSums(i / numRows) += elem
      rowSums(i % numRows) += elem
      i += 1
    }
    val total = colSums.sum

    // second pass to collect statistic
    var statistic = 0.0
    var j = 0
    while (j < colMajorArr.size) {
      val col = j / numRows
      val colSum = colSums(col)
      if (colSum == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
          + s"0 sum in column [$col].")
      }
      val row = j % numRows
      val rowSum = rowSums(row)
      if (rowSum == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
          + s"0 sum in row [$row].")
      }
      val expected = colSum * rowSum / total
      statistic += method.chiSqFunc(colMajorArr(j), expected)
      j += 1
    }
    val df = (numCols - 1) * (numRows - 1)
    val pValue = chiSquareComplemented(df, statistic)
    new ChiSqTestResult(pValue, df, statistic, methodName, NullHypothesis.independence.toString)
  }
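To make the column-major indexing concrete, here is a small standalone sketch of the same two-pass idea in plain Scala. It is an illustration rather than the MLlib implementation: the object `IndependenceSketch`, its `statistic` method and the 2 x 2 table are made up for this example, and the sketch assumes every row and column sum is positive, so it omits the zero-sum checks.

```scala
object IndependenceSketch {
  // values is column-major: element (row, col) lives at index col * numRows + row,
  // so index i belongs to column i / numRows and row i % numRows.
  def statistic(numRows: Int, numCols: Int, values: Array[Double]): Double = {
    require(values.length == numRows * numCols, "values must have numRows * numCols entries")
    val rowSums = new Array[Double](numRows)
    val colSums = new Array[Double](numCols)
    // First pass: accumulate row and column sums.
    for (i <- values.indices) {
      colSums(i / numRows) += values(i)
      rowSums(i % numRows) += values(i)
    }
    val total = colSums.sum
    // Second pass: expected = colSum * rowSum / total, then sum (O - E)^2 / E.
    values.indices.map { j =>
      val expected = colSums(j / numRows) * rowSums(j % numRows) / total
      val d = values(j) - expected
      d * d / expected
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 2 x 2 table with rows (10, 20) and (30, 40), stored column by column.
    val values = Array(10.0, 30.0, 20.0, 40.0)
    // Expected cells are 12, 28, 18, 42; the statistic is 50/63, about 0.7937,
    // with (2 - 1) * (2 - 1) = 1 degree of freedom.
    println(statistic(2, 2, values))
  }
}
```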
