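Spark (1.1) MLlib Source Code Analysis

Spark MLlib 1.1 added the stat package, which contains a number of statistics-related functions. This article analyzes the principles and implementation of its chi-squared test: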

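I. Basic Principles

  The stat package implements Pearson's chi-squared test, which covers the following two kinds of test:

    (1) Goodness of fit test: checks whether the frequency distribution of a set of observations differs from a theoretical distribution.

    (2) Independence test: checks whether paired observations drawn from two variables are independent of each other (for example, pick one person from country A and one from country B each time, and see whether their responses are unrelated to nationality).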

  Formula:

    χ² = Σ (O - E)² / E

    where O is the observed value and E is the expected value, summed over all categories.

  For the underlying theory, see: http://zh.wikipedia.org/wiki/%E7%9A%AE%E7%88%BE%E6%A3%AE%E5%8D%A1%E6%96%B9%E6%AA%A2%E5%AE%9A
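
As a worked instance of the formula, consider a hypothetical experiment (the numbers are assumed for illustration and do not come from the Spark code): a die is rolled 60 times and the six faces come up 5, 8, 9, 8, 10 and 20 times. Under the null hypothesis of a fair die every face is expected 10 times, so:

```latex
\begin{aligned}
\chi^2 &= \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} \\
       &= \frac{(5-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(9-10)^2}{10}
        + \frac{(8-10)^2}{10} + \frac{(10-10)^2}{10} + \frac{(20-10)^2}{10} \\
       &= 2.5 + 0.4 + 0.1 + 0.4 + 0 + 10 = 13.4
\end{aligned}
```

With 6 - 1 = 5 degrees of freedom, 13.4 is above the 5% critical value of about 11.07, so the fair-die hypothesis would be rejected at that level.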

II. Java API Example

　　https://github.com/tovin-xu/mllib_example/blob/master/src/main/java/com/mllib/example/stat/ChiSquaredSuite.java

III. Source Code Analysis

  1. External API

    The Statistics class exposes four external interfaces:

// Goodness of fit test against an explicit expected distribution
def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult = {
  ChiSqTest.chiSquared(observed, expected)
}

// Goodness of fit test against a uniform distribution
def chiSqTest(observed: Vector): ChiSqTestResult = ChiSqTest.chiSquared(observed)

// Independence test on a contingency matrix
def chiSqTest(observed: Matrix): ChiSqTestResult = ChiSqTest.chiSquaredMatrix(observed)

// Independence test of every feature against the label
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
  ChiSqTest.chiSquaredFeatures(data)
}

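  2. Goodness of Fit Test Implementation

  This one is fairly simple. The key step is computing the chi-squared statistic as (observed - expected)² / expected, summed over all categories.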

/*
 * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
 * Uniform distribution is assumed when `expected` is not passed in.
 */
def chiSquared(observed: Vector,
    expected: Vector = Vectors.dense(Array[Double]()),
    methodName: String = PEARSON.name): ChiSqTestResult = {

  // Validate input arguments
  val method = methodFromString(methodName)
  if (expected.size != 0 && observed.size != expected.size) {
    throw new IllegalArgumentException("observed and expected must be of the same size.")
  }
  val size = observed.size
  if (size > 1000) {
    logWarning("Chi-squared approximation may not be accurate due to low expected frequencies "
      + s" as a result of a large number of categories: $size.")
  }
  val obsArr = observed.toArray
  // If expected was not supplied, default every entry to 1.0 / size
  val expArr = if (expected.size == 0) Array.tabulate(size)(_ => 1.0 / size) else expected.toArray
  // Both observed and expected entries must be non-negative
  if (!obsArr.forall(_ >= 0.0)) {
    throw new IllegalArgumentException("Negative entries disallowed in the observed vector.")
  }
  if (expected.size != 0 && ! expArr.forall(_ >= 0.0)) {
    throw new IllegalArgumentException("Negative entries disallowed in the expected vector.")
  }

  // Determine the scaling factor for expected
  val obsSum = obsArr.sum
  val expSum = if (expected.size == 0.0) 1.0 else expArr.sum
  val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum

  // compute chi-squared statistic
  val statistic = obsArr.zip(expArr).foldLeft(0.0) { case (stat, (obs, exp)) =>
    if (exp == 0.0) {
      if (obs == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input vectors due"
          + " to 0.0 values in both observed and expected.")
      } else {
        return new ChiSqTestResult(0.0, size - 1, Double.PositiveInfinity, PEARSON.name,
          NullHypothesis.goodnessOfFit.toString)
      }
    }
    // Accumulate (observed - expected)² / expected, rescaling expected when the sums differ
    if (scale == 1.0) {
      stat + method.chiSqFunc(obs, exp)
    } else {
      stat + method.chiSqFunc(obs, exp * scale)
    }
  }
  val df = size - 1
  val pValue = chiSquareComplemented(df, statistic)
  new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
}
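
The default-and-rescale handling of `expected` in the code above can be sketched in isolation. The object and method names below are hypothetical and not part of Spark. This simplified version always divides by the sum of `expected` (Spark skips rescaling when the two sums differ by less than 1e-7) and simply rejects zero expected values instead of reproducing Spark's special cases.

```scala
object GoodnessOfFitSketch {
  /** Hypothetical helper (not part of Spark): Pearson statistic with the same
    * default-and-rescale treatment of `expected` as chiSquared. */
  def statistic(observed: Array[Double],
                expected: Array[Double] = Array.empty[Double]): Double = {
    val size = observed.length
    require(expected.isEmpty || expected.length == size,
      "observed and expected must be of the same size.")
    // No expected vector: assume a uniform distribution, 1.0 / size per category.
    val exp = if (expected.isEmpty) Array.fill(size)(1.0 / size) else expected
    // Simplification: this sketch rejects zero expected values outright.
    require(observed.forall(_ >= 0.0) && exp.forall(_ > 0.0),
      "observed must be non-negative and expected strictly positive in this sketch.")
    // Rescale expected so that it sums to the same total as observed.
    val scale = observed.sum / exp.sum
    observed.zip(exp).map { case (o, e) =>
      val scaled = e * scale
      (o - scaled) * (o - scaled) / scaled
    }.sum
  }

  val observed = Array(20.0, 30.0, 50.0)
  val uniform = statistic(observed)
  val fromCounts = statistic(observed, Array(25.0, 25.0, 50.0))
  val fromFrequencies = statistic(observed, Array(0.25, 0.25, 0.5))

  def main(args: Array[String]): Unit =
    println(s"uniform=$uniform counts=$fromCounts frequencies=$fromFrequencies")
}
```

Because of the rescaling, passing expected counts (25, 25, 50) or the equivalent relative frequencies (0.25, 0.25, 0.5) yields the same statistic, which is why the API accepts either form.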

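  3. Independence Test Implementation

    First compute the expected value of every cell with the formula below, where the matrix has r rows and c columns:

    E(i, j) = (sum of row i) × (sum of column j) / N,  for i = 1..r and j = 1..c, where N is the sum of all entries

    Then compute the chi-squared statistic as (observed - expected)² / expected, summed over all cells.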

/*
 * Pearson's independence test on the input contingency matrix.
 * TODO: optimize for SparseMatrix when it becomes supported.
 */
def chiSquaredMatrix(counts: Matrix, methodName:String = PEARSON.name): ChiSqTestResult = {
  val method = methodFromString(methodName)
  val numRows = counts.numRows
  val numCols = counts.numCols

  // get row and column sums
  val colSums = new Array[Double](numCols)
  val rowSums = new Array[Double](numRows)
  val colMajorArr = counts.toArray
  var i = 0
  while (i < colMajorArr.size) {
    val elem = colMajorArr(i)
    if (elem < 0.0) {
      throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
    }
    // The array is column-major: index i lies in column i / numRows and row i % numRows
    colSums(i / numRows) += elem
    rowSums(i % numRows) += elem
    i += 1
  }
  val total = colSums.sum

  // second pass to collect statistic
  var statistic = 0.0
  var j = 0
  while (j < colMajorArr.size) {
    val col = j / numRows
    val colSum = colSums(col)
    if (colSum == 0.0) {
      throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
        + s"0 sum in column [$col].")
    }
    val row = j % numRows
    val rowSum = rowSums(row)
    if (rowSum == 0.0) {
      throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
        + s"0 sum in row [$row].")
    }
    // Expected count of the cell under independence
    val expected = colSum * rowSum / total
    statistic += method.chiSqFunc(colMajorArr(j), expected)
    j += 1
  }
  val df = (numCols - 1) * (numRows - 1)
  val pValue = chiSquareComplemented(df, statistic)
  new ChiSqTestResult(pValue, df, statistic, methodName, NullHypothesis.independence.toString)
}
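
The two passes above can be reproduced on a small table without Spark. The following is a minimal sketch under stated assumptions: the object name, the method signature and the 2 x 2 counts are all hypothetical, and the zero-sum and negative-entry checks are left out for brevity.

```scala
object IndependenceSketch {
  /** Hypothetical helper (not part of Spark): Pearson statistic of a contingency
    * table stored in column-major order, as Matrix.toArray returns it. */
  def statistic(numRows: Int, numCols: Int, colMajor: Array[Double]): Double = {
    require(colMajor.length == numRows * numCols, "array size must be numRows * numCols")
    val rowSums = new Array[Double](numRows)
    val colSums = new Array[Double](numCols)
    // First pass: index i lies in column i / numRows and row i % numRows.
    for (i <- colMajor.indices) {
      colSums(i / numRows) += colMajor(i)
      rowSums(i % numRows) += colMajor(i)
    }
    val total = colSums.sum
    // Second pass: expected = rowSum * colSum / total for each cell.
    colMajor.indices.map { i =>
      val expected = colSums(i / numRows) * rowSums(i % numRows) / total
      val diff = colMajor(i) - expected
      diff * diff / expected
    }.sum
  }

  // Rows are (10, 20) and (30, 40); in column-major order that is 10, 30, 20, 40.
  val table = Array(10.0, 30.0, 20.0, 40.0)
  val chiSq = statistic(2, 2, table)
  val degreesOfFreedom = (2 - 1) * (2 - 1)

  def main(args: Array[String]): Unit =
    println(s"statistic=$chiSq df=$degreesOfFreedom")
}
```

For this table the row sums are 30 and 70, the column sums 40 and 60, and the total 100, so the expected counts are 12, 18, 28 and 42 and the statistic is 4/12 + 4/18 + 4/28 + 4/42 = 50/63, roughly 0.794, with 1 degree of freedom.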
