II. MLlib Statistics: Correlation, Sampling, and Summary

Summary statistics:

Summary statistics are computed per column. The summary covers six quantities: mean, variance, number of nonzero entries, total count, minimum, and maximum.

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.linalg.Vector;

import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;

import org.apache.spark.mllib.stat.Statistics;

JavaSparkContext jsc = ...

JavaRDD<Vector> mat = ... // an RDD of Vectors

// Compute column summary statistics.

MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());

System.out.println(summary.mean()); // a dense vector containing the mean value for each column

System.out.println(summary.variance()); // column-wise variance

System.out.println(summary.numNonzeros()); // number of nonzeros in each column
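To make the six quantities concrete, here is a minimal plain-Java sketch (no Spark involved) that computes them over a small in-memory matrix. The class ColumnStatsSketch and its field names are hypothetical, and the variance uses the unbiased n - 1 divisor, which is assumed to match the convention of colStats.

```java
import java.util.Arrays;

// Plain-Java illustration only (no Spark): column statistics over an in-memory matrix.
// The class and field names are hypothetical.
final class ColumnStatsSketch {
    final long count;
    final double[] mean;
    final double[] variance;
    final double[] numNonzeros;
    final double[] min;
    final double[] max;

    // Each inner array is one row; rows are assumed non-empty and of equal length.
    ColumnStatsSketch(double[][] rows) {
        int cols = rows[0].length;
        count = rows.length;
        mean = new double[cols];
        variance = new double[cols];
        numNonzeros = new double[cols];
        min = new double[cols];
        max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: mean, nonzero count, minimum, and maximum per column.
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                mean[j] += row[j] / count;
                if (row[j] != 0.0) {
                    numNonzeros[j] += 1.0;
                }
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }

        // Second pass: unbiased sample variance (divide by n - 1).
        // Assumption: this is the same convention colStats uses.
        double denom = Math.max(1L, count - 1);
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                double d = row[j] - mean[j];
                variance[j] += d * d / denom;
            }
        }
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, 0.0}, {2.0, 4.0}, {3.0, 0.0}};
        ColumnStatsSketch s = new ColumnStatsSketch(rows);
        System.out.println("count       = " + s.count);
        System.out.println("mean        = " + Arrays.toString(s.mean));
        System.out.println("variance    = " + Arrays.toString(s.variance));
        System.out.println("numNonzeros = " + Arrays.toString(s.numNonzeros));
        System.out.println("min         = " + Arrays.toString(s.min));
        System.out.println("max         = " + Arrays.toString(s.max));
    }
}
```

Note that every row contributes to every column, which is why the input is a collection of row vectors rather than column arrays.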

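Correlations:

This computes the correlation between two series of data. A correlation coefficient is a statistic that measures how closely two variables are related. For the Pearson coefficient, the closer the value is to 1 or -1, the better the data can be described by a linear fit; the Spearman coefficient applies the same idea to ranks, so it captures monotonic relationships. Spark currently supports two correlation coefficients: the Pearson correlation coefficient (pearson) and the Spearman rank correlation coefficient (spearman).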

import org.apache.spark.api.java.JavaDoubleRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.linalg.*;

import org.apache.spark.mllib.stat.Statistics;

JavaSparkContext jsc = ...

JavaDoubleRDD seriesX = ... // a series

JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a

// method is not specified, Pearson's method will be used by default.

Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");

JavaRDD<Vector> data = ... // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.

// If a method is not specified, Pearson's method will be used by default.

Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
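The difference between the two coefficients is easier to see in a small plain-Java sketch. This is only an illustration under stated assumptions, not Spark's implementation: the class CorrelationSketch is hypothetical, and Spearman is computed as the Pearson coefficient of average ranks (tied values share their mean rank).

```java
import java.util.Arrays;
import java.util.Comparator;

// Plain-Java illustration only (no Spark); names are hypothetical.
final class CorrelationSketch {

    // Pearson coefficient: covariance divided by the product of standard deviations.
    static double pearson(double[] x, double[] y) {
        double mx = Arrays.stream(x).average().orElse(0.0);
        double my = Arrays.stream(y).average().orElse(0.0);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // 1-based ranks; tied values receive the average of the ranks they span.
    static double[] ranks(double[] v) {
        Integer[] order = new Integer[v.length];
        for (int k = 0; k < order.length; k++) {
            order[k] = k;
        }
        Arrays.sort(order, Comparator.comparingDouble(a -> v[a]));
        double[] r = new double[v.length];
        int start = 0;
        while (start < order.length) {
            int end = start;
            while (end + 1 < order.length && v[order[end + 1]] == v[order[start]]) {
                end++;
            }
            double avgRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                r[order[k]] = avgRank;
            }
            start = end + 1;
        }
        return r;
    }

    // Spearman coefficient: Pearson applied to the ranks of both series.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] squares = {1, 4, 9, 16, 25}; // monotonic in x, but not linear
        System.out.println("pearson  = " + pearson(x, squares));
        System.out.println("spearman = " + spearman(x, squares));
    }
}
```

For a monotonic but nonlinear relationship such as the squares above, Spearman reaches 1 while Pearson stays below 1, which is the usual reason to pick one method over the other.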

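Stratified sampling:

Stratified sampling draws a sample by key, and each key can be given its own probability of being selected; see the code and comments below for details. Unlike the other statistics functions, sampleByKey and sampleByKeyExact are performed on RDDs of key-value pairs. The key can be thought of as a label and the value as an attribute of an entity. For example, the key can be gender (male or female) or a document ID, and the value can be the age of a person in the population or a word in the document. sampleByKey decides at random whether each observation is sampled, so the requested sample size is only met in expectation. sampleByKeyExact needs more resources than the simple random sampling used by sampleByKey, but it delivers the exact sample size with 99.99% confidence.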

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;

import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = ...

JavaPairRDD<K, V> data = ... // an RDD of any key value pairs

Map<K, Object> fractions = ... // specify the exact fraction desired from each key

// Get an approximate sample and an exact sample from each stratum

JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);

JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
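The contrast between the two methods can be sketched in plain Java on an in-memory list. This is a simplified illustration, not how Spark implements either method: the class StratifiedSamplingSketch, the explicit seed parameter, and the shuffle-and-truncate strategy for the exact variant are all assumptions made for the example. The approximate variant keeps each pair independently with its key's fraction; the exact variant takes ceil(fraction × stratum size) pairs per key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Plain-Java illustration only (no Spark); names and strategy are hypothetical.
final class StratifiedSamplingSketch {

    // Approximate: keep each pair independently with probability fractions.get(key).
    // The per-key sample size is correct only in expectation.
    static <K, V> List<Map.Entry<K, V>> sampleByKey(
            List<Map.Entry<K, V>> data, Map<K, Double> fractions, long seed) {
        Random rng = new Random(seed);
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> e : data) {
            double p = fractions.getOrDefault(e.getKey(), 0.0);
            if (rng.nextDouble() < p) {
                out.add(e);
            }
        }
        return out;
    }

    // Exact: group by key, shuffle each stratum, take ceil(fraction * size) pairs.
    // This needs the whole stratum in memory, which hints at why it costs more.
    static <K, V> List<Map.Entry<K, V>> sampleByKeyExact(
            List<Map.Entry<K, V>> data, Map<K, Double> fractions, long seed) {
        Map<K, List<Map.Entry<K, V>>> strata = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : data) {
            strata.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e);
        }
        Random rng = new Random(seed);
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, List<Map.Entry<K, V>>> s : strata.entrySet()) {
            List<Map.Entry<K, V>> items = s.getValue();
            Collections.shuffle(items, rng);
            double fraction = fractions.getOrDefault(s.getKey(), 0.0);
            int take = (int) Math.ceil(fraction * items.size());
            out.addAll(items.subList(0, take));
        }
        return out;
    }
}
```

Without replacement, fractions are expected to lie in [0, 1]; keys missing from the map are treated as fraction 0 in this sketch.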

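Hypothesis testing:

Spark currently supports Pearson's chi-squared tests, covering both goodness of fit and independence. The goodness-of-fit test checks whether the observed frequency distribution differs from a theoretical distribution. The independence test checks whether paired observations drawn on two variables are independent of each other (for example, sampling one person each from country A and country B and checking whether their responses are unrelated to nationality). The listings below also include a one-sample Kolmogorov-Smirnov test against a normal distribution.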

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.linalg.*;

import org.apache.spark.mllib.regression.LabeledPoint;

import org.apache.spark.mllib.stat.Statistics;

import org.apache.spark.mllib.stat.test.ChiSqTestResult;

JavaSparkContext jsc = ...

Vector vec = ... // a vector composed of the frequencies of events

// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,

// the test runs against a uniform distribution.

ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);

// summary of the test including the p-value, degrees of freedom, test statistic, the method used,

// and the null hypothesis.

System.out.println(goodnessOfFitTestResult);

Matrix mat = ... // a contingency matrix

// conduct Pearson's independence test on the input contingency matrix

ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);

// summary of the test including the p-value, degrees of freedom...

System.out.println(independenceTestResult);

JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct

// the independence test. Returns an array containing the ChiSqTestResult for every feature

// against the label.

ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());

int i = 1;

for (ChiSqTestResult result : featureTestResults) {

    System.out.println("Column " + i + ":");

    System.out.println(result); // summary of the test

    i++;

}
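For reference, the statistic behind both chi-squared tests is the sum of (observed - expected)² / expected over all cells. The following plain-Java sketch computes only that statistic and the degrees of freedom; it does not compute a p-value, and the class ChiSquaredSketch and its method names are hypothetical rather than part of Spark.

```java
// Plain-Java illustration only (no Spark); names are hypothetical.
final class ChiSquaredSketch {

    // Goodness of fit against a uniform distribution: every category is
    // expected to receive total / k observations.
    static double goodnessOfFit(double[] observed) {
        double total = 0.0;
        for (double o : observed) {
            total += o;
        }
        double expected = total / observed.length;
        double stat = 0.0;
        for (double o : observed) {
            stat += (o - expected) * (o - expected) / expected;
        }
        return stat;
    }

    // Independence test on a contingency table: the expected count of a cell
    // is rowSum * colSum / total.
    static double independence(double[][] table) {
        int rows = table.length;
        int cols = table[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += table[i][j];
                colSum[j] += table[i][j];
                total += table[i][j];
            }
        }
        double stat = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                double d = table[i][j] - expected;
                stat += d * d / expected;
            }
        }
        return stat;
    }

    // Degrees of freedom: k - 1 for goodness of fit, (r - 1)(c - 1) for independence.
    static int goodnessOfFitDf(int categories) {
        return categories - 1;
    }

    static int independenceDf(int rows, int cols) {
        return (rows - 1) * (cols - 1);
    }
}
```

A larger statistic for the same degrees of freedom means stronger evidence against the null hypothesis; a table whose cells already equal their expected counts gives a statistic of 0.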

import java.util.Arrays;

import org.apache.spark.api.java.JavaDoubleRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.stat.Statistics;

import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;

JavaSparkContext jsc = ...

JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));

KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);

// summary of the test including the p-value, test statistic,

// and null hypothesis

// if our p-value indicates significance, we can reject the null hypothesis

System.out.println(testResult);
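The test statistic reported by a one-sample Kolmogorov-Smirnov test is the largest gap between the empirical distribution of the sample and the reference CDF. Below is a minimal plain-Java sketch of that statistic only (no p-value). The class KsStatisticSketch is hypothetical, and the helper uses the Uniform(0, 1) CDF instead of the normal CDF from the listing above, purely because the Java standard library has no built-in normal CDF.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Plain-Java illustration only (no Spark); names are hypothetical.
final class KsStatisticSketch {

    // D = max over sorted points of the gap between the empirical CDF
    // (just after and just before each point) and the reference CDF.
    static double statistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            double above = (i + 1.0) / n - f; // empirical CDF just after the point
            double below = f - (double) i / n; // empirical CDF just before the point
            d = Math.max(d, Math.max(above, below));
        }
        return d;
    }

    // Reference CDF of Uniform(0, 1), used here as a stand-in for "norm".
    static double uniformCdf(double x) {
        return Math.min(1.0, Math.max(0.0, x));
    }

    public static void main(String[] args) {
        double[] sample = {0.75, 0.25, 0.5};
        System.out.println("D = " + statistic(sample, KsStatisticSketch::uniformCdf));
    }
}
```

Any other continuous CDF can be passed as the second argument; a small D means the sample is consistent with the reference distribution.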

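Streaming significance testing:

A significance test starts from a hypothesis about the form of the population (the null hypothesis) and then uses sample information to judge whether that hypothesis is reasonable, that is, whether the real situation differs significantly from the null hypothesis. Put differently, a significance test decides whether the difference between the sample and the hypothesized population is purely due to chance or arises because the hypothesis does not match the true population. spark.mllib implements online tests to support use cases such as A/B testing.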
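As a rough picture of what "online" means for an A/B test, the plain-Java sketch below keeps running statistics per group and computes a two-sample statistic whenever it is asked. Two things are assumptions here, since the text does not specify them: the choice of Welch's t statistic as the test, and the class OnlineWelchSketch with its nested Group type, which are hypothetical names and not Spark's API.

```java
// Plain-Java illustration only (no Spark); names are hypothetical.
final class OnlineWelchSketch {

    // Running count, mean, and sum of squared deviations for one group,
    // updated one observation at a time (Welford's method).
    static final class Group {
        long n;
        double mean;
        double m2;

        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        // Unbiased sample variance; 0 until at least two observations arrive.
        double variance() {
            return n > 1 ? m2 / (n - 1) : 0.0;
        }
    }

    // Welch's t statistic: difference of means over its estimated standard error.
    // Assumes both groups have at least two observations and nonzero variance.
    static double welchT(Group a, Group b) {
        double standardError = Math.sqrt(a.variance() / a.n + b.variance() / b.n);
        return (a.mean - b.mean) / standardError;
    }

    public static void main(String[] args) {
        Group control = new Group();
        Group treatment = new Group();
        for (double x : new double[] {1.0, 2.0, 3.0}) {
            control.add(x);
        }
        for (double x : new double[] {2.0, 4.0, 6.0}) {
            treatment.add(x);
        }
        System.out.println("t = " + welchT(treatment, control));
    }
}
```

Because each group only stores three numbers, new observations can be folded in as they arrive without keeping the full history, which is the property a streaming test relies on.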

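Random data generation:

Random data is useful for randomized algorithms, prototyping, and testing. The RandomRDDs package currently supports three distributions: normal, Poisson, and uniform.

RandomRDDs provides factory methods that generate random double RDDs or vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution N(0, 1), and then maps it to N(1, 4).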

import org.apache.spark.api.java.JavaDoubleRDD;

import org.apache.spark.api.java.JavaSparkContext;

import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the

// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.

JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);

// Apply a transform to get a random double RDD following `N(1, 4)`.

// mapToDouble keeps the result a JavaDoubleRDD; map would return a JavaRDD<Double>.

JavaDoubleRDD v = u.mapToDouble(x -> 1.0 + 2.0 * x);
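The mapping works because an affine transform of a normal variable is again normal: if X follows N(0, 1), then 1 + 2X has mean 1 and variance 2² = 4. The plain-Java sketch below checks this empirically with java.util.Random instead of Spark; the class NormalTransformSketch, the sample size, and the seed are hypothetical choices made for the illustration.

```java
import java.util.Random;

// Plain-Java illustration only (no Spark); names are hypothetical.
final class NormalTransformSketch {

    // Draw n standard normal values and map each x to 1 + 2x.
    static double[] sampleShiftedNormal(int n, long seed) {
        Random rng = new Random(seed);
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = 1.0 + 2.0 * rng.nextGaussian();
        }
        return values;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Unbiased sample variance.
    static double variance(double[] values) {
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return sumSq / (values.length - 1);
    }

    public static void main(String[] args) {
        double[] v = sampleShiftedNormal(200_000, 42L);
        // Both values should land close to 1 and 4 respectively.
        System.out.println("mean     = " + mean(v));
        System.out.println("variance = " + variance(v));
    }
}
```

In the notation N(1, 4) the second parameter is the variance, so the scale factor applied to x is the standard deviation 2, not 4.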

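Kernel density estimation:

spark.mllib provides a utility class, KernelDensity, for kernel density estimation. Kernel density estimation estimates an unknown density from known samples and is a nonparametric method. The idea is as follows: look at the observed values; if a value appears among the observations, the probability density at that value is taken to be high, the density at nearby values is also relatively high, and the density at values far from it is lower. Combining the contributions of all observed values yields an estimate of the full, unknown distribution.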

import org.apache.spark.mllib.stat.KernelDensity;

import org.apache.spark.rdd.RDD;

RDD<Double> data = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian

// kernels

KernelDensity kd = new KernelDensity()

  .setSample(data)

  .setBandwidth(3.0);

// Find density estimates for the given values

double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
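The intuition above corresponds to placing one Gaussian bump on every sample and averaging them. The plain-Java sketch below shows that formula directly, with the bandwidth playing the role of the standard deviation of each Gaussian kernel, as the comment in the listing above describes. The class GaussianKdeSketch is hypothetical and is not Spark's implementation.

```java
// Plain-Java illustration only (no Spark); names are hypothetical.
final class GaussianKdeSketch {

    // density(x) = (1 / (n * h * sqrt(2 * pi))) * sum_i exp(-(x - x_i)^2 / (2 * h^2)),
    // where h is the bandwidth (the standard deviation of each Gaussian kernel).
    static double[] estimate(double[] sample, double bandwidth, double[] points) {
        double norm = 1.0 / (sample.length * bandwidth * Math.sqrt(2.0 * Math.PI));
        double[] densities = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            double sum = 0.0;
            for (double s : sample) {
                double z = (points[i] - s) / bandwidth;
                sum += Math.exp(-0.5 * z * z);
            }
            densities[i] = sum * norm;
        }
        return densities;
    }

    public static void main(String[] args) {
        double[] sample = {0.0, 1.0, 1.5, 6.0};
        double[] densities = estimate(sample, 3.0, new double[] {-1.0, 2.0, 5.0});
        for (double d : densities) {
            System.out.println(d);
        }
    }
}
```

A larger bandwidth spreads each sample's contribution further and gives a smoother estimate; a smaller one follows the individual samples more closely.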
