Spark MLlib BasicStatistics: statistics basics
1. Dependencies and JavaSparkContext creation
package ML.BasicStatistics;

import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.DoubleFlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.mllib.linalg.Matrices;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.KernelDensity;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.ChiSqTestResult;
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDD;
import scala.Tuple2;
import static org.apache.spark.mllib.random.RandomRDDs.*;

import java.util.*;

/**
* TODO
*
* @ClassName: BasicStatistics
* @author: DingH
* @since: 2019/4/3 16:11
*/
public class BasicStatistics {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir","E:\\hadoop-2.6.5");
SparkConf conf = new SparkConf().setAppName("BasicStatistics").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
2. Summary statistics
/**
 * @Title: Statistics.colStats returns an instance of MultivariateStatisticalSummary, which contains
 * the column-wise max, min, mean, variance and number of nonzeros, as well as the total count.
 * Summary statistics
 */
JavaRDD<Vector> observations = jsc.parallelize(Arrays.asList(
Vectors.dense(1, 0, 3),
Vectors.dense(2, 0, 4),
Vectors.dense(3, 0, 5)
));
MultivariateStatisticalSummary summary = Statistics.colStats(observations.rdd());
System.out.println(summary.mean());
System.out.println(summary.variance());
System.out.println(summary.numNonzeros());
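// For completeness, the summary object also exposes the column-wise max and min and the total row
// count mentioned above; these are standard MultivariateStatisticalSummary accessors:
System.out.println(summary.max());   // expected [3.0, 0.0, 5.0]
System.out.println(summary.min());   // expected [1.0, 0.0, 3.0]
System.out.println(summary.count()); // expected 3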
3. Correlations
/**
* @Title: Correlations
*/
JavaRDD<Tuple2<String, String>> tupleRDD = jsc.parallelize(Lists.newArrayList(
        new Tuple2<String, String>("cat", "11"),
        new Tuple2<String, String>("dog", "22"),
        new Tuple2<String, String>("cat", "33"),
        new Tuple2<String, String>("pig", "44")));
JavaDoubleRDD seriesX = tupleRDD.mapPartitionsToDouble(new DoubleFlatMapFunction<Iterator<Tuple2<String, String>>>() {
public Iterable<Double> call(Iterator<Tuple2<String, String>> tuple2Iterator) throws Exception {
ArrayList<Double> strings = new ArrayList<Double>();
while (tuple2Iterator.hasNext()){
strings.add(Double.parseDouble(tuple2Iterator.next()._2));
}
return strings;
}
});
JavaDoubleRDD seriesY = tupleRDD.mapPartitionsToDouble(new DoubleFlatMapFunction<Iterator<Tuple2<String, String>>>() {
public Iterable<Double> call(Iterator<Tuple2<String, String>> tuple2Iterator) throws Exception {
ArrayList<Double> strings = new ArrayList<Double>();
while (tuple2Iterator.hasNext()){
strings.add(Double.parseDouble(tuple2Iterator.next()._2)+1);
}
return strings;
}
});
//compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
//method is not specified, Pearson's method will be used by default.
double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");

JavaRDD<Vector> rowVectors = jsc.parallelize(Arrays.asList(
Vectors.dense(1, 0, 3),
Vectors.dense(2, 0, 4),
Vectors.dense(3, 0, 5)
));// note that each Vector is a row and not a column
Matrix correlation2 = Statistics.corr(rowVectors.rdd(), "spearman");
System.out.println(correlation2);
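// Printing the scalar Pearson correlation computed above as well; since seriesY is just seriesX
// shifted by a constant, it should come out as exactly 1.0:
System.out.println("Pearson correlation: " + correlation);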
4. Stratified sampling
/**
* @Title: Stratified sampling
*/
JavaRDD<Tuple2<String, String>> keyedRDD = jsc.parallelize(Lists.newArrayList(
new Tuple2<String, String>("cat", "11"),
new Tuple2<String, String>("dog", "22"),
new Tuple2<String, String>("cat", "33"),
new Tuple2<String, String>("pig", "44") ));
JavaPairRDD<String, String> data = keyedRDD.mapToPair(new PairFunction<Tuple2<String, String>, String, String>() {
public Tuple2<String, String> call(Tuple2<String, String> stringStringTuple2) throws Exception {
return new Tuple2<String, String>(stringStringTuple2._1, stringStringTuple2._2);
}
}); // an RDD of any key value pairs
Map<String, Double> fractions = new HashMap<String, Double>(); // specify the exact fraction desired from each key
fractions.put("cat",0.5); //对于每个key取值的概率
fractions.put("dog",0.8);
fractions.put("pig",0.8);
// Get an approximate sample from each stratum
JavaPairRDD<String, String> approxSample = data.sampleByKey(false, fractions);
// Get an exact sample from each stratum
JavaPairRDD<String, String> exactSample = data.sampleByKeyExact(false, fractions);
approxSample.foreach(new VoidFunction<Tuple2<String, String>>() {
    public void call(Tuple2<String, String> tuple) throws Exception {
        System.out.println(tuple);
    }
});
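// A quick way to compare the two strategies is to count how many records survive per key;
// countByKey is a standard JavaPairRDD action returning a Map of per-key record counts:
System.out.println("approx sample counts per key: " + approxSample.countByKey());
System.out.println("exact sample counts per key:  " + exactSample.countByKey());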
5. Hypothesis testing
/**
 * @Title: Hypothesis testing
 */
Vector vec = Vectors.dense(1, 2, 3, 4); // a vector composed of the frequencies of events
// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
// the test runs against a uniform distribution.
ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
// summary of the test including the p-value, degrees of freedom, test statistic, the method used,
// and the null hypothesis.
System.out.println(goodnessOfFitTestResult);

Matrix mat = Matrices.dense(3, 2, new double[]{1, 2, 3, 4, 5, 6}); // a contingency matrix
// conduct Pearson's independence test on the input contingency matrix
ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
// summary of the test including the p-value, degrees of freedom...
System.out.println(independenceTestResult);

JavaRDD<LabeledPoint> obs = MLUtils.loadLibSVMFile(jsc.sc(), "/data...").toJavaRDD(); // an RDD of labeled points
// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
// against the label.
ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
int i = 1;
for (ChiSqTestResult result : featureTestResults) {
System.out.println("Column " + i + ":");
System.out.println(result); // summary of the test
i++;
}

JavaDoubleRDD ksData = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, 0.3));
KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(ksData, "norm");
// summary of the test including the p-value, test statistic,
// and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis
System.out.println(testResult);
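// The target distribution's parameters can also be passed explicitly; for "norm" the optional
// parameters are the mean and standard deviation (with no parameters, the standard normal N(0, 1) is used):
KolmogorovSmirnovTestResult testResult2 =
        Statistics.kolmogorovSmirnovTest(ksData, "norm", 0.0, 1.0);
System.out.println(testResult2);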
6. Random data generation
/**
 * @Title: Random data generation: uniform, standard normal, or Poisson.
 */
// Generate a JavaDoubleRDD of 100 i.i.d. values drawn from the standard normal N(0, 1), in 2 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 100, 2);
// Apply a transform to get a random double RDD following `N(1, 4)`.
JavaRDD<Double> map = u.map(new Function<Double, Double>() {
public Double call(Double aDouble) throws Exception {
return 1.0 + 2.0 * aDouble;
}
});
map.foreach(new VoidFunction<Double>() {
public void call(Double aDouble) throws Exception {
System.out.println(aDouble);
}
});
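// RandomRDDs provides generators for other distributions as well, e.g. uniform U(0, 1) and Poisson;
// both come from the statically imported RandomRDDs API used above (shown here for illustration):
JavaDoubleRDD uniformRDD = uniformJavaRDD(jsc, 100L);
JavaDoubleRDD poissonRDD = poissonJavaRDD(jsc, 2.0, 100L);
System.out.println("uniform sample mean: " + uniformRDD.mean());
System.out.println("poisson sample mean: " + poissonRDD.mean());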
7. Kernel density estimation
/**
* @Title: Kernel density estimation
*/
JavaRDD<Double> samples = jsc.parallelize(Arrays.asList(1.0, 2.0, 3.0)); // an RDD of sample data
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
KernelDensity kd = new KernelDensity()
        .setSample(samples)
        .setBandwidth(3.0);
// Find density estimates for the given values
double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
for (double density : densities) {
    System.out.println(density);
}

        jsc.stop();
    }
}