Spark2: Data Exploration

 freqItems

 sampleBy

 cov

 crosstab

 approxQuantile

 bloomFilter (Bloom filter)

 corr (Pearson correlation coefficient)

 countMinSketch

Spark 2 provides a stat method on Dataset/DataFrame. It returns a DataFrameStatFunctions object whose methods can be used for exploratory data analysis.

df.stat.freqItems(Seq("age"))

1 freqItems

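There are four overloads:

freqItems(cols:Seq[String]):DataFrame

freqItems(cols:Seq[String],support:Double):DataFrame

freqItems(cols:Array[String]):DataFrame

freqItems(cols:Array[String],support:Double):DataFrame

Finds the frequent items in each of the given columns. The result holds one array per input column, containing that column's distinct frequent items. support is the minimum frequency for an item to count as frequent and defaults to 1%; items that occur less often than the threshold are ignored. The algorithm is approximate, so the result may contain false positives.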

val rows = Seq.tabulate(100) { i =>

  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)

}

val df = spark.createDataFrame(rows).toDF("a", "b")

// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns

// "a" and "b"

val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)

freqSingles.show()

+-----------+-------------+

|a_freqItems|  b_freqItems|

+-----------+-------------+

|    [1, 99]|[-1.0, -99.0]|

+-----------+-------------+

// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"

val pairDf = df.select(struct("a", "b").as("a-b"))

val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)

freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()

+----------+

|   freq_ab|

+----------+

|  [1,-1.0]|

|   ...    |

+----------+

2 sampleBy

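There are two overloads, one taking a Scala Map and one taking a Java Map:

sampleBy[T](col:String,fractions:Map[T,Double],seed:Long):DataFrame

sampleBy[T](col:String,fractions:java.util.Map[T,java.lang.Double],seed:Long):DataFrame

Performs stratified sampling on a column: it returns a stratified sample without replacement, using the sampling fraction given for each stratum. A stratum that is not listed in fractions is treated as having a fraction of 0.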

val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),

  (3, 3))).toDF("key", "value")

val fractions = Map(1 -> 1.0, 3 -> 0.5)

df.stat.sampleBy("key", fractions, 36L).show()

+---+-----+

|key|value|

+---+-----+

|  1|    1|

|  1|    2|

|  3|    2|

+---+-----+

3 cov

cov(col1:String,col2:String):Double

Computes the sample covariance of two numeric columns.

val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))

  .withColumn("rand2", rand(seed=27))

df.stat.cov("rand1", "rand2")

res1: Double = 0.065...

4 crosstab

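crosstab(col1:String,col2:String):DataFrame

A cross tabulation gives the frequency distribution of a set of variables and is widely used in statistics. For example, when analyzing car rental data you may need to count how many times each customer (name) rented each brand of car (brand); crosstab does this directly. A table that classifies data by several variables or features at once is called a cross-classification summary table, and it is mainly used to check whether two variables are related or independent.

Computes a pair-wise frequency table of the given columns, also known as a contingency table. Each column should have fewer than 1e4 distinct values, and at most 1e6 non-zero pair counts are returned. The first column of each row holds the distinct values of col1, and the remaining column names are the distinct values of col2. The first column is named $col1_$col2. Pairs that never occur get a count of zero. In PySpark, DataFrame.crosstab() and DataFrameStatFunctions.crosstab() are aliases.
Parameters: ● col1 – name of the first column. Its distinct values become the first item of each row.
　　 ● col2 – name of the second column. Its distinct values become the column names of the DataFrame.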

val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))

  .toDF("key", "value")

val ct = df.stat.crosstab("key", "value")

ct.show()

+---------+---+---+---+

|key_value|  1|  2|  3|

+---------+---+---+---+

|        2|  2|  0|  1|

|        1|  1|  1|  0|

|        3|  0|  1|  1|

+---------+---+---+---+
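The car rental scenario mentioned at the start of this section can be sketched roughly as follows. The rental records, the customer and brand values, and the application name are all made up for illustration, and the snippet assumes a local SparkSession (in spark-shell the existing spark value can be reused instead).

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, reuse the provided `spark`.
val spark = SparkSession.builder().master("local[*]").appName("crosstab-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rental records: one row per rental, (customer name, car brand).
val rentals = Seq(
  ("alice", "audi"), ("alice", "audi"), ("alice", "bmw"),
  ("bob", "bmw"), ("bob", "ford")
).toDF("name", "brand")

// One row per customer, one count column per brand; the first column is "name_brand".
val ct = rentals.stat.crosstab("name", "brand")
ct.show()

// Counts are stored as Long values, so a single cell can be read with getLong.
val aliceAudi = ct.filter($"name_brand" === "alice").select("audi").head().getLong(0)
println(s"alice rented audi $aliceAudi times")
```

Brands a customer never rented show up as 0 rather than being left out, which makes the table easy to scan by eye.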

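5 approxQuantile

approxQuantile(cols:Array[String],probabilities:Array[Double],relativeError:Double):Array[Array[Double]]

approxQuantile(col:String,probabilities:Array[Double],relativeError:Double):Array[Double]

Computes approximate quantiles. null and NaN values are ignored before the computation; if the column is empty or contains only null and NaN values, an empty array is returned.

cols: the columns to compute quantiles for.
probabilities: the quantile positions, each in [0,1]; 0 is the minimum, 0.5 is the median, and 1 is the maximum.
relativeError: the relative error; the smaller the value, the more accurate the result and the higher the computation cost. A value of 0 computes exact quantiles.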
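A minimal sketch of the single-column overload is shown here. It assumes a local SparkSession (in spark-shell the existing spark value can be reused instead); the column name x, the application name, and the choice of 0.05 as the looser error bound are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, reuse the provided `spark`.
val spark = SparkSession.builder().master("local[*]").appName("approx-quantile-sketch").getOrCreate()

// One numeric column "x" holding the integers 1..100.
val df = spark.range(1, 101).toDF("x")

// Quartiles with relativeError = 0.0, which requests exact quantiles (the most expensive setting).
val exact = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.0)

// A looser error bound trades accuracy for a cheaper computation.
val approx = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.05)

println(s"exact:  ${exact.mkString(", ")}")
println(s"approx: ${approx.mkString(", ")}")
```

The returned array has one entry per requested probability, in the same order as the probabilities argument.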

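6 bloomFilter (Bloom filter)

A classic way to test whether an element belongs to a set, trading accuracy for space. There are four overloads:

bloomFilter(col:Column,expectedNumItems:Long,numBits:Long):BloomFilter

bloomFilter(colName:String,expectedNumItems:Long,numBits:Long):BloomFilter

bloomFilter(col:Column,expectedNumItems:Long,fpp:Double):BloomFilter

bloomFilter(colName:String,expectedNumItems:Long,fpp:Double):BloomFilter

colName/col: the column to build the Bloom filter over.

expectedNumItems: the number of items expected to be put into the Bloom filter.

numBits: the expected number of bits in the Bloom filter, i.e. the length of its bit array.

fpp: the filter's false positive probability, i.e. the chance that a value that was never inserted is reported as present. The larger the value, the more often absent values are wrongly reported as present.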
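The fpp-based overload could be exercised as in the sketch below. It assumes a local SparkSession (in spark-shell the existing spark value can be reused instead); the id column, the sizes, and the 3% target are arbitrary values picked for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, reuse the provided `spark`.
val spark = SparkSession.builder().master("local[*]").appName("bloom-filter-sketch").getOrCreate()

// 1000 rows with ids 0..999.
val df = spark.range(0, 1000).toDF("id")

// Size the filter for about 1000 items and a 3% target false positive probability.
val bf = df.stat.bloomFilter("id", 1000L, 0.03)

// Every inserted value is reported as present: a Bloom filter has no false negatives.
val allPresent = (0L until 1000L).forall(id => bf.mightContainLong(id))

// Values that were never inserted are usually rejected, but some may slip through.
val falsePositives = (1000L until 2000L).count(id => bf.mightContainLong(id))

println(s"all inserted ids found: $allPresent")
println(s"false positives among 1000 unseen ids: $falsePositives")
println(s"bits used: ${bf.bitSize()}, expected fpp: ${bf.expectedFpp()}")
```

A "no" answer from the filter is always correct, while a "yes" answer only means the value was probably inserted.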

Reference 1: http://lxw1234.com/archives/2015/12/580.htm

Reference 2: https://www.jianshu.com/p/b0c0edf7686e

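7 corr (Pearson correlation coefficient)

There are two overloads:

corr(col1:String,col2:String):Double

corr(col1:String,col2:String,method:String):Double

Computes the correlation between two numeric columns. Only the Pearson correlation coefficient is currently supported, so method must be "pearson".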

val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))

  .withColumn("rand2", rand(seed=27))

df.stat.corr("rand1", "rand2", "pearson")

res1: Double = 0.613...

8 countMinSketch

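Estimates approximate item frequencies on large data sets. It is based on hashing and trades accuracy for space and time: an estimate can be too high but never too low, and the sketch needs only a fixed amount of memory and computation time, no matter how many distinct items are counted. For low-frequency items the relative error of the estimate can be large.

countMinSketch(col:Column,eps:Double,confidence:Double,seed:Int):CountMinSketch

countMinSketch(col:Column,depth:Int,width:Int,seed:Int):CountMinSketch

countMinSketch(colName:String,eps:Double,confidence:Double,seed:Int):CountMinSketch

countMinSketch(colName:String,depth:Int,width:Int,seed:Int):CountMinSketch

col: the column to compute the sketch for

depth: the depth of the sketch

width: the width of the sketch

seed: the random seed

eps: the relative error

confidence: the confidence level, i.e. the probability that an estimate exceeds the true count by no more than eps times the total count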
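The eps/confidence overload might be used as in this sketch. It assumes a local SparkSession (in spark-shell the existing spark value can be reused instead); the word column, its contents, and the parameter values 0.01, 0.95 and 42 are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, reuse the provided `spark`.
val spark = SparkSession.builder().master("local[*]").appName("count-min-sketch").getOrCreate()
import spark.implicits._

// A small, made-up word column: "spark" x3, "scala" x2, "java" x1.
val words = Seq("spark", "scala", "spark", "java", "scala", "spark").toDF("word")

// Build the sketch with eps = 0.01, confidence = 0.95, seed = 42.
val cms = words.stat.countMinSketch("word", 0.01, 0.95, 42)

// Estimates never fall below the true count, but they may exceed it.
val sparkCount = cms.estimateCount("spark")
val scalaCount = cms.estimateCount("scala")

println(s"spark: $sparkCount, scala: $scalaCount, total items: ${cms.totalCount()}")
println(s"depth: ${cms.depth()}, width: ${cms.width()}")
```

On a data set this small the estimates will typically match the true counts; overestimates only become likely once many distinct items share the same cells of the sketch.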