Spark2.0自定义累加器

Spark2.0 自定义累加器

在2.0中使用自定义累加器需要继承AccumulatorV2这个抽象类,同时必须对以下6个方法进行实现:

1.reset 方法: 将累加器进行重置;

abstract defreset(): Unit

Resets this accumulator, which is zero value.

2.add 方法: 向累加器中添加另一个值;

abstract defadd(v: IN): Unit

3.merge方法: 合并另一个类型相同的累加器;

abstract defmerge(other: AccumulatorV2[IN, OUT]): Unit

Merges another same-type accumulator into this one and update its state, i.e.

4.value 取值

abstract defvalue: OUT

Defines the current value of this accumulator

5.复制:Creates a new copy of this accumulator.

abstract defcopy(): AccumulatorV2[IN, OUT]

abstract defisZero: Boolean

Returns if this accumulator is zero value or not.

需要注意的是,对累加器的更新只有在action中生效,spark对累加器的每个task的更新只会应用一次,即重新启动的任务不会更新累加器的值.而在transform中需要注意,每个任务可能会多次进行更新,如果task或者job被重复执行.同时累加器不会改变spark的lazy策略.

由于业务需求经常要构造若干Dataframe间数据的映射关系,而使用collectionAccumulator又要有一定量的重复性的Map操作, 故写了这个生成Map的自定义累加器,IN为代表key和value的String 类型的tuple,最后生成Map, 如果累加器中已经含有了要添加的key且 key->value不重复则以字符串||对value进行分隔,并更新累加器的值;

代码如下:

/**
  * Created by Namhwik on 2016/12/27.
  */
class MapAccumulator extends AccumulatorV2[(String,String),mutable.Map[String, String]] {
  private  val mapAccumulator = mutable.Map[String,String]()
  def add(keyAndValue:((String,String))): Unit ={
    val key = keyAndValue._1
    val value = keyAndValue._2
    if (!mapAccumulator.contains(key))
      mapAccumulator += key->value
    else if(mapAccumulator.get(key).get!=value) {
      mapAccumulator += key->(mapAccumulator.get(key).get+"||"+value)
    }
  }
  def isZero: Boolean = {
    mapAccumulator.isEmpty
  }
  def copy(): AccumulatorV2[((String,String)),mutable.Map[String, String]] ={
    val newMapAccumulator = new  MapAccumulator()
    mapAccumulator.foreach(x=>newMapAccumulator.add(x))
    newMapAccumulator
  }
  def value: mutable.Map[String,String] = {
    mapAccumulator
  }
  def merge(other:AccumulatorV2[((String,String)),mutable.Map[String, String]]) = other match
  {
    case map:MapAccumulator => {
      other.value.foreach(x =>
        if (!this.value.contains(x._1))
          this.add(x)
        else
          x._2.split("\\|\\|").foreach(
            y => {
              if (!this.value.get(x._1).get.split("\\|\\|").contains(y))
                this.add(x._1, y)
            }
          )
      )
    }
    case _  =>
      throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
  def reset(): Unit ={
    mapAccumulator.clear()
  }
}

参考 <http://spark.apache.org/docs/latest/programming-guide.html>

ps:使用的时候需要register.

Spark2.0自定义累加器的更多相关文章

初识Spark2.0之Spark SQL
内存计算平台spark在今年6月份的时候正式发布了spark2.0,相比上一版本的spark1.6版本,在内存优化,数据组织,流计算等方面都做出了较大的改变,同时更加注重基于DataFrame数据组织 ...
Spark2.0机器学习系列之4：Logistic回归及Binary分类（二分问题）结果评估
参数设置 α: 梯度上升算法迭代时候权重更新公式中包含 α : http://blog.csdn.net/lu597203933/article/details/38468303 为了更好理解 α和 ...
Spark中自定义累加器Accumulator
1. 自定义累加器自定义累加器需要继承AccumulatorParam,实现addInPlace和zero方法. 例1:实现Long类型的累加器 object LongAccumulatorPara ...
spark2.0新特性之DataSet
1.Spark SQL,DataFrame,DataSet的错误类型检测时机 spark SQL:其类型检测与语法检测是在运行时检测的 DataFrame:在spark2.0以前的版本中,DataFr ...
geotrellis使用（二十五）将Geotrellis移植到spark2.0
目录前言升级spark到2.0 将geotrellis最新版部署到spark2.0(CDH) 总结一.前言事情总是变化这么快,前面刚写了一篇博客介绍如何将geotrellis移植 ...
Ubuntu14.04或16.04下安装JDK1.8+Scala+Hadoop2.7.3+Spark2.0.2
为了将Hadoop和Spark的安装简单化,今日写下此帖. 首先,要看手头有多少机器,要安装伪分布式的Hadoop+Spark还是完全分布式的,这里分别记录. 1. 伪分布式安装伪分布式的Hadoo ...
maven+spark2.0.0最大连通分量
运用到了spark2.0.0的grarhx包,要手动的在pom.xml里面添加依赖包,要什么就在里面添加依赖,然后在run->maven install
Eclipse+maven+scala2.11.8+spark2.0.0的环境部署
主要在maven-for-scalaIDE纠结了,因为在eclipse版本是luna4.x 里面有自己带有的maven. 根据网上面无脑的下一步下一步,出现了错误,在此讲解各个插件的用途,以此新人看见 ...
spark2.0.1 安装配置
1. 官网下载 wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.1-bin-hadoop2.7.tgz 2. 解压 tar -zxvf spar ...

随机推荐

EasyUI ComboGrid的绑定，上下键和回车事件，输入条件查询
首先我们先看一下前台的绑定事件 1.先定义标签 <input id="cmbXm" type="text" style="width: 100p ...
hdu 1082, stack emulation, and how to remove redundancy 分类： hdoj 2015-07-16 02:24 86人阅读评论(0) 收藏
use fgets, and remove the potential '\n' in the string's last postion. (main point) remove redundanc ...
用Advanced Installer制作DotNetBar for Windows Forms 12.0.0.1_冰河之刃重打包版详解
关于 DotNetBar for Windows Forms 12.0.0.1_冰河之刃重打包版 --------------------11.8.0.8_冰河之刃重打包版-------------- ...
ASP.NET简单实现APP中用户个人头像上传和裁剪
最近有个微信项目的用户个人中心模块中,客户要求用户头像不仅仅只是上传图片,还需要能对图片进行裁剪.考虑到flash在IOS和Android上的兼容性问题,于是想着能从js这块入手,在网上发现了devo ...
jquery 显示弹出层可利用动画效果
1 show()方法和hide()方法 $("selector").show() 从display:none还原元素默认或已设置的display属性$("selecto ...
Birt使用总结
把report放到其他服务器要重新建立Data Source ,这是配置,拷贝项目时不会同时拷贝 (1)在EXTJs中利用Report实现报表的刷新 Ext.getCmp("showview ...
IOS 设置导航栏
//设置导航栏的标题 self.navigationItem setTitle:@"我的标题"; //设置导航条标题属性:字体大小/字体颜色…… /*设置头的属性:setTitle ...
EXCEL里面的数字显示为文本不用科学计数法显示
1. 在输入这一串数字前加撇号“'”(英文状态下的单引号)即可.2. 先将这一列设置为“文本”格式,然后直接输入这一串数字即可. 已经输入好了数字,那估计你这些数字的后三位都已经全变成“0”了,用 ...
学习Find函数和select
Find函数其实就类似于在excel按下Ctrl+F出现的查找功能:在某个区域中查找你要找的字符,一旦找到就定位到第一个对应的单元格.所以Find函数的返回值是个单元格,也就是个range值.举例,s ...
__autoload的小tip
__autoload魔法函数不仅在如 $a=new testClass();时可以触发.在 class a extends b时如果b类的定义在当前页未找到,也可以触发这个函数

Spark2.0自定义累加器

Spark2.0自定义累加器的更多相关文章

随机推荐

热门专题