Preface

While fetching HBase data through Spark, I ran into InputFormat. This article is mainly about InputFormat, and touches on Spark, MapReduce, and HBase along the way.

InputFormat

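InputFormat is the data source interface provided by MapReduce. Through it, MapReduce can read from all kinds of data sources (file systems, databases, and so on) and run computations over them.

With that in mind, let's look at the InputFormat source code.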

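public abstract class InputFormat<K, V> {

  /*
   * Returns the partitioning of the input. Each partition is wrapped in an
   * InputSplit, and the result is a List<InputSplit>.
   * Note that these are logical partitions.
   * For example, a file of 100 characters split at 10 characters per
   * partition yields 10 partitions.
   */
  public abstract
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;

  /*
   * Returns a RecordReader for the given split. A RecordReader is essentially
   * an enhanced iterator that yields key/value pairs.
   * Note that it takes a single InputSplit, i.e. a single partition, so the
   * iteration happens within one partition.
   */
  public abstract
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException,
                                                 InterruptedException;

}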

That gives a rough picture of what this interface is for: one method answers how to define partitions, the other how to get data from a partition. Next, let's see how this plays out in Spark.
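To make the two responsibilities concrete, here is a minimal sketch of the 100-character example from the source comment. It assumes nothing from Hadoop: `CharSplit` and `CharInputFormat` are hypothetical stand-ins for InputSplit and InputFormat, and a plain `java.util.Iterator` of (offset, character) pairs plays the role of the RecordReader.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for InputSplit: a logical [start, end) range over the input.
final class CharSplit {
    final int start;
    final int end;

    CharSplit(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// Hypothetical stand-in for InputFormat, backed by an in-memory string.
final class CharInputFormat {
    private final String data;
    private final int splitSize;

    CharInputFormat(String data, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        this.data = data;
        this.splitSize = splitSize;
    }

    // "How to define partitions": only computes ranges, reads no data.
    List<CharSplit> getSplits() {
        List<CharSplit> splits = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize) {
            splits.add(new CharSplit(start, Math.min(start + splitSize, data.length())));
        }
        return splits;
    }

    // "How to get data from a partition": iterates over one split only,
    // yielding (offset, character) pairs as key/value records.
    Iterator<Map.Entry<Integer, Character>> createRecordReader(CharSplit split) {
        return new Iterator<Map.Entry<Integer, Character>>() {
            private int pos = split.start;

            @Override
            public boolean hasNext() {
                return pos < split.end;
            }

            @Override
            public Map.Entry<Integer, Character> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException("End of split");
                }
                int offset = pos++;
                return new SimpleImmutableEntry<>(offset, data.charAt(offset));
            }
        };
    }
}
```

With a 100-character string and `new CharInputFormat(text, 10)`, `getSplits()` returns 10 splits, and reading every split in order through `createRecordReader` reproduces the original text. The splits themselves carry no data, which is what "logical partition" means here.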

TableInputFormat

The Spark side

Reading HBase data through Spark's MapReduce-compatible API always involves code like the following.

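// hbaseConfig            an HBaseConfiguration
// TableInputFormat       a subclass of InputFormat; identifies the input data source
// ImmutableBytesWritable the key type of the data source
// Result                 the value type of the data source
// If you have written MapReduce jobs, this resembles the job setup there,
// except that the output is always an RDD, so no output format is declared.
val hBaseRDD = sc.newAPIHadoopRDD(hbaseConfig, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])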

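So what actually happens here?

First, SparkContext creates an RDD:

new NewHadoopRDD(this, fClass, kClass, vClass, jconf)

And then that's it...

This is how Spark scheduling works: a job is only really submitted when an action is encountered, which I won't go into here. Skipping ahead to NewHadoopRDD itself, its two key methods, getPartitions() and compute(), correspond one-to-one to InputFormat's getSplits() and createRecordReader().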

·getPartitions()

override def getPartitions: Array[Partition] = {
  // Instantiate the InputFormat, i.e. the TableInputFormat we passed in
  // (it could be any other InputFormat; this is just the example at hand)
  val inputFormat = inputFormatClass.newInstance
  inputFormat match {
    case configurable: Configurable =>
      configurable.setConf(_conf)
    case _ =>
  }
  val jobContext = new JobContextImpl(_conf, jobId)
  // Get all the splits
  val rawSplits = inputFormat.getSplits(jobContext).toArray
  // Take the total number of splits and convert them to Spark's representation
  val result = new Array[Partition](rawSplits.size)
  for (i <- 0 until rawSplits.size) {
    // Wrap each split in a Partition
    result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
  }
  result
}

·compute()

The full method is long, so only the key parts are shown here.

// As before, instantiate the InputFormat
private val format = inputFormatClass.newInstance
format match {
  case configurable: Configurable =>
    configurable.setConf(conf)
  case _ =>
}

// Satisfy everything MapReduce expects...
private val attemptId = new TaskAttemptID(jobTrackerId, id, TaskType.MAP, split.index, 0)
private val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
private var finished = false
private var reader =
  try {
    // Obtain the all-important RecordReader
    val _reader = format.createRecordReader(
      split.serializableHadoopSplit.value, hadoopAttemptContext)
    _reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
    _reader
  } catch {
    case e: IOException if ignoreCorruptFiles =>
      logWarning(
        s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
        e)
      finished = true
      null
  }

// The familiar hasNext and next
override def hasNext: Boolean = {
  if (!finished && !havePair) {
    try {
      finished = !reader.nextKeyValue
    } catch {
      case e: IOException if ignoreCorruptFiles =>
        logWarning(
          s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
          e)
        finished = true
    }
    if (finished) {
      // Close and release the reader here; close() will also be called when the task
      // completes, but for tasks that read from many files, it helps to release the
      // resources early.
      close()
    }
    havePair = !finished
  }
  !finished
}

override def next(): (K, V) = {
  if (!hasNext) {
    throw new java.util.NoSuchElementException("End of stream")
  }
  havePair = false
  if (!finished) {
    inputMetrics.incRecordsRead(1)
  }
  if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    updateBytesRead()
  }
  (reader.getCurrentKey, reader.getCurrentValue)
}

A great deal of code is omitted here, but in essence it wraps the RecordReader in an Iterator (couldn't MapReduce have just used Iterator as its interface?).
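The wrapping pattern can be sketched in plain Java as follows. This is an illustration, not Spark's code: `SimpleRecordReader` is a hypothetical, simplified stand-in for Hadoop's RecordReader (no checked exceptions, no metrics, no corrupt-file handling), and `RecordReaderIterator` is an assumed name. The `havePair` and `finished` flags play the same role as in the excerpt above.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical, simplified stand-in for Hadoop's RecordReader.
interface SimpleRecordReader<K, V> {
    boolean nextKeyValue(); // advances to the next record; false when exhausted
    K getCurrentKey();
    V getCurrentValue();
    void close();
}

// Adapts the "advance, then read current" style to java.util.Iterator.
final class RecordReaderIterator<K, V> implements Iterator<Map.Entry<K, V>> {
    private final SimpleRecordReader<K, V> reader;
    private boolean havePair = false; // a record has been fetched but not yet returned
    private boolean finished = false; // the reader is exhausted

    RecordReaderIterator(SimpleRecordReader<K, V> reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        // Advance only if no fetched record is still waiting to be consumed,
        // so that repeated hasNext() calls do not skip records.
        if (!finished && !havePair) {
            finished = !reader.nextKeyValue();
            if (finished) {
                reader.close(); // release resources as soon as the input is exhausted
            }
            havePair = !finished;
        }
        return !finished;
    }

    @Override
    public Map.Entry<K, V> next() {
        if (!hasNext()) {
            throw new NoSuchElementException("End of stream");
        }
        havePair = false;
        return new SimpleImmutableEntry<>(reader.getCurrentKey(), reader.getCurrentValue());
    }
}
```

The adapter is needed because `nextKeyValue()` both tests for and moves to the next record, while `Iterator.hasNext()` must be safe to call any number of times. Buffering that one step in `havePair` reconciles the two.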

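That is roughly all Spark does; the rest is up to HBase.

The HBase side

TableInputFormat is provided by HBase for MapReduce compatibility; as it turns out, Spark gets to reuse it as well.

Let's go straight to the key code in TableInputFormat.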

·getSplits()

RegionSizeCalculator sizeCalculator =
    new RegionSizeCalculator(getRegionLocator(), getAdmin());
TableName tableName = getTable().getName();
Pair<byte[][], byte[][]> keys = getStartEndKeys();
if (keys == null || keys.getFirst() == null ||
    keys.getFirst().length == 0) {
  HRegionLocation regLoc =
      getRegionLocator().getRegionLocation(HConstants.EMPTY_BYTE_ARRAY, false);
  if (null == regLoc) {
    throw new IOException("Expecting at least one region.");
  }
  List<InputSplit> splits = new ArrayList<>(1);
  // Look up the size of this region (a size, not a region count); it is passed to the split below
  long regionSize = sizeCalculator.getRegionSize(regLoc.getRegionInfo().getRegionName());
  // Create the TableSplit, i.e. the InputSplit
  TableSplit split = new TableSplit(tableName, scan,
      HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY, regLoc
          .getHostnamePort().split(Addressing.HOSTNAME_PORT_SEPARATOR)[0], regionSize);
  splits.add(split);
  // ... (excerpt ends here; the rest of the method is omitted)
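The excerpt above covers only the branch where no region start keys are available. For the general case, a reasonable mental model is one split per region, clipped to the scan's row range. The following is a minimal sketch of that idea in plain Java, not HBase's actual code: `RangeSplit`, `RegionSplitter` and `splitsFor` are hypothetical names, row keys are plain strings instead of byte arrays, and an empty string stands for an open bound.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TableSplit: a row-key range plus the host serving it.
final class RangeSplit {
    final String startRow; // inclusive; "" means "from the start of the table"
    final String endRow;   // exclusive; "" means "to the end of the table"
    final String host;

    RangeSplit(String startRow, String endRow, String host) {
        this.startRow = startRow;
        this.endRow = endRow;
        this.host = host;
    }
}

final class RegionSplitter {
    // regionStarts[i], regionEnds[i] and hosts[i] describe region i.
    // scanStart and scanStop bound the scan; "" means unbounded on that side.
    static List<RangeSplit> splitsFor(String[] regionStarts, String[] regionEnds,
                                      String[] hosts, String scanStart, String scanStop) {
        List<RangeSplit> splits = new ArrayList<>();
        for (int i = 0; i < regionStarts.length; i++) {
            String regionStart = regionStarts[i];
            String regionEnd = regionEnds[i];

            // Skip regions that do not overlap the scan range at all.
            boolean scanStartsBeforeRegionEnd =
                scanStart.isEmpty() || regionEnd.isEmpty() || scanStart.compareTo(regionEnd) < 0;
            boolean scanStopsAfterRegionStart =
                scanStop.isEmpty() || scanStop.compareTo(regionStart) > 0;
            if (!scanStartsBeforeRegionEnd || !scanStopsAfterRegionStart) {
                continue;
            }

            // Clip the region's range to the scan's range.
            String splitStart =
                (scanStart.isEmpty() || regionStart.compareTo(scanStart) >= 0)
                    ? regionStart : scanStart;
            String splitStop =
                ((scanStop.isEmpty() || regionEnd.compareTo(scanStop) <= 0) && !regionEnd.isEmpty())
                    ? regionEnd : scanStop;

            splits.add(new RangeSplit(splitStart, splitStop, hosts[i]));
        }
        return splits;
    }
}
```

For three regions covering ["", "g"), ["g", "p") and ["p", ""), an unbounded scan produces three splits, one per region, while a scan over ["c", "k") produces two: ["c", "g") and ["g", "k"). Recording the host on each split is what lets a scheduler place the task near the data.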

·createRecordReader()

final TableRecordReader trr =
    this.tableRecordReader != null ? this.tableRecordReader : new TableRecordReader();
Scan sc = new Scan(this.scan);
sc.setStartRow(tSplit.getStartRow());
sc.setStopRow(tSplit.getEndRow());
trr.setScan(sc);
trr.setTable(getTable());
return new RecordReader<ImmutableBytesWritable, Result>() {

  @Override
  public void close() throws IOException {
    trr.close();
    closeTable();
  }

  @Override
  public ImmutableBytesWritable getCurrentKey() throws IOException, InterruptedException {
    return trr.getCurrentKey();
  }

  @Override
  public Result getCurrentValue() throws IOException, InterruptedException {
    return trr.getCurrentValue();
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return trr.getProgress();
  }

  @Override
  public void initialize(InputSplit inputsplit, TaskAttemptContext context) throws IOException,
      InterruptedException {
    trr.initialize(inputsplit, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return trr.nextKeyValue();
  }
};

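This should be fairly clear: an elaborate way of creating a RecordReader.

Summary

For MapReduce compatibility, Spark provides interfaces such as hadoopRDD(), and HBase provides interfaces such as TableInputFormat. Together they let Spark read data from HBase, though this is by no means the only way to do it.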

