Notes on Spark bulk load into HBase

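1. The existing third-party packages do not fully cover what we need
- Official: hbase-spark, which cannot set the timestamp
- unicredit/hbase-rdd: the interface is too complex, and it cannot handle multiple column families at the same time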

2. An HFile must be sorted, and the sort order is defined by KeyValue.KVComparator, so we define our own comparator that delegates to KeyValue.KVComparator internally
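
To make the sort requirement concrete, here is a minimal sketch of the ordering that cells must follow inside an HFile: row, then family, then qualifier in ascending byte order, then timestamp in descending order. SimpleCell and SimpleCellOrder are hypothetical names used only for illustration; the real KeyValue.KVComparator also compares the cell type and works on the serialized key, which this sketch leaves out.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical, simplified stand-in for an HBase KeyValue (no value, no type byte).
final case class SimpleCell(row: String, family: String, qualifier: String, timestamp: Long)

object SimpleCellOrder {
  // Unsigned lexicographic comparison of two byte arrays.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  private def cmp(x: String, y: String): Int =
    compareBytes(x.getBytes(UTF_8), y.getBytes(UTF_8))

  // Row, family, qualifier ascending; timestamp descending (newest first).
  implicit val ordering: Ordering[SimpleCell] = new Ordering[SimpleCell] {
    def compare(x: SimpleCell, y: SimpleCell): Int = {
      val byRow = cmp(x.row, y.row)
      if (byRow != 0) return byRow
      val byFamily = cmp(x.family, y.family)
      if (byFamily != 0) return byFamily
      val byQualifier = cmp(x.qualifier, y.qualifier)
      if (byQualifier != 0) return byQualifier
      java.lang.Long.compare(y.timestamp, x.timestamp)
    }
  }
}
```

Sorting a Seq[SimpleCell] with SimpleCellOrder.ordering places the newer of two versions of the same column first, which is the same relative order the bulk load code needs within each partition.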

3. Without a custom partitioner, the following exception is very likely to occur:
ERROR: "java.io.IOException: Retry attempted 10 times without completing, bailing out"
https://community.hortonworks.com/content/supportkb/150138/error-javaioioexception-retry-attempted-10-times-w.html

The custom partitioner is based on: https://github.com/unicredit/hbase-rdd/blob/master/src/main/scala/unicredit/spark/hbase/HFileSupport.scala
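
The core idea of such a partitioner can be sketched without any HBase dependency: given the region start keys in sorted order (the first one is the empty byte array), a row key belongs to the last region whose start key is less than or equal to it. The names StartKeyPartitioning and partitionFor are hypothetical, and the byte comparison is a plain reimplementation for this sketch rather than a call into HBase.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object StartKeyPartitioning {
  // Unsigned lexicographic comparison of two byte arrays.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  // Returns the index of the region that contains rowKey.
  // startKeys must be sorted; startKeys(0) is the start key of the first region.
  def partitionFor(rowKey: Array[Byte], startKeys: Array[Array[Byte]]): Int = {
    // Position of the first start key (from index 1 on) that is greater than rowKey.
    val next = (1 until startKeys.length).indexWhere(i => compareBytes(rowKey, startKeys(i)) < 0)
    if (next >= 0) next else startKeys.length - 1
  }

  def bytes(s: String): Array[Byte] = s.getBytes(UTF_8)
}
```

With start keys "", "g" and "p", the key "apple" maps to partition 0, "g" and "melon" to partition 1, and "zebra" to partition 2, so every partition holds rows of exactly one region.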

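4. Many blog posts include the following code. At first I took it as a way to partition the RDD, but it does nothing of the kind: these are MapReduce job parameters, and they have no effect in Spark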
val job = Job.getInstance(hbaseConfig)
HFileOutputFormat2.configureIncrementalLoad(job, table.getTableDescriptor, regionLocator)
job.getConfiguration

Other points:
1. Implementing the Serializable interface in Scala
2. HFilePartitioner: uses HBase's regionLocator.getStartKeys to split the Puts in the RDD by row key into partitions; each partition produces one HFile, which corresponds to one HBase region
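
For the first point, a minimal sketch of a Scala class that goes through Java serialization while keeping a derived field out of the serialized form. CachedRecord and SerializationRoundTrip are hypothetical names; the derived field here is a hex string, standing in for any cached object that should be rebuilt after deserialization rather than written out.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical example class: `row` is serialized, `cached` is not.
class CachedRecord(val row: Array[Byte]) extends Serializable {
  // @transient fields are skipped by Java serialization and are null after reading.
  @transient private var cached: String = _

  // Rebuilds the derived value on first use, also after deserialization.
  def rendered: String = {
    if (cached == null) cached = row.map(b => f"${b & 0xff}%02x").mkString
    cached
  }
}

object SerializationRoundTrip {
  // Serializes a value to bytes and reads it back, as a shuffle between executors would.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The copy returned by roundTrip carries the same row bytes, and its cached field is recomputed lazily on the first call to rendered.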

Code (to be tidied up later):

import java.time.LocalDate
import java.util.UUID

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{CellUtil, KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.Logger
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

import HFilePartitionerHelper.HFilePartitioner

object BulkloadHelper {

  private val logger = Logger.getLogger(this.getClass)

  def bulkloadWrite(rdd: RDD[Put], hbaseConfig: Configuration, thisTableName: TableName): Unit = {
    val hbaseConnection = ConnectionFactory.createConnection(hbaseConfig)
    try {
      // Fetch the region start keys once and reuse them below.
      val regionLocator = hbaseConnection.getRegionLocator(thisTableName)
      val startKeys = try regionLocator.getStartKeys finally regionLocator.close()
      val myPartitioner = HFilePartitioner.apply(hbaseConfig, startKeys, 1)
      logger.info(s"regionLocator.getStartKeys.length = ${startKeys.length}")
      startKeys.foreach(key => logger.info("regionLocator.getStartKeys: " + Bytes.toStringBinary(key)))

      val hFilePath = getHFilePath()
      logger.info(s"bulkload, begin to write to hdfs path: $hFilePath")

      // HFile sort order: KeyValue.KVComparator / CellComparator
      rdd.flatMap(put => putToKeyValueList(put))
        .map(c => (c, 1))
        .repartitionAndSortWithinPartitions(myPartitioner) // repartition so each hfile can match the hbase region
        .map(tuple => (new ImmutableBytesWritable(tuple._1.row), tuple._1.getKeyValue()))
        .saveAsNewAPIHadoopFile(
          hFilePath,
          classOf[ImmutableBytesWritable],
          classOf[KeyValue],
          classOf[HFileOutputFormat2],
          hbaseConfig)

      // Bulk load the HFiles into HBase
      logger.info("bulkload, begin to load to hbase")
      val bulkLoader = new LoadIncrementalHFiles(hbaseConfig)
      val table = new HTable(hbaseConfig, thisTableName)
      try bulkLoader.doBulkLoad(new Path(hFilePath), table) finally table.close()

      logger.info("bulkload, delete hdfs path")
      // FileSystem instances are cached and shared within the JVM, so the instance is not closed here.
      val outputPath = new Path(hFilePath)
      outputPath.getFileSystem(new Configuration()).delete(outputPath, true)
      logger.info("bulkload, done")
    } finally {
      hbaseConnection.close()
    }
  }

  def getHFilePath(): String = "hdfs:///user/hadoop/hbase/bulkload/hfile/" + LocalDate.now().toString + "-" + UUID.randomUUID().toString

  /**
    * Converts every cell of a Put into a MyKeyValue.
    * @param put the Put whose cells are converted
    */
  def putToKeyValueList(put: Put): Seq[MyKeyValue] = {
    put.getFamilyCellMap.values.asScala
      .flatMap(_.asScala) // all cells of all families
      .map(cell => new MyKeyValue(put.getRow, CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell), cell.getTimestamp, CellUtil.cloneValue(cell)))
      .toSeq
  }
}

class MyKeyValue(val row: Array[Byte], val family: Array[Byte], val qualifier: Array[Byte], val timestamp: Long, val value: Array[Byte])
  extends Serializable with Ordered[MyKeyValue] {

  import java.io.IOException
  import java.io.ObjectInputStream

  // Cached KeyValue; excluded from serialization and rebuilt from the fields above.
  @transient private var keyValue: KeyValue = _

  def getKeyValue(): KeyValue = {
    if (keyValue == null) {
      keyValue = new KeyValue(row, family, qualifier, timestamp, value)
    }
    keyValue
  }

  // Default serialization writes only the constructor fields, so no writeObject is needed.
  // After reading them back, the cached KeyValue is recreated.
  @throws[IOException]
  @throws[ClassNotFoundException]
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    getKeyValue()
  }

  override def compare(that: MyKeyValue): Int =
    MyKeyValue.comparator.compare(this.getKeyValue(), that.getKeyValue())

  override def toString: String = getKeyValue().toString
}

object MyKeyValue {
  // One shared comparator per JVM; it lives in the companion object, so it is never serialized with a record.
  private val comparator = new KeyValue.KVComparator()
}

object HFilePartitionerHelper {

  object HFilePartitioner {
    def apply(conf: Configuration, splits: Array[Array[Byte]], numFilesPerRegionPerFamily: Int): HFilePartitioner = {
      if (numFilesPerRegionPerFamily == 1)
        new SingleHFilePartitioner(splits)
      else {
        // Clamp the requested number of files to the range [1, configured maximum].
        val fraction = 1 max numFilesPerRegionPerFamily min conf.getInt(LoadIncrementalHFiles.MAX_FILES_PER_REGION_PER_FAMILY, 32)
        new MultiHFilePartitioner(splits, fraction)
      }
    }
  }

  // Not protected: the type is returned by apply and used from BulkloadHelper.
  abstract class HFilePartitioner extends Partitioner {
    // The partition is chosen from the row key only.
    def extractKey(n: Any): Array[Byte] = n match {
      case kv: MyKeyValue => kv.row
      case other => throw new IllegalArgumentException(s"Unsupported key: $other")
    }
  }

  // Several HFiles per region: rows of one region are spread over `fraction` partitions.
  private class MultiHFilePartitioner(splits: Array[Array[Byte]], fraction: Int) extends HFilePartitioner {
    override def getPartition(key: Any): Int = {
      val k = extractKey(key)
      // Array.hashCode is identity-based, so hash the content to keep partitioning deterministic.
      val h = (java.util.Arrays.hashCode(k) & Int.MaxValue) % fraction
      for (i <- 1 until splits.length)
        if (Bytes.compareTo(k, splits(i)) < 0) return (i - 1) * fraction + h
      (splits.length - 1) * fraction + h
    }

    override def numPartitions: Int = splits.length * fraction
  }

  // One HFile per region: partition i holds the rows between splits(i) and splits(i + 1).
  private class SingleHFilePartitioner(splits: Array[Array[Byte]]) extends HFilePartitioner {
    override def getPartition(key: Any): Int = {
      val k = extractKey(key)
      for (i <- 1 until splits.length)
        if (Bytes.compareTo(k, splits(i)) < 0) return i - 1
      splits.length - 1
    }

    override def numPartitions: Int = splits.length
  }
}
