一、Writing to HBase from Spark

The HBase client wraps data in Put objects and supports both single-row and batch inserts. Spark additionally provides two built-in ways to write to HBase: saveAsHadoopDataset and saveAsNewAPIHadoopDataset. To compare their performance, the same dataset is written with each of these approaches.

The Maven dependencies are as follows:

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>1.4.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-common -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>1.4.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-server -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>1.4.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-protocol -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-protocol</artifactId>
      <version>1.4.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-cli/commons-cli -->
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>1.4</version>
    </dependency>

1. Single-row Put inserts
1.1 Create the table in the HBase shell

create 'keyword1',{NAME=>'info',BLOCKSIZE=>'16384',BLOCKCACHE=>'false'},{NUMREGIONS=>10,SPLITALGO=>'HexStringSplit'}

1.2 Code

    val start_time1 = new Date().getTime
    keyword.foreachPartition(records => {
      HBaseUtils1x.init()
      records.foreach(f => {
        val keyword = f.getString(0)
        val app_id = f.getString(1)
        val catalog_name = f.getString(2)
        val keyword_catalog_pv = f.getString(3)
        val keyword_catalog_pv_rate = f.getString(4)
        val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
        val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
        HBaseUtils1x.insertData(tableName1, HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
      })
      HBaseUtils1x.closeConnection()
    })
    val end_time1 = new Date().getTime
    println("Elapsed time for single-row HBase puts: " + (end_time1 - start_time1))

2. Batch Put inserts
2.1 Create the table

create 'keyword2',{NAME=>'info',BLOCKSIZE=>'16384',BLOCKCACHE=>'false'},{NUMREGIONS=>10,SPLITALGO=>'HexStringSplit'}

2.2 Code

    val start_time2 = new Date().getTime
    keyword.foreachPartition(records => {
      HBaseUtils1x.init()
      val puts = ArrayBuffer[Put]()
      records.foreach(f => {
        val keyword = f.getString(0)
        val app_id = f.getString(1)
        val catalog_name = f.getString(2)
        val keyword_catalog_pv = f.getString(3)
        val keyword_catalog_pv_rate = f.getString(4)
        val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
        val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
        try {
          puts.append(HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
        } catch {
          case e: Throwable => println(f)
        }
      })
      import collection.JavaConverters._
      HBaseUtils1x.addDataBatchEx(tableName2, puts.asJava)
      HBaseUtils1x.closeConnection()
    })
    val end_time2 = new Date().getTime
    println("Elapsed time for batch HBase puts: " + (end_time2 - start_time2))

3. Writing with saveAsHadoopDataset

saveAsHadoopDataset writes the RDD to any Hadoop-supported storage system using the old Hadoop API, taking a Hadoop JobConf object for that storage system. The JobConf specifies an OutputFormat and any required output settings (such as the table or path to write to), just as you would configure a Hadoop MapReduce job.
3.1 Create the table

create 'keyword3',{NAME=>'info',BLOCKSIZE=>'16384',BLOCKCACHE=>'false'},{NUMREGIONS=>10,SPLITALGO=>'HexStringSplit'}

3.2 Code

    val start_time3 = new Date().getTime
    keyword.rdd.map(f => {
      val keyword = f.getString(0)
      val app_id = f.getString(1)
      val catalog_name = f.getString(2)
      val keyword_catalog_pv = f.getString(3)
      val keyword_catalog_pv_rate = f.getString(4)
      val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
      val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
      (new ImmutableBytesWritable, HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
    }).saveAsHadoopDataset(HBaseUtils1x.getJobConf(tableName3))
    val end_time3 = new Date().getTime
    println("Elapsed time for saveAsHadoopDataset: " + (end_time3 - start_time3))

4. Writing with saveAsNewAPIHadoopDataset

saveAsNewAPIHadoopDataset writes the RDD to any Hadoop-supported storage system using the new Hadoop API, taking a Hadoop Configuration object for that storage system. The Configuration specifies an OutputFormat and any required output settings, just as you would configure a Hadoop MapReduce job.
4.1 Create the table

create 'keyword4',{NAME=>'info',BLOCKSIZE=>'16384',BLOCKCACHE=>'false'},{NUMREGIONS=>10,SPLITALGO=>'HexStringSplit'}

4.2 Code

    val start_time4 = new Date().getTime
    keyword.rdd.map(f => {
      val keyword = f.getString(0)
      val app_id = f.getString(1)
      val catalog_name = f.getString(2)
      val keyword_catalog_pv = f.getString(3)
      val keyword_catalog_pv_rate = f.getString(4)
      val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
      val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
      (new ImmutableBytesWritable, HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
    }).saveAsNewAPIHadoopDataset(HBaseUtils1x.getNewJobConf(tableName4, spark.sparkContext))
    val end_time4 = new Date().getTime
    println("Elapsed time for saveAsNewAPIHadoopDataset: " + (end_time4 - start_time4))

5. Performance comparison

From the elapsed times printed by the code above, the saveAsHadoopDataset and saveAsNewAPIHadoopDataset approaches clearly outperform both single-row and batch Put inserts.

二、Reading HBase from Spark

The newAPIHadoopRDD API turns an HBase table into an RDD; it is used as follows:

    val start_time1 = new Date().getTime
    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(HBaseUtils1x.getNewConf(tableName1),
      classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println(hbaseRdd.count())
    hbaseRdd.foreach {
      case (_, result) => {
        // get the row key
        val rowKey = Bytes.toString(result.getRow)
        val keyword = Bytes.toString(result.getValue(cf.getBytes(), "keyword".getBytes()))
        // the values were written as strings, so read them back with Bytes.toString
        val keyword_catalog_pv_rate = Bytes.toString(result.getValue(cf.getBytes(), "keyword_catalog_pv_rate".getBytes()))
        println(rowKey + "," + keyword + "," + keyword_catalog_pv_rate)
      }
    }
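getNewConf supplies the Configuration that newAPIHadoopRDD expects: it names the input table and serializes a Scan into the TableInputFormat.SCAN property. As defined in the HBaseUtils1x listing in section 三:

    def getNewConf(tableName: String) = {
      val conf = HBaseConfiguration.create()
      conf.set("hbase.zookeeper.quorum", "lee")
      conf.set("hbase.zookeeper.property.clientPort", "2181")
      conf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.INPUT_TABLE, tableName)
      // serialize an (empty) Scan so TableInputFormat knows what to read
      val scan = new Scan()
      conf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.SCAN,
        Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))
      conf
    }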

三、Complete code

SparkRWHBase.scala:

    package com.sparkStudy.utils

    import java.util.Date
    import org.apache.hadoop.hbase.client.{Put, Result}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.{Bytes, MD5Hash}
    import org.apache.spark.sql.SparkSession
    import scala.collection.mutable.ArrayBuffer

    /**
      * @Author: JZ.lee
      * @Description: Spark read/write HBase performance comparison
      * @Date: 18-8-28 4:28 PM
      * @Modified By:
      */
    object SparkRWHBase {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SparkRWHBase")
          .master("local[2]")
          .config("spark.some.config.option", "some-value")
          .getOrCreate()

        val keyword = spark.read
          .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
          .option("header", false)
          .option("delimiter", ",")
          .load("file:/opt/data/keyword_catalog_day.csv")

        val tableName1 = "keyword1"
        val tableName2 = "keyword2"
        val tableName3 = "keyword3"
        val tableName4 = "keyword4"
        val cf = "info"
        val columns = Array("keyword", "app_id", "catalog_name", "keyword_catalog_pv", "keyword_catalog_pv_rate")

        // 1. Single-row Put inserts
        val start_time1 = new Date().getTime
        keyword.foreachPartition(records => {
          HBaseUtils1x.init()
          records.foreach(f => {
            val keyword = f.getString(0)
            val app_id = f.getString(1)
            val catalog_name = f.getString(2)
            val keyword_catalog_pv = f.getString(3)
            val keyword_catalog_pv_rate = f.getString(4)
            val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
            val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
            HBaseUtils1x.insertData(tableName1, HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
          })
          HBaseUtils1x.closeConnection()
        })
        val end_time1 = new Date().getTime
        println("Elapsed time for single-row HBase puts: " + (end_time1 - start_time1))

        // 2. Batch Put inserts
        val start_time2 = new Date().getTime
        keyword.foreachPartition(records => {
          HBaseUtils1x.init()
          val puts = ArrayBuffer[Put]()
          records.foreach(f => {
            val keyword = f.getString(0)
            val app_id = f.getString(1)
            val catalog_name = f.getString(2)
            val keyword_catalog_pv = f.getString(3)
            val keyword_catalog_pv_rate = f.getString(4)
            val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
            val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
            try {
              puts.append(HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
            } catch {
              case e: Throwable => println(f)
            }
          })
          import collection.JavaConverters._
          HBaseUtils1x.addDataBatchEx(tableName2, puts.asJava)
          HBaseUtils1x.closeConnection()
        })
        val end_time2 = new Date().getTime
        println("Elapsed time for batch HBase puts: " + (end_time2 - start_time2))

        // 3. saveAsHadoopDataset (old Hadoop API)
        val start_time3 = new Date().getTime
        keyword.rdd.map(f => {
          val keyword = f.getString(0)
          val app_id = f.getString(1)
          val catalog_name = f.getString(2)
          val keyword_catalog_pv = f.getString(3)
          val keyword_catalog_pv_rate = f.getString(4)
          val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
          val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
          (new ImmutableBytesWritable, HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
        }).saveAsHadoopDataset(HBaseUtils1x.getJobConf(tableName3))
        val end_time3 = new Date().getTime
        println("Elapsed time for saveAsHadoopDataset: " + (end_time3 - start_time3))

        // 4. saveAsNewAPIHadoopDataset (new Hadoop API)
        val start_time4 = new Date().getTime
        keyword.rdd.map(f => {
          val keyword = f.getString(0)
          val app_id = f.getString(1)
          val catalog_name = f.getString(2)
          val keyword_catalog_pv = f.getString(3)
          val keyword_catalog_pv_rate = f.getString(4)
          val rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(keyword + app_id)).substring(0, 8)
          val cols = Array(keyword, app_id, catalog_name, keyword_catalog_pv, keyword_catalog_pv_rate)
          (new ImmutableBytesWritable, HBaseUtils1x.getPutAction(rowKey, cf, columns, cols))
        }).saveAsNewAPIHadoopDataset(HBaseUtils1x.getNewJobConf(tableName4, spark.sparkContext))
        val end_time4 = new Date().getTime
        println("Elapsed time for saveAsNewAPIHadoopDataset: " + (end_time4 - start_time4))

        // Read the table back as an RDD
        val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(HBaseUtils1x.getNewConf(tableName1),
          classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
        println(hbaseRdd.count())
        hbaseRdd.foreach {
          case (_, result) => {
            // get the row key
            val rowKey = Bytes.toString(result.getRow)
            val keyword = Bytes.toString(result.getValue(cf.getBytes(), "keyword".getBytes()))
            // the values were written as strings, so read them back with Bytes.toString
            val keyword_catalog_pv_rate = Bytes.toString(result.getValue(cf.getBytes(), "keyword_catalog_pv_rate".getBytes()))
            println(rowKey + "," + keyword + "," + keyword_catalog_pv_rate)
          }
        }
      }
    }

HBaseUtils1x.scala:

    package com.sparkStudy.utils

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.client.BufferedMutator.ExceptionListener
    import org.apache.hadoop.hbase.client._
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil
    import org.apache.hadoop.hbase.util.{Base64, Bytes}
    import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
    import org.apache.hadoop.mapred.JobConf
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext
    import org.slf4j.LoggerFactory

    /**
      * @Author: JZ.Lee
      * @Description: HBase 1.x create/read/update/delete helpers
      * @Date: Created at 11:02 AM 18-8-14
      * @Modified By:
      */
    object HBaseUtils1x {
      private val LOGGER = LoggerFactory.getLogger(this.getClass)
      private var connection: Connection = null
      private var conf: Configuration = null

      def init() = {
        conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "lee")
        connection = ConnectionFactory.createConnection(conf)
      }

      def getJobConf(tableName: String) = {
        val conf = HBaseConfiguration.create()
        val jobConf = new JobConf(conf)
        jobConf.set("hbase.zookeeper.quorum", "lee")
        jobConf.set("hbase.zookeeper.property.clientPort", "2181")
        jobConf.set(org.apache.hadoop.hbase.mapred.TableOutputFormat.OUTPUT_TABLE, tableName)
        jobConf.setOutputFormat(classOf[org.apache.hadoop.hbase.mapred.TableOutputFormat])
        jobConf
      }

      def getNewConf(tableName: String) = {
        conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "lee")
        conf.set("hbase.zookeeper.property.clientPort", "2181")
        conf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.INPUT_TABLE, tableName)
        val scan = new Scan()
        conf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.SCAN,
          Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))
        conf
      }

      // called as getNewJobConf(tableName, spark.sparkContext) above, so the SparkContext
      // parameter is declared here even though it is not used directly
      def getNewJobConf(tableName: String, sc: SparkContext) = {
        val conf = HBaseConfiguration.create()
        // Constants is an external config object (not shown in this post)
        conf.set("hbase.zookeeper.quorum", Constants.ZOOKEEPER_SERVER_NODE)
        conf.set("hbase.zookeeper.property.clientPort", "2181")
        conf.set("hbase.defaults.for.version.skip", "true")
        conf.set(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.OUTPUT_TABLE, tableName)
        conf.setClass("mapreduce.job.outputformat.class",
          classOf[org.apache.hadoop.hbase.mapreduce.TableOutputFormat[String]],
          classOf[org.apache.hadoop.mapreduce.OutputFormat[String, Mutation]])
        new JobConf(conf)
      }

      def closeConnection(): Unit = {
        connection.close()
      }

      def getGetAction(rowKey: String): Get = {
        val getAction = new Get(Bytes.toBytes(rowKey))
        getAction.setCacheBlocks(false)
        getAction
      }

      def getPutAction(rowKey: String, familyName: String, column: Array[String], value: Array[String]): Put = {
        val put: Put = new Put(Bytes.toBytes(rowKey))
        for (i <- 0 until column.length) {
          put.add(Bytes.toBytes(familyName), Bytes.toBytes(column(i)), Bytes.toBytes(value(i)))
        }
        put
      }

      def insertData(tableName: String, put: Put) = {
        val name = TableName.valueOf(tableName)
        val table = connection.getTable(name)
        table.put(put)
      }

      def addDataBatchEx(tableName: String, puts: java.util.List[Put]): Unit = {
        val name = TableName.valueOf(tableName)
        val table = connection.getTable(name)
        val listener = new ExceptionListener {
          override def onException(e: RetriesExhaustedWithDetailsException, bufferedMutator: BufferedMutator): Unit = {
            for (i <- 0 until e.getNumExceptions) {
              LOGGER.info("Failed to write put: " + e.getRow(i))
            }
          }
        }
        val params = new BufferedMutatorParams(name)
          .listener(listener)
          .writeBufferSize(4 * 1024 * 1024)
        try {
          val mutator = connection.getBufferedMutator(params)
          mutator.mutate(puts)
          mutator.close()
        } catch {
          case e: Throwable => e.printStackTrace()
        }
      }
    }

https://blog.csdn.net/baymax_007/article/details/82191188

