Importing Data into HBase with Spark

Cluster environment: one master and three slaves, with Spark running in Spark on YARN mode.

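There are several ways to import data into HBase with Spark:

1. Small amounts of data: call the single-row or batch methods of the HBase API directly.

2. Large amounts of data: generate HFiles first, then load the HFiles into HBase.

The rest of this post covers the second approach.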

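This approach relies mainly on two methods of the Spark Java API:

1. textFile: turns a local file or an HDFS file into an RDD.

2. flatMapToPair: merges all the key-value objects produced from each line into a single Iterator and returns it (needed for multiple families and multiple columns).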
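To make the role of flatMapToPair concrete, here is a minimal sketch of the same one-line-to-many-pairs expansion. It uses `java.util.stream` instead of Spark so that it runs without a cluster; the class name `FlatMapSketch`, the helper `toCells`, the three-field line format, and the qualifiers `A` and `B` are all hypothetical and exist only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapSketch {

    // Hypothetical helper: expands one CSV line "rowKey,a,b" into two
    // (rowKey, "qualifier=value") pairs, one per column.
    static Stream<Map.Entry<String, String>> toCells(String line) {
        List<Map.Entry<String, String>> cells = new ArrayList<>();
        String[] f = line.split(",");
        if (f.length != 3) {
            // Malformed line: return an empty stream, never null.
            return cells.stream();
        }
        cells.add(new SimpleEntry<>(f[0], "A=" + f[1]));
        cells.add(new SimpleEntry<>(f[0], "B=" + f[2]));
        return cells.stream();
    }

    // Flattens the per-line streams into one list of pairs,
    // which is what flatMapToPair does for an RDD.
    static List<Map.Entry<String, String>> expand(List<String> lines) {
        return lines.stream()
                .flatMap(FlatMapSketch::toCells)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("r2,x,y", "bad", "r1,p,q");
        for (Map.Entry<String, String> cell : expand(lines)) {
            System.out.println(cell.getKey() + " -> " + cell.getValue());
        }
    }
}
```

In the Spark job the pairs are `Tuple2<ImmutableBytesWritable, KeyValue>` rather than string entries, but the shape is the same: each input line yields zero or more pairs, and malformed lines yield an empty iterator.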

The full code is as follows:

package scala;

import java.util.ArrayList;

import java.util.Iterator;

import java.util.List;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.hbase.HBaseConfiguration;

import org.apache.hadoop.hbase.KeyValue;

import org.apache.hadoop.hbase.TableName;

import org.apache.hadoop.hbase.client.Admin;

import org.apache.hadoop.hbase.client.Connection;

import org.apache.hadoop.hbase.client.ConnectionFactory;

import org.apache.hadoop.hbase.client.Table;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;

import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

import org.apache.hadoop.hbase.util.Bytes;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.spark.SparkConf;

import org.apache.spark.api.java.JavaPairRDD;

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.function.PairFlatMapFunction;

import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class HbaseBulkLoad {

    private static final String ZKconnect="slave1,slave2,slave3:2181";

    private static final String HDFS_ADDR="hdfs://master:8020";

    private static final String TABLE_NAME="DBSTK.STKFSTEST"; // table name

    private static final String COLUMN_FAMILY="FS"; // column family

    public static void run(String[] args) throws Exception {

        Configuration configuration = HBaseConfiguration.create();

        configuration.set("hbase.zookeeper.quorum", ZKconnect);

        configuration.set("fs.defaultFS", HDFS_ADDR);

        configuration.set("dfs.replication", "1");

        String inputPath = args[0];

        String outputPath = args[1];

        Job job = Job.getInstance(configuration, "Spark Bulk Loading HBase Table:" + TABLE_NAME);

        job.setInputFormatClass(TextInputFormat.class);

        job.setMapOutputKeyClass(ImmutableBytesWritable.class); // output key class

        job.setMapOutputValueClass(KeyValue.class); // output value class

        job.setOutputFormatClass(HFileOutputFormat2.class);

        FileInputFormat.addInputPaths(job, inputPath); // input path

        FileSystem fs = FileSystem.get(configuration);

        Path output = new Path(outputPath);

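        if (fs.exists(output)) {

            fs.delete(output, true); // delete the output path if it already exists

        }

        // Do not close fs here: FileSystem.get() returns a cached instance that is reused later in this job

        FileOutputFormat.setOutputPath(job, output); // HFile output path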

        // Initialize the SparkContext. The master comes from spark-submit (--master),

        // because a master hard-coded with setMaster() takes precedence over that flag

        SparkConf sparkConf = new SparkConf().setAppName("HbaseBulkLoad");

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Read the data file

        JavaRDD<String> lines = jsc.textFile(inputPath);

        lines.persist(StorageLevel.MEMORY_AND_DISK_SER());

        JavaPairRDD<ImmutableBytesWritable,KeyValue> hfileRdd =

                lines.flatMapToPair(new PairFlatMapFunction<String, ImmutableBytesWritable, KeyValue>() {

            private static final long serialVersionUID = 1L;

            @Override

            public Iterator<Tuple2<ImmutableBytesWritable, KeyValue>> call(String text) throws Exception {

                List<Tuple2<ImmutableBytesWritable, KeyValue>> tps = new ArrayList<Tuple2<ImmutableBytesWritable, KeyValue>>();

                if(null == text || text.length()<1){

                    return tps.iterator(); // must not return null

                }

                String[] resArr = text.split(",");

                if(resArr != null && resArr.length == 14){

                    byte[] rowkeyByte = Bytes.toBytes(resArr[0]+resArr[3]+resArr[4]+resArr[5]);

                    byte[] columnFamily = Bytes.toBytes(COLUMN_FAMILY);

                    ImmutableBytesWritable ibw = new ImmutableBytesWritable(rowkeyByte);

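                    // EP,HP,LP,MK,MT,SC,SN,SP,ST,SY,TD,TM,TQ,UX (lexicographic order)

                    // Note: row keys, column families, and columns must all be in lexicographic order here; multiple column families must be sorted the same way. Row key ordering is left to Spark's sortByKey.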

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("EP"),Bytes.toBytes(resArr[9]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("HP"),Bytes.toBytes(resArr[7]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("LP"),Bytes.toBytes(resArr[8]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("MK"),Bytes.toBytes(resArr[13]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("MT"),Bytes.toBytes(resArr[4]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("SC"),Bytes.toBytes(resArr[0]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("SN"),Bytes.toBytes(resArr[1]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("SP"),Bytes.toBytes(resArr[6]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("ST"),Bytes.toBytes(resArr[5]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("SY"),Bytes.toBytes(resArr[2]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("TD"),Bytes.toBytes(resArr[3]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("TM"),Bytes.toBytes(resArr[11]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("TQ"),Bytes.toBytes(resArr[10]))));

                    tps.add(new Tuple2<>(ibw,new KeyValue(rowkeyByte, columnFamily, Bytes.toBytes("UX"),Bytes.toBytes(resArr[12]))));

                }

                return tps.iterator();

            }

        }).sortByKey();

        Connection connection = ConnectionFactory.createConnection(configuration);

        TableName tableName = TableName.valueOf(TABLE_NAME);

        HFileOutputFormat2.configureIncrementalLoad(job, connection.getTable(tableName), connection.getRegionLocator(tableName));

        // Generate the HFiles

        hfileRdd.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, job.getConfiguration());

        // bulk load start

        Table table = connection.getTable(tableName);

        Admin admin = connection.getAdmin();

        LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);

        load.doBulkLoad(new Path(outputPath), admin,table,connection.getRegionLocator(tableName));

        // Release HBase resources before stopping Spark

        table.close();

        admin.close();

        connection.close();

        jsc.close();

    }

    public static void main(String[] args) {

        try {

            long start = System.currentTimeMillis();

            args = new String[]{"hdfs://master:8020/test/test.txt","hdfs://master:8020/test/hfile/test"};

            run(args);

            long end = System.currentTimeMillis();

            System.out.println("Data imported successfully, total time: "+(end-start)/1000+"s");

        } catch(Exception e) {

            e.printStackTrace();

        }

    }

}
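The comments in the listing stress that the columns of each row must be emitted in lexicographic order. One way to avoid keeping fourteen `tps.add(...)` calls sorted by hand is to hold the qualifier-to-field mapping in a `TreeMap` and iterate over it. The sketch below does this with plain strings instead of `KeyValue` objects so that it runs without HBase on the classpath; the class name `SortedCellsSketch` and the `rowKey/FS:qualifier=value` output format are hypothetical. It assumes ASCII-only qualifier names, for which `String` ordering and byte ordering agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedCellsSketch {

    // Qualifier -> CSV field index, copied from the listing above.
    // Entries are added in field order; the TreeMap returns them
    // sorted by qualifier name when iterated.
    static final TreeMap<String, Integer> QUALIFIER_TO_FIELD = new TreeMap<>();
    static {
        QUALIFIER_TO_FIELD.put("SC", 0);
        QUALIFIER_TO_FIELD.put("SN", 1);
        QUALIFIER_TO_FIELD.put("SY", 2);
        QUALIFIER_TO_FIELD.put("TD", 3);
        QUALIFIER_TO_FIELD.put("MT", 4);
        QUALIFIER_TO_FIELD.put("ST", 5);
        QUALIFIER_TO_FIELD.put("SP", 6);
        QUALIFIER_TO_FIELD.put("HP", 7);
        QUALIFIER_TO_FIELD.put("LP", 8);
        QUALIFIER_TO_FIELD.put("EP", 9);
        QUALIFIER_TO_FIELD.put("TQ", 10);
        QUALIFIER_TO_FIELD.put("TM", 11);
        QUALIFIER_TO_FIELD.put("UX", 12);
        QUALIFIER_TO_FIELD.put("MK", 13);
    }

    // Same row key composition as the listing: fields 0, 3, 4 and 5.
    static String rowKey(String[] f) {
        return f[0] + f[3] + f[4] + f[5];
    }

    // Returns one "rowKey/FS:qualifier=value" string per column,
    // already in lexicographic qualifier order.
    static List<String> toCells(String line) {
        List<String> cells = new ArrayList<>();
        String[] f = line.split(",");
        if (f.length != 14) {
            return cells; // skip malformed lines
        }
        String key = rowKey(f);
        for (Map.Entry<String, Integer> e : QUALIFIER_TO_FIELD.entrySet()) {
            cells.add(key + "/FS:" + e.getKey() + "=" + f[e.getValue()]);
        }
        return cells;
    }

    public static void main(String[] args) {
        String line = "f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13";
        for (String cell : toCells(line)) {
            System.out.println(cell);
        }
    }
}
```

In the real job the loop body would build a `KeyValue` from the row key, the column family, the qualifier and the field value instead of a string.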

Package the code, upload it to the cluster, and run the following command:

./spark-submit --master yarn-client --executor-memory 4G --driver-memory 1G --num-executors 100 --executor-cores 4 --total-executor-cores 400 \
--conf spark.default.parallelism=1000 --class scala.HbaseBulkLoad /home/hadoop/app/hadoop/data/spark-hbase-test.jar

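This run only tested importing 50,000 records. In a separate test with 15 GB of data (about 150 million records), the import was slower than with MapReduce.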
