Hive/Impala批量插入数据

问题描述

现有几千条数据，需要插入到对应的Hive/Impala表中。安排给了一个同事做，但是等了好久，反馈还没有插入完成……看到他的做法是：对每条数据进行处理转换为对应的insert语句，但是，实际执行起来，速度很慢，每条数据都要耗时1s左右。比在MySQL中批量插入数据慢多了，因而抱怨Impala不太好用

问题分析

首先，必须明确的是，把每条数据处理成insert语句的方式，肯定是最低效的，不管是在MySQL中，还是在分布式组件Hive、Impala中。

这种方式的资源消耗，更多的花在了连接、SQL语句的解析、执行计划生成上，实际插入数据的开销还是相对较少的。

所以，要提高批量数据的插入，关键是减少无谓的资源开销，提高一条SQL的吞吐率，即通过尽量少的SQL条数，插入更多的数据。

解决方案

测试数据：

aaa

bbb

ccc

ddd

eee

fff

ggg

hhh

iii

jjj

测试表：

create table if not exists test.test_batch_insert(

    f1 string

) comment 'test for batch insert'

row format delimited fields terminated by '\t' lines terminated by '\n'

stored as textfile;

方案1（最慢的）：数据转换为insert语句

step1：处理成sql语句

vim中：

%s/^/insert into test.test_batch_insert select '/g

%s/$/';/g

或者使用awk:

awk '{printf "insert into test.test_batch_insert select \"%s\";\n", $0}' test.txt > test.sql

生成的SQL脚本：

insert into test.test_batch_insert select "aaa";

insert into test.test_batch_insert select "bbb";

insert into test.test_batch_insert select "ccc";

insert into test.test_batch_insert select "ddd";

insert into test.test_batch_insert select "eee";

insert into test.test_batch_insert select "fff";

insert into test.test_batch_insert select "ggg";

insert into test.test_batch_insert select "hhh";

insert into test.test_batch_insert select "iii";

insert into test.test_batch_insert select "jjj";

step2：执行生成的SQL脚本

impala-shell -i data1 -f test.sql

一条条执行，比较慢……

方案2（相对快点）：一条SQL尽量插入多条数据

step1：转换成SQL

awk 'BEGIN{print "insert into test.test_batch_insert"; i=1; n=10} {if(i<n){ printf "select \"%s\" union\n", $0; i++} else {printf "select \"%s\";", $0}}' test.txt > test2.sql

vim %s 或者 sed也行

生成的SQL脚本：

insert into test.test_batch_insert

select "aaa" union

select "bbb" union

select "ccc" union

select "ddd" union

select "eee" union

select "fff" union

select "ggg" union

select "hhh" union

select "iii" union

select "jjj";

step2：执行生成的SQL

执行前，先清空表；

impala-shell -i data1 -f test2.sql

执行之后，会发现，不止快了一点点……

但是，这种方式有局限……

因为，一条SQL的长度是有限制的，数据量大了，只生成一条SQL，会导致超长，无法执行。此时，可以考虑分割文件：

split -l 500 test.txt test_split_

然后，编写脚本遍历每个文件分片，重复上述操作即可。

方案3（最快的，如果你没有更好的）

step1：首先查看下test.test_batch_insert的建表语句：

impala-shell -i data1 -B -q "show create table test.test_batch_insert"

建表语句如下：

Query: show create table test.test_batch_insert

"CREATE TABLE test.test_batch_insert (

  f1 STRING

)

 COMMENT 'test for batch insert'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'

WITH SERDEPROPERTIES ('field.delim'='\t', 'line.delim'='\n', 'serialization.format'='\t')

STORED AS TEXTFILE

LOCATION 'hdfs://xxxxxx:8020/user/hive/warehouse/test.db/test_batch_insert'

"

关注一下LOCATION属性，在HDFS上查看下该路径：

hdfs dfs -ls /user/hive/warehouse/test.db/test_batch_insert

然后，看下文件内容：

hdfs dfs -cat /user/hive/warehouse/test.db/test_batch_insert/*data.0.

发现了吧，就是可读的纯文本文件，每行都是一条数据。因为前面建表的时候，就指定了用\n作为记录分隔符。

看到这里，聪明的你，应该知道我接下来要做什么了……

step2：上传数据文件

首先，再次清空test.test_batch_insert；

然后，上传文件：

hdfs dfs -put test.txt /user/hive/warehouse/test.db/test_batch_insert

此时，在hive表中，应该能直接查询到数据了，impala中还需要刷新下表：

impala-shell命令行窗口中执行：

refresh test.test_batch_insert;

然后，搞定了……

其实，hive/impla类似于MySQL，有对应的load data的语句……这里只是把load data语句实际干的事展示了一下……