Kudu series: Java API usage and performance test

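Kudu + Impala is a good fit for data analysis, but inserting rows into a Kudu table with Impala INSERT ... VALUES statements is very slow: in my tests it reached only about 80 rows/second. The reason is easy to see. Kudu itself writes efficiently, but Impala is not optimized for this workload, and the per-statement overhead I observed is so high that frequent small-batch writes perform very poorly. The Kudu project recommends using the Java API or the Python API to write data. Below is a test case using the Java API, which also shows the general usage of the Kudu API.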

=========================
Prepare the test table
=========================

-- kudu table
CREATE TABLE kudu_testdb.tmp_test_perf
(
    id string ENCODING PLAIN_ENCODING COMPRESSION SNAPPY,
    int_value int,
    bigint_value bigint,
    timestamp_value timestamp,
    boolean_value int,
    PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 6
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'testdb.tmp_test_perf',
'kudu.master_addresses' = '10.0.0.100:7051,10.0.0.101:7051,10.0.0.101:7051',
'kudu.num_tablet_replicas' = '1'
)
;

=========================
Write the test Java program
=========================
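Notes on coding against the Kudu API:

1. Although column names of a Kudu table are case-insensitive in the Impala DDL, they are stored in lowercase at the Kudu level, so column names must be written in lowercase in the Kudu API.
2. The table name from the Impala DDL keeps its case exactly as written and is not converted to lowercase. Table names are case-sensitive in the Kudu API and must match the DDL exactly.
3. The Kudu API setter functions do not accept null, so check whether a value is null before assigning it to a column. For example, the following two lines fail: addLong takes a primitive long, and unboxing the null Long throws a NullPointerException.

Long longTmp=null;
row.addLong("bigint_value",longTmp);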
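The null check from note 3 can be sketched without a Kudu cluster. The RowLike interface and RecordingRow class below are hypothetical stand-ins invented for this illustration, not part of the Kudu client. In real code the two calls would presumably be PartialRow.addLong and PartialRow.setNull, and setNull only makes sense for a nullable column.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeSetterSketch {

    /** Hypothetical stand-in for a row: a primitive setter plus an explicit null setter. */
    interface RowLike {
        void addLong(String column, long value);
        void setNull(String column);
    }

    /** In-memory RowLike used only for this demonstration. */
    static class RecordingRow implements RowLike {
        final Map<String, Long> values = new HashMap<>();

        @Override
        public void addLong(String column, long value) {
            values.put(column, value);
        }

        @Override
        public void setNull(String column) {
            values.put(column, null);
        }
    }

    /** Returns true when unboxing the given Long into a primitive long throws. */
    static boolean unboxingThrows(Long boxed) {
        try {
            long primitive = boxed; // auto-unboxing; fails when boxed is null
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Writes the value when present; otherwise marks the column as null. */
    static void addNullableLong(RowLike row, String column, Long value) {
        if (value == null) {
            row.setNull(column);
        } else {
            row.addLong(column, value);
        }
    }

    public static void main(String[] args) {
        RecordingRow row = new RecordingRow();
        addNullableLong(row, "bigint_value", null);
        System.out.println("unboxing null throws: " + unboxingThrows(null));
        System.out.println("recorded values: " + row.values);
    }
}
```

Wrapping each setter in a small helper like addNullableLong keeps the null check in one place instead of repeating it before every assignment.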

package kudu_perf_test;

import java.sql.Timestamp;
import java.util.UUID;

import org.apache.kudu.client.*;

public class Test {

    private final static int OPERATION_BATCH = 500;

    // Test case that supports all three flush modes
    public static void insertTestGeneric(KuduSession session, KuduTable table, SessionConfiguration.FlushMode mode,
            int recordCount) throws Exception {
        // SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND
        // SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC
        // SessionConfiguration.FlushMode.MANUAL_FLUSH
        session.setFlushMode(mode);
        if (SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC != mode) {
            session.setMutationBufferSpace(OPERATION_BATCH);
        }
        int uncommit = 0;
        for (int i = 0; i < recordCount; i++) {
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            UUID uuid = UUID.randomUUID();
            row.addString("id", uuid.toString());
            row.addInt("int_value", 100);
            row.addLong("bigint_value", 10000L);
            Long gmtMillis;
            /* System.currentTimeMillis() returns milliseconds since 1970-01-01 (GMT). Kudu stores
             timestamps in microseconds, so the value must be multiplied by 1000.
             We are also in the UTC+8 time zone, so 8 hours are added to the long value; otherwise
             the time stored in Kudu is GMT, which is 8 hours behind UTC+8.
             */
            // Method 1: get the Unix milliseconds (GMT) for the current time
            gmtMillis = System.currentTimeMillis();
            // Method 2: convert a Timestamp to Unix milliseconds (GMT)
            Timestamp localTimestamp = new Timestamp(System.currentTimeMillis());
            gmtMillis = localTimestamp.getTime();
            // Shift the GMT milliseconds to UTC+8
            Long shanghaiTimezoneMillis = gmtMillis + 8 * 3600 * 1000;
            row.addLong("timestamp_value", shanghaiTimezoneMillis * 1000);
            session.apply(insert);
            // With manual flush, flush before the buffer fills up; here we flush once it is more than half full
            if (SessionConfiguration.FlushMode.MANUAL_FLUSH == mode) {
                uncommit = uncommit + 1;
                if (uncommit > OPERATION_BATCH / 2) {
                    session.flush();
                    uncommit = 0;
                }
            }
        }
        // With manual flush, make sure the final flush completes
        if (SessionConfiguration.FlushMode.MANUAL_FLUSH == mode && uncommit > 0) {
            session.flush();
        }
        // With background auto flush, make sure the final flush completes and that errors surface as exceptions
        if (SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND == mode) {
            session.flush();
            RowErrorsAndOverflowStatus error = session.getPendingErrors();
            if (error.isOverflowed() || error.getRowErrors().length > 0) {
                if (error.isOverflowed()) {
                    throw new Exception("Kudu overflow exception occurred.");
                }
                StringBuilder errorMessage = new StringBuilder();
                if (error.getRowErrors().length > 0) {
                    for (RowError errorObj : error.getRowErrors()) {
                        errorMessage.append(errorObj.toString());
                        errorMessage.append(";");
                    }
                }
                throw new Exception(errorMessage.toString());
            }
        }
    }

    // Test case for MANUAL_FLUSH only
    public static void insertTestManual(KuduSession session, KuduTable table, int recordCount) throws Exception {
        // SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND
        // SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC
        // SessionConfiguration.FlushMode.MANUAL_FLUSH
        SessionConfiguration.FlushMode mode = SessionConfiguration.FlushMode.MANUAL_FLUSH;
        session.setFlushMode(mode);
        session.setMutationBufferSpace(OPERATION_BATCH);
        int uncommit = 0;
        for (int i = 0; i < recordCount; i++) {
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            UUID uuid = UUID.randomUUID();
            row.addString("id", uuid.toString());
            row.addInt("int_value", 100);
            row.addLong("bigint_value", 10000L);
            Long gmtMillis;
            /* System.currentTimeMillis() returns milliseconds since 1970-01-01 (GMT). Kudu stores
             timestamps in microseconds, so the value must be multiplied by 1000.
             We are also in the UTC+8 time zone, so 8 hours are added to the long value; otherwise
             the time stored in Kudu is GMT, which is 8 hours behind UTC+8.
             */
            // Method 1: get the Unix milliseconds (GMT) for the current time
            gmtMillis = System.currentTimeMillis();
            // Method 2: convert a Timestamp to Unix milliseconds (GMT)
            Timestamp localTimestamp = new Timestamp(System.currentTimeMillis());
            gmtMillis = localTimestamp.getTime();
            // Shift the GMT milliseconds to UTC+8
            Long shanghaiTimezoneMillis = gmtMillis + 8 * 3600 * 1000;
            row.addLong("timestamp_value", shanghaiTimezoneMillis * 1000);
            session.apply(insert);
            // With manual flush, flush before the buffer fills up; here we flush once it is more than half full
            uncommit = uncommit + 1;
            if (uncommit > OPERATION_BATCH / 2) {
                session.flush();
                uncommit = 0;
            }
        }
        // With manual flush, make sure the final flush completes
        if (uncommit > 0) {
            session.flush();
        }
    }

    // Test case for AUTO_FLUSH_SYNC only
    public static void insertTestInAutoSync(KuduSession session, KuduTable table, int recordCount) throws Exception {
        // SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND
        // SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC
        // SessionConfiguration.FlushMode.MANUAL_FLUSH
        SessionConfiguration.FlushMode mode = SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC;
        session.setFlushMode(mode);
        for (int i = 0; i < recordCount; i++) {
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            UUID uuid = UUID.randomUUID();
            row.addString("id", uuid.toString());
            row.addInt("int_value", 100);
            row.addLong("bigint_value", 10000L);
            Long gmtMillis;
            /* System.currentTimeMillis() returns milliseconds since 1970-01-01 (GMT). Kudu stores
             timestamps in microseconds, so the value must be multiplied by 1000.
             We are also in the UTC+8 time zone, so 8 hours are added to the long value; otherwise
             the time stored in Kudu is GMT, which is 8 hours behind UTC+8.
             */
            // Method 1: get the Unix milliseconds (GMT) for the current time
            gmtMillis = System.currentTimeMillis();
            // Method 2: convert a Timestamp to Unix milliseconds (GMT)
            Timestamp localTimestamp = new Timestamp(System.currentTimeMillis());
            gmtMillis = localTimestamp.getTime();
            // Shift the GMT milliseconds to UTC+8
            Long shanghaiTimezoneMillis = gmtMillis + 8 * 3600 * 1000;
            row.addLong("timestamp_value", shanghaiTimezoneMillis * 1000);
            // In AUTO_FLUSH_SYNC mode, apply() completes the write to Kudu before it returns
            session.apply(insert);
        }
    }

    public static void test() throws KuduException {
        KuduClient client = new KuduClient.KuduClientBuilder("10.0.0.100:7051,10.0.0.101:7051,10.0.0.101:7051")
                .build();
        KuduSession session = client.newSession();
        KuduTable table = client.openTable("testdb.tmp_test_perf");
        SessionConfiguration.FlushMode mode;
        Timestamp d1 = null;
        Timestamp d2 = null;
        long millis;
        long seconds;
        // Number of rows to insert in each mode; with 0 nothing is inserted, so set it to the desired test size
        int recordCount = 0;
        try {
            mode = SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND;
            d1 = new Timestamp(System.currentTimeMillis());
            insertTestGeneric(session, table, mode, recordCount);
            d2 = new Timestamp(System.currentTimeMillis());
            millis = d2.getTime() - d1.getTime();
            // Total elapsed seconds; a "% 60" here would wrap around after one minute
            seconds = millis / 1000;
            System.out.println(mode.name() + " elapsed seconds: " + seconds);
            mode = SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC;
            d1 = new Timestamp(System.currentTimeMillis());
            insertTestInAutoSync(session, table, recordCount);
            d2 = new Timestamp(System.currentTimeMillis());
            millis = d2.getTime() - d1.getTime();
            seconds = millis / 1000;
            System.out.println(mode.name() + " elapsed seconds: " + seconds);
            mode = SessionConfiguration.FlushMode.MANUAL_FLUSH;
            d1 = new Timestamp(System.currentTimeMillis());
            insertTestManual(session, table, recordCount);
            d2 = new Timestamp(System.currentTimeMillis());
            millis = d2.getTime() - d1.getTime();
            seconds = millis / 1000;
            System.out.println(mode.name() + " elapsed seconds: " + seconds);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } finally {
            if (!session.isClosed()) {
                session.close();
            }
        }
    }

    public static void main(String[] args) {
        try {
            test();
        } catch (KuduException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("Done");
    }
}
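The timestamp handling in the program above (epoch milliseconds scaled to microseconds, plus a fixed 8-hour shift) can be sketched with java.time so that the offset comes from a zone instead of a hard-coded constant. This is only an illustration under that assumption: the class and method names are made up here, and whether to store shifted wall-clock values at all is a design choice carried over from the program, not a Kudu requirement.

```java
import java.time.Instant;
import java.time.ZoneId;

public class TimestampMicrosSketch {

    /** Converts Unix epoch milliseconds (UTC) to epoch microseconds. */
    static long toEpochMicros(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1000L);
    }

    /** Shifts by the zone's UTC offset at that instant, then converts to microseconds. */
    static long toWallClockMicros(long epochMillis, ZoneId zone) {
        int offsetSeconds = zone.getRules()
                .getOffset(Instant.ofEpochMilli(epochMillis))
                .getTotalSeconds();
        return toEpochMicros(epochMillis + offsetSeconds * 1000L);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("UTC micros:       " + toEpochMicros(now));
        System.out.println("UTC+8 wall clock: " + toWallClockMicros(now, ZoneId.of("Asia/Shanghai")));
    }
}
```

The result of either method is a long that could be passed to row.addLong for the timestamp_value column, as the program does.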

=========================
Performance test results
=========================
MANUAL_FLUSH mode: 8000 rows/second
AUTO_FLUSH_BACKGROUND mode: 8000 rows/second
AUTO_FLUSH_SYNC mode: 1000 rows/second
Impala SQL INSERT statement: 80 rows/second
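Figures like these come from dividing the row count by the elapsed time. A minimal sketch of that arithmetic follows; it uses System.nanoTime and a placeholder workload, both of which are assumptions for illustration and not the measurement code used for the numbers above.

```java
public class ThroughputSketch {

    /** Rows per second for a row count and an elapsed time in nanoseconds. */
    static double rowsPerSecond(long rows, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return rows * 1_000_000_000.0 / elapsedNanos;
    }

    /** Runs the task once and returns the elapsed wall-clock time in nanoseconds. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long rows = 100_000;
        long elapsed = timeNanos(() -> {
            long sum = 0;
            for (long i = 0; i < rows; i++) {
                sum += i; // placeholder workload standing in for the inserts
            }
        });
        // Guard against a zero reading on very coarse clocks
        System.out.println("rows/second: " + rowsPerSecond(rows, Math.max(1, elapsed)));
    }
}
```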

=========================
Kudu API usage summary
=========================
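1. Prefer MANUAL_FLUSH where possible. It gives the best performance, and if a write to Kudu fails, flush() throws an exception, so the logic is very clear.
2. When performance requirements are modest, AUTO_FLUSH_SYNC is also a good choice.
3. Use AUTO_FLUSH_BACKGROUND only in demo scenarios: without error handling the code can be very simple, and performance is also good. It is not recommended in production because rows may be inserted out of order, and the code becomes cumbersome once errors have to be caught.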
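The manual-flush pattern recommended in point 1 (buffer operations, flush at a threshold, flush once more at the end) can be sketched independently of Kudu. BatchingWriter below is a hypothetical stand-in for a session, written only for this illustration; its apply and flush methods mirror the roles of the session calls but are not the Kudu client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ManualFlushSketch {

    /** Buffers items and hands them to a sink in batches; a stand-in for apply()/flush(). */
    static class BatchingWriter<T> {
        private final int batchSize;
        private final Consumer<List<T>> sink;
        private final List<T> buffer = new ArrayList<>();

        BatchingWriter(int batchSize, Consumer<List<T>> sink) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            this.batchSize = batchSize;
            this.sink = sink;
        }

        /** Buffers one item and flushes once the batch is full. */
        void apply(T item) {
            buffer.add(item);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Sends buffered items to the sink; a sink failure propagates and the buffer is kept. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchingWriter<Integer> writer =
                new BatchingWriter<>(250, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 600; i++) {
            writer.apply(i);
        }
        writer.flush(); // final flush for the partial batch
        System.out.println(batchSizes); // prints [250, 250, 100]
    }
}
```

Because the sink is called synchronously inside flush, any failure surfaces at a well-defined point in the caller, which is the property that makes the manual mode easy to reason about.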
