1. Overview

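When storing business data, the volume of Hive table data on HDFS keeps growing as the business grows, and keeping it on HDFS in plain Text format consumes a great deal of capacity. We therefore need a way to reduce the storage cost. Hive offers the ORC file format, which can greatly reduce the storage footprint. In this post I share how to append streaming data to a Hive ORC table.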

2. Content

2.1 ORC

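First we need to know what Hive's ORC is. Before ORC, Hive had the RC file format; ORC adds many optimizations on top of RC files. It provides an efficient way to store Hive data, and using ORC files improves Hive's performance when reading, writing and processing data. Its advantages include: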

Reduced load on the NameNode
Support for complex data types (such as list, map, struct and so on)
Indexes stored inside the file
Block compression
...

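The structure diagram (from the Apache ORC website) is shown below:

I won't list every feature here. For more details, see the introduction on the official website: [link]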

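2.2 Usage

Now that we know the structure of an ORC file and what it is for, how do we use an ORC table? As an example, let's create a table that handles stream records. The SQL is as follows: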

create table alerts ( id int , msg string )
     partitioned by (continent string, country string)
     clustered by (id) into 5 buckets
     stored as orc tblproperties("transactional"="true"); -- currently ORC is required for streaming

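Note that a table used for Streaming must be stored as ORC, be bucketed, and have "transactional"="true" set. The table in this example is partitioned as well.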
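To make the `clustered by (id) into 5 buckets` clause more concrete, here is a minimal sketch of how an `id` could map to a bucket. It assumes the classic rule for integer bucketing columns (the hash is the int value itself, and the bucket is `(hash & Integer.MAX_VALUE) % numBuckets`); newer Hive versions may hash differently, and `BucketSketch` is a hypothetical helper, not a Hive class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: BucketSketch is a hypothetical helper, not part of Hive.
final class BucketSketch {

    // Assumed rule for an int bucketing column: the hash is the value itself,
    // the sign bit is masked off, and the result is taken modulo the bucket count.
    static int bucketFor(int id, int numBuckets) {
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups ids by the bucket they would land in under the rule above.
    static Map<Integer, List<Integer>> assign(int[] ids, int numBuckets) {
        Map<Integer, List<Integer>> buckets = new TreeMap<>();
        for (int id : ids) {
            buckets.computeIfAbsent(bucketFor(id, numBuckets), k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // ids 1..9, spread over the 5 buckets declared in the DDL
        System.out.println(assign(new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9}, 5));
    }
}
```

Under this assumption, rows with the same `id` always end up in the same bucket file, which is why the bucketing column should be chosen so that values spread evenly.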

Next, let's insert some data to simulate the Streaming flow. The code is as follows:

String dbName = "testing";
String tblName = "alerts";
// Partition values, in the same order as the partition columns (continent, country)
ArrayList<String> partitionVals = new ArrayList<String>(2);
partitionVals.add("Asia");
partitionVals.add("India");
String serdeClass = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";
// Endpoint = metastore URI + database + table + partition values
HiveEndPoint hiveEP = new HiveEndPoint("thrift://x.y.com:9083", dbName, tblName, partitionVals);

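If the table has several partition columns, put their values into the partition list before creating the endpoint. The metastore service must be running so that Hive's Thrift service is available.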
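As a rough illustration of how the partition values line up with the partition columns, the sketch below pairs them by position and prints the conventional `col=value/col=value` form. `PartitionSpecSketch` is a hypothetical helper and not part of the Hive API; it also ignores the escaping that special characters in partition values would need.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: PartitionSpecSketch is a hypothetical helper, not part of the Hive API.
final class PartitionSpecSketch {

    // Pairs each partition column with the value at the same position and joins
    // the pairs in the usual "col=value/col=value" form.
    static String toPartitionPath(List<String> columns, List<String> values) {
        if (columns.size() != values.size()) {
            throw new IllegalArgumentException(
                    "expected " + columns.size() + " partition values, got " + values.size());
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) {
                path.append('/');
            }
            path.append(columns.get(i)).append('=').append(values.get(i));
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Columns from the alerts DDL, values from the partition list used for the endpoint
        System.out.println(toPartitionPath(
                Arrays.asList("continent", "country"),
                Arrays.asList("Asia", "India")));
    }
}
```

The point of the size check is that a partition list with too few or too many values does not describe a single partition of the table.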

//-------   Thread 1  -------//
// Column names of the alerts table, in the order used by each delimited record
String[] fieldNames = new String[] { "id", "msg" };
StreamingConnection connection = hiveEP.newConnection(true);
DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 1 - First TXN
txnBatch.beginNextTransaction();
txnBatch.write("1,Hello streaming".getBytes());
txnBatch.write("2,Welcome to streaming".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 1 - Second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("3,Roshan Naik".getBytes());
    txnBatch.write("4,Alan Gates".getBytes());
    txnBatch.write("5,Owen O’Malley".getBytes());
    txnBatch.commit();
}
// Close the first batch but keep the connection open: it is reused for batch 2
txnBatch.close();

txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 2 - First TXN
txnBatch.beginNextTransaction();
txnBatch.write("6,David Schorow".getBytes());
txnBatch.write("7,Sushant Sowmyan".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 2 - Second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("8,Ashutosh Chauhan".getBytes());
    txnBatch.write("9,Thejas Nair".getBytes());
    txnBatch.commit();
}
txnBatch.close();

// All batches are done, so the connection can be closed now
connection.close();
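In the flow above, each begin/commit pair covers a small group of writes. A minimal sketch of planning that grouping for a longer list of records is shown below. `TxnChunkSketch` is a hypothetical helper that only splits the list; it does not call the Hive streaming API, and the group size of 2 is an arbitrary choice for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: TxnChunkSketch is a hypothetical helper that plans the grouping;
// it does not call the Hive streaming API.
final class TxnChunkSketch {

    // Splits records into consecutive groups of at most recordsPerTxn entries.
    static <T> List<List<T>> chunk(List<T> records, int recordsPerTxn) {
        if (recordsPerTxn <= 0) {
            throw new IllegalArgumentException("recordsPerTxn must be positive");
        }
        List<List<T>> groups = new ArrayList<>();
        for (int from = 0; from < records.size(); from += recordsPerTxn) {
            int to = Math.min(from + recordsPerTxn, records.size());
            groups.add(new ArrayList<>(records.subList(from, to)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "1,Hello streaming", "2,Welcome to streaming",
                "3,Roshan Naik", "4,Alan Gates", "5,Owen O'Malley");
        int txn = 0;
        for (List<String> group : chunk(rows, 2)) {
            // In real code: begin the next transaction, write each row, then commit.
            System.out.println("txn " + (++txn) + " -> " + group);
        }
    }
}
```

If the number of groups exceeds the number of transactions in a batch, a new batch would have to be fetched from the connection, as the second batch in the example does.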

Next, we write the Streaming data into the ORC table for storage. The result is shown in the figure below:

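3. Case Study

Now let's work through a complete case. Consider this scenario: every day a large amount of business data is reported to a designated server; a relay service splits the data by business line and forwards it to the corresponding log nodes; an ETL service then loads the data into Hive tables. Here we only cover the last step: take the data, process it, and load it into a Hive ORC table. The implementation is as follows: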

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.hcatalog.streaming.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// JSON / JSONObject are assumed to come from fastjson (parseObject / getString)
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;

// SystemConfigUtils, FileUtils and CalendarUtils are the author's own helpers, not shown in this post

/**
 * @Date Nov 24, 2016
 *
 * @Author smartloli
 *
 * @Email smartdengjie@gmail.com
 *
 * @Note TODO
 */
public class IPLoginStreaming extends Thread {

    private static final Logger LOG = LoggerFactory.getLogger(IPLoginStreaming.class);

    private String path = "";

    public static void main(String[] args) throws Exception {
        // One thread per configured input path
        String[] paths = SystemConfigUtils.getPropertyArray("hive.orc.path", ",");
        for (String str : paths) {
            IPLoginStreaming ipLogin = new IPLoginStreaming();
            ipLogin.path = str;
            ipLogin.start();
        }
    }

    @Override
    public void run() {
        List<String> list = FileUtils.read(this.path);
        long start = System.currentTimeMillis();
        try {
            write(list);
        } catch (Exception e) {
            // Pass the exception itself so the stack trace is logged as well
            LOG.error("Write PATH[" + this.path + "] ORC has error,msg is " + e.getMessage(), e);
        }
        System.out.println("Path[" + this.path + "] spent [" + (System.currentTimeMillis() - start) / 1000.0 + "s]");
    }

    public static void write(List<String> list)
            throws ConnectionError, InvalidPartition, InvalidTable, PartitionCreationFailed, ImpersonationFailed, InterruptedException, ClassNotFoundException, SerializationError, InvalidColumn, StreamingException {
        String dbName = "default";
        String tblName = "ip_login_orc";
        // The table has a single partition column: the current day
        ArrayList<String> partitionVals = new ArrayList<String>(1);
        partitionVals.add(CalendarUtils.getDay());
        String[] fieldNames = new String[] { "_bpid", "_gid", "_plat", "_tm", "_uid", "ip", "latitude", "longitude", "reg", "tname" };
        StreamingConnection connection = null;
        TransactionBatch txnBatch = null;
        try {
            HiveEndPoint hiveEP = new HiveEndPoint("thrift://master:9083", dbName, tblName, partitionVals);
            HiveConf hiveConf = new HiveConf();
            hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_HADOOP_SUPPORTS_SUBDIRECTORIES, true);
            hiveConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            connection = hiveEP.newConnection(true, hiveConf);
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
            txnBatch = connection.fetchTransactionBatch(10, writer);

            // Batch 1: all records of this file go into a single transaction
            txnBatch.beginNextTransaction();
            for (String json : list) {
                JSONObject object = JSON.parseObject(json);
                // Build one comma-delimited row, with the values in fieldNames order
                StringBuilder ret = new StringBuilder();
                for (int i = 0; i < fieldNames.length; i++) {
                    if (i > 0) {
                        ret.append(",");
                    }
                    ret.append(object.getString(fieldNames[i]));
                }
                txnBatch.write(ret.toString().getBytes(StandardCharsets.UTF_8));
            }
            txnBatch.commit();
        } finally {
            if (txnBatch != null) {
                txnBatch.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }
}
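The row-building step in the case above can be isolated and checked without a Hive cluster. The sketch below is one possible variant, with a plain `Map` standing in for the parsed JSON object. `RowFormatSketch` and `toDelimitedRow` are hypothetical names; unlike the original loop, this variant writes an empty string for a missing field and rejects values that contain the delimiter, which are design choices of the sketch rather than behaviour of the Hive writer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: RowFormatSketch is a hypothetical helper; a plain Map stands in
// for the parsed JSON object.
final class RowFormatSketch {

    // Emits the values in fieldNames order, separated by the delimiter.
    // Missing fields become empty strings; values containing the delimiter are
    // rejected because they would shift the columns of the row.
    static String toDelimitedRow(Map<String, String> record, String[] fieldNames, String delimiter) {
        String[] values = new String[fieldNames.length];
        for (int i = 0; i < fieldNames.length; i++) {
            String value = record.get(fieldNames[i]);
            if (value == null) {
                value = "";
            }
            if (value.contains(delimiter)) {
                throw new IllegalArgumentException(
                        "field " + fieldNames[i] + " contains the delimiter: " + value);
            }
            values[i] = value;
        }
        return String.join(delimiter, values);
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("ip", "10.0.0.1");   // sample values, made up for the demo
        record.put("_uid", "42");
        String[] fieldNames = {"_uid", "ip", "reg"};
        System.out.println(toDelimitedRow(record, fieldNames, ","));
    }
}
```

Checking for the delimiter up front matters because a stray comma inside a value would silently move every later column one position to the right.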

PS: using multiple threads to process the data is recommended.
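One way to follow this advice without starting an unbounded number of threads is a fixed-size pool. The sketch below is an assumption about how that could look, not the author's implementation: `ParallelLoadSketch`, `loadAll` and the sample paths are hypothetical, and the `loader` function stands in for the per-path work (reading the file and writing it to Hive).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative only: ParallelLoadSketch is a hypothetical helper, not the author's code.
final class ParallelLoadSketch {

    // Runs loader once per path on a fixed-size pool and returns the results in path order.
    static <R> List<R> loadAll(List<String> paths, int threads, Function<String, R> loader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<R> task = () -> loader.apply(path);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // waits until this path has been processed
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("loading failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical paths; the loader here only reports the path length.
        List<String> paths = Arrays.asList("/logs/a", "/logs/bb");
        System.out.println(loadAll(paths, 2, path -> path.length()));
    }
}
```

Compared with one thread per path, the pool caps how many connections to the metastore are open at the same time, and a failure in any path surfaces to the caller instead of only being logged.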

4. Preview

The results are shown below:

Partition details

Number of records in this partition

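5. Summary

When using Hive Streaming to append data to ORC, the table itself has to meet the requirements above (bucketed and, in this example, partitioned), and the project's dependencies are complex as well: they involve packages from Hadoop, Hive and other projects. A Maven project is recommended, so that Maven resolves the dependencies between the JARs for us.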

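6. Closing Remarks

That's all for this post. If you run into problems while studying this topic, you can join the discussion group or send me an email, and I will do my best to answer. Let's keep improving together!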
