Optimizing Hive queries for ORC formatted tables
Short Description:
Hive configuration settings to optimize your HiveQL when querying ORC formatted tables.
Article
SYNOPSIS
The Optimized Row Columnar (ORC) file format is a columnar storage format for Hive. Specific Hive configuration settings for ORC formatted tables can improve query performance, resulting in faster execution and reduced usage of computing resources. Some of these settings may already be turned on by default, whereas others require some educated guesswork.
The table below compares Tez job statistics for the same Hive query submitted first without and then with certain configuration settings. Notice the performance gains with optimization. This article explains how those improvements were achieved.

QUERY EXECUTION
Source Data:
- 102,602,110 clickstream page view records across 5 days of data for multiple countries.
- The table is partitioned by date in the format YYYY-MM-DD.
- There are no indexes, and the table is not bucketed.
The HiveQL ranks each page per user by how many times the user viewed that page, for a specific date and within the United States. Breakdown of the query (a sketch of the HiveQL follows the breakdown):
- Scan all the page views for each user.
- Filter for page views on 1 date partition and only include traffic in the United States.
- For each user, rank each page in terms of how many times it was viewed by that user.
- For example, I view Page A 3 times and Page B once. Page A would rank 1 and Page B would rank 2.
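Here is a minimal sketch of such a query. The table name page_views_orc comes from the ANALYZE statement later in this article; the column names (user_id, page, dt, country) are assumptions, since the original HiveQL is not shown:

SELECT user_id,
       page,
       view_count,
       RANK() OVER (PARTITION BY user_id ORDER BY view_count DESC) AS page_rank
FROM (
    SELECT user_id, page, COUNT(*) AS view_count
    FROM page_views_orc
    WHERE dt = '2016-01-01'   -- one date partition
      AND country = 'US'      -- United States traffic only
    GROUP BY user_id, page
) pv;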
Without optimization: (Tez job statistics screenshot)

With optimization: (Tez job statistics screenshot)
Notice the change in reducers
- The final output size of all the reducers is 920 MB.
- For the first run, 73 reducers completed, resulting in 73 output files. This is excessive: 920 MB across 73 reducers is around 12.6 MB per reducer output. This is unnecessary overhead resulting in too many small files. More parallelism does not always equate to better performance.
- The second run launched 10 reducers, resulting in 10 output files. 920 MB across 10 reducers is about 92 MB per reducer output. Much less overhead, and we don't run into the small-files problem. The maximum number of files in HDFS depends on the amount of memory available in the NameNode: each block, file, and directory in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
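To put that 150-byte figure in perspective: a small file costs at least two NameNode objects (one for the file, one for its block), so 10 million small files consume roughly 10,000,000 × 2 × 150 bytes ≈ 3 GB of NameNode heap. Keeping reducer output files close to the HDFS block size avoids this overhead.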
OPTIMIZATION
- Always collect statistics on tables whose data changes frequently. Schedule an automated ETL job to recompute them at regular intervals:
ANALYZE TABLE page_views_orc COMPUTE STATISTICS FOR COLUMNS;
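For a table this size, the scheduled job can restrict the scan to newly arrived partitions rather than re-reading the whole table. A sketch, assuming the partition column is named dt:

ANALYZE TABLE page_views_orc PARTITION (dt='2016-01-01') COMPUTE STATISTICS FOR COLUMNS;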
- Run the Hive query with the following settings:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
- Partition your tables by date if you are storing a high volume of data per day. Table management becomes easier. You can easily drop partitions that are no longer needed or for which data has to be reprocessed.
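A minimal sketch of such a partitioned ORC table, reusing the assumed column names from the query sketch above (the SNAPPY compression property is illustrative, not required):

CREATE TABLE page_views_orc (
    user_id STRING,
    page    STRING,
    country STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Dropping an obsolete partition is a cheap metadata-level operation:
ALTER TABLE page_views_orc DROP IF EXISTS PARTITION (dt='2016-01-01');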
SUMMARY
Let’s look at each of the Hive settings.
- Enable predicate pushdown (PPD) to filter at the storage layer:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
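With both settings on, filters are evaluated by the ORC reader itself, which can skip entire stripes and row groups using the min/max statistics that ORC stores per column. Reusing the assumed columns from earlier:

SELECT user_id, page
FROM page_views_orc
WHERE dt = '2016-01-01'  -- partition pruning: only one partition is scanned
  AND country = 'US';    -- pushed to the ORC reader; stripes whose min/max exclude 'US' are skipped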
- Vectorized query execution processes data in batches of 1,024 rows instead of one row at a time:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
- Enable the Cost Based Optimizer (CBO) for efficient, cost-based query planning, and fetch table statistics:
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
Partition and column statistics are fetched from the metastore. Use this with caution: if you have too many partitions and/or columns, fetching the statistics could degrade performance.
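One visible payoff of hive.compute.query.using.stats: with current statistics, simple aggregations can be answered straight from the metastore. For example:

-- With fresh statistics, this returns without launching a Tez job:
SELECT COUNT(*) FROM page_views_orc;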
- Control reducer output:
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
This last set of parameters is important. The first run produced 73 output files, each around 12.6 MB in size. This is inefficient, as I explained earlier. With the above settings, we are essentially telling Hive an approximate maximum number of reducers to run, with the caveat that each reducer's output should be restricted to roughly 128 MB. Let's examine this:
- The parameter hive.tez.max.partition.factor tells Hive to launch up to 20 reducers. This is just a guess on my part, and Hive will not necessarily enforce it. My job completed with only 10 reducers and 10 output files.
- Since I set a value of 128 MB for hive.exec.reducers.bytes.per.reducer, Hive will try to fit the reducer output into files that come close to 128 MB each, rather than simply running 20 reducers.
- If I did not set hive.exec.reducers.bytes.per.reducer, then Hive would have launched 20 reducers, because my query output would have allowed for this. I tested this and 20 reducers ran.
- 128 MB is an approximation for each reducer's output when setting hive.exec.reducers.bytes.per.reducer. In this example the total size of the output files is 920 MB. Hive launched 10 reducers, which works out to about 92 MB per reducer file. When I set the value to 64 MB, Hive launched 20 reducers, with each file around 46 MB.
- If hive.exec.reducers.bytes.per.reducer is set too high, fewer reducers are launched, which can degrade performance just as too many can. You need just the right level of parallelism.
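As a rough sanity check of the numbers above (Tez estimates reducer counts from intermediate data sizes at runtime, which are typically larger than the final output, so the arithmetic is only approximate):
- 920 MB / 128 MB per reducer ≈ 8 estimated reducers; Hive settled on 10, about 92 MB per output file.
- 920 MB / 64 MB per reducer ≈ 15 estimated reducers; Hive ran all 20 permitted by hive.tez.max.partition.factor, about 46 MB per file.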