Short Description:

ORC Creation Best Practices with examples and references.

Article

Synopsis.

ORC is a columnar storage format for Hive.

This document explains how the way ORC data files are created can improve read/scan performance when querying the data. The Tez execution engine provides several ways to optimize a query, but it performs best with correctly created ORC files.

ORC Creation Strategy.

Example:

  CREATE [EXTERNAL] TABLE OrcExampleTable
  (clientid int, name string, address string, age int)
  STORED AS ORC
  TBLPROPERTIES (
    "orc.compress"="ZLIB",
    "orc.compress.size"="262144",
    "orc.create.index"="true",
    "orc.stripe.size"="268435456",
    "orc.row.index.stride"="3000",
    "orc.bloom.filter.columns"="clientid,age,name");

How data should be ingested into Hive tables heavily depends on usage patterns. To make queries run efficiently, ORC files should be created to support those patterns.

  • Identify the most important/frequent queries that will run against your data set (based on filter or JOIN conditions)
  • Configure an optimal data file size
  • Configure stripe and stride sizes
  • Distribute and sort the data during ingestion
  • Run ANALYZE TABLE regularly to keep statistics up to date

Usage Patterns.

Filters appear mainly in the WHERE clause and in JOIN … ON conditions. Information about which fields are used in filters should drive the choice of strategy for ORC file creation.

Example:

  select * from orcexampletable
  where clientid=100 and age between 25 and 45;
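For this pattern, the table DDL and the ingestion query can be aligned with the filter columns. Here is a sketch; the staging table name is an assumption, and the property values are illustrative:

```sql
-- Bloom-filter the two columns used in the WHERE clause above.
CREATE TABLE orcexampletable
  (clientid int, name string, address string, age int)
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="clientid,age");

-- Ingest sorted by the most selective filter column, so stripe and
-- stride min/max ranges stay narrow. staging_table is hypothetical.
INSERT INTO orcexampletable
SELECT clientid, name, address, age
FROM staging_table
SORT BY clientid;
```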

Does size matter?

As is well known, small files are a pain in HDFS, and ORC files are no exception; if anything, they are worse.

First of all, small files impact NameNode memory and performance. But even more important is query response time. If ingestion jobs generate small files, the total number of files will be large.

When a query is submitted, Tez needs information about the files in order to build an execution plan and allocate resources from YARN.

So, before the Tez engine starts a job:

  • Tez gets information from HCatalog about the table location and partition keys. Based on this, Tez has the exact list of directories (and subdirectories) where data files can be found.
  • Tez reads the ORC footers and stripe-level indexes in each file to determine how many blocks of data it will need to process. This is where a large number of files hurts job submission time.
  • Tez requests containers based on the number of input splits. Again, small files leave less flexibility in configuring the input split size, and as a result a larger number of containers must be allocated.

Note: if the query submission stage times out, check the number of ORC files (also see below how the ORC split strategy, ETL vs. BI, can affect query submission time).
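The split strategy mentioned in the note can be switched per session. A sketch (HYBRID is typically the default; treat exact defaults as version-dependent):

```sql
-- ETL:    read file footers while computing splits (accurate splits,
--         but slower job submission when there are many files).
-- BI:     generate splits per file without reading footers (fast submit).
-- HYBRID: choose between the two based on file count and sizes.
SET hive.exec.orc.split.strategy=BI;
```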

There is always a trade-off between ingestion and query performance. Keep the number of ORC files created to a minimum, while still meeting an acceptable level of ingestion performance and data latency.

For transactional data ingested continuously during the day, set up a daily table/partition rebuild process to optimize the number of files and the data distribution.
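One way to implement such a rebuild is sketched below; the table name, partition column, and date are assumptions:

```sql
-- Rewrite yesterday's partition into a few large, sorted ORC files.
-- Table name, partition column, and date are illustrative.
INSERT OVERWRITE TABLE mytable PARTITION (ds='2017-01-01')
SELECT clientid, name, address, age
FROM mytable
WHERE ds='2017-01-01'
SORT BY clientid;

-- Alternatively, merge small ORC files in place (a stripe-level merge,
-- without re-sorting rows):
ALTER TABLE mytable PARTITION (ds='2017-01-01') CONCATENATE;
```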

Stripes and Strides.

ORC files are splittable at the stripe level. Stripe size is configurable and should depend on the average record length (size) and on how many unique values the sorted fields can have. If the search-by field is unique (or almost unique), decrease the stripe size; if it is heavily repeated, increase it. While the default is 64 MB, keep the stripe size between 1/4 of the block size and 4 block sizes (the default ORC block size is 256 MB). Along with that, you can tune the input split size per job to decrease the number of containers required. Sometimes it is even worth reconsidering the HDFS block size (the default HDFS block size is 128 MB).
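The per-job input split size mentioned above can be tuned through the Tez grouping parameters. A sketch; the byte values are illustrative:

```sql
-- Tez groups input splits between these byte bounds; larger groups
-- mean fewer, bigger tasks and fewer containers.
SET tez.grouping.min-size=268435456;    -- 256 MB
SET tez.grouping.max-size=1073741824;   -- 1 GB
```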

A stride is a set of records for which a range index (min/max and some additional stats) is created. For the stride size (number of records, default 10,000): with unique (or close to unique) value combinations of the bloom-filter fields, go with 3,000-7,000 records; with non-unique ones, 7,000-15,000 records or even more. If the bloom filter contains unsorted fields, that also argues for a smaller number of records per stride.

A bloom filter can be used on the sorted field in combination with additional fields that may participate in the search-by clause.
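These stride and bloom-filter choices map onto table properties like the following. A sketch; the table name and concrete values are illustrative:

```sql
CREATE TABLE clients_orc (clientid int, age int, name string)
STORED AS ORC
TBLPROPERTIES (
  "orc.row.index.stride"="5000",            -- smaller stride: near-unique sorted key
  "orc.bloom.filter.columns"="clientid,age",
  "orc.bloom.filter.fpp"="0.05");           -- bloom filter false-positive rate
```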

Sorting and Distribution.

The most important factor for efficient search within a data set is how that set is stored.

Since Tez utilizes ORC file-level information (min/max range index per field, bloom filters, etc.), it is important that those ranges point as precisely as possible to the exact blocks of data containing the desired values.

Here is an example: with unsorted data, Tez will request 4 containers and perform up to a full table scan, while with sorted data it needs only a single container for a single stripe read.

The best result can be achieved by globally sorting the data in a table (or partition).

Global sorting in Hive (ORDER BY) forces a single reducer to sort the final data set, which can be inefficient. That is where DISTRIBUTE BY comes to help.

For example, say we have a daily partition of 200 GB and a field "clientid" that we would like to sort by. Assuming we have enough capacity (cores) to run 20 parallel reducers, we can:

1. Limit number of reducers to 20 (mapred.reduce.tasks)

2. Distribute all the records to 20 reducers equally:

  insert into …
  select … from …
  distribute by floor((clientid - <min(clientid)>) / ((<max(clientid)> - <min(clientid)> + 1) / 20))
  sort by clientid;
  • Note: this works well if client ID values are distributed evenly between the min and max values. If that is not the case, find a better distribution function, but ensure that the value ranges going to different reducers do not overlap.

3. Alternatively, use Pig to ORDER BY clientid (with PARALLEL 20).
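Putting steps 1 and 2 together, a sketch; the source table name and the min/max literals are assumptions, to be obtained first with SELECT min(clientid), max(clientid):

```sql
SET mapred.reduce.tasks=20;

-- Assume min(clientid)=1 and max(clientid)=1000000 were computed first.
-- Each reducer then receives one non-overlapping range of 50000 ids,
-- since floor((clientid - 1) / 50000) yields bucket numbers 0..19.
INSERT INTO orcexampletable
SELECT clientid, name, address, age
FROM src
DISTRIBUTE BY floor((clientid - 1) / ((1000000 - 1 + 1) / 20))
SORT BY clientid;
```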

Usage.

There is a good article on query optimization:

https://community.hortonworks.com/articles/68631/optimizing-hive-queries-for-orc-formatted-tables.html

I would only add the following items to consider when working with ORC:

Analyze table.

Once the data is ingested and ready, run:

  analyze table t [partition (p)] compute
  statistics [for columns c, ...];

Refer to https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive for more details.

Note: the ANALYZE statement is time consuming. The more columns you analyze, the longer it takes to complete.
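To check which statistics are present for a column afterwards (per the Hive documentation linked above):

```sql
-- Shows column-level statistics (min/max, distinct values, nulls)
-- gathered by ANALYZE for the clientid column.
DESCRIBE FORMATTED orcexampletable clientid;
```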

Let me know if you have more tips in this area!
