ORCFILE IN HDP 2: BETTER COMPRESSION, BETTER PERFORMANCE

by

Carter Shanklin
 

The upcoming Hive 0.12 is set to bring some great new advancements in the storage layer in the forms of higher compression and better query performance.

HIGHER COMPRESSION

ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding.

This focus on efficiency leads to some impressive compression ratios. This picture shows the sizes of the TPC-DS dataset at Scale 500 in various encodings. This dataset contains randomly generated data including strings, floating point and integer data.

We’ve already seen customers whose clusters are maxed out from a storage perspective moving to ORCFile as a way to free up space while being 100% compatible with existing jobs.

Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process can play along seamlessly. Hive 12 builds on these impressive compression ratios and delivers deep integration at the Hive and execution layers to accelerate queries, both from the point of view of dealing with larger datasets and lower latencies.

PREDICATE PUSHDOWN

SQL queries will generally have some number of WHERE conditions which can be used to easily eliminate rows from consideration. In older versions of Hive, rows are read out of the storage layer before being later eliminated by SQL processing. There’s a lot of wasteful overhead and Hive 12 optimizes this by allowing predicates to be pushed down and evaluated in the storage layer itself. It’s controlled by the setting hive.optimize.ppd=true.

This requires a reader that is smart enough to understand the predicates. Fortunately ORC has had the corresponding improvements to allow predicates to be pushed into it, and takes advantages of its inline indexes to deliver performance benefits.

For example if you have a SQL query like:

SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER.state = ‘CA’;

The ORCFile reader will now only return rows that actually match the WHERE predicates and skip customers residing in any other state. The more columns you read from the table, the more data marshaling you avoid and the greater the speedup.

A WORD ON ORCFILE INLINE INDEXES

Before we move to the next section we need to spend a moment talking about how ORCFile breaks rows into row groups and applies columnar compression and indexing within these row groups.

TURNING PREDICATE PUSHDOWN TO 11

ORC’s Predicate Pushdown will consult the Inline Indexes to try to identify when entire blocks can be skipped all at once. Some times your dataset will naturally facilitate this. For instance if your data comes as a time series with a monotonically increasing timestamp, when you put a where condition on this timestamp, ORC will be able to skip a lot of row groups.

In other instances you may need to give things a kick by sorting data. If a column is sorted, relevant records will get confined to one area on disk and the other pieces will be skipped very quickly.

Skipping works for number types and for string types. In both instances it’s done by recording a min and max value inside the inline index and determining if the lookup value falls outside that range.

Sorting can lead to very nice speedups. There is a trade-off in that you need to decide what columns to sort on in advance. The decision making process is somewhat similar to deciding what columns to index in traditional SQL systems. The best payback is when you have a column that is frequently used and accessed with very specific conditions and is used in a lot of queries. Remember that you can force Hive to sort on a column by using the SORT BY keyword when creating the table and setting hive.enforce.sorting to true before inserting into the table.

ORCFile is an important piece of our Stinger Initiative to improve Hive performance 100x. To show the impact we ran a modified TPC-DS Query 27 query with a modified data schema. Query 27 does a star schema join on a large fact table, accessing 4 separate dimension tables. In the modified schema, the state in which the sale is made is denormalized into the fact table and the resulting table is sorted by state. In this way, when the query scans the fact table, it can skip entire blocks of rows because the query filters based on the state. This results in some incremental speedup as you can see from the chart below.

This feature gives you the best bang for the buck when:

  1. You frequently filter a large fact table in a precise way on a column with moderate to large cardinality.
  2. You select a large number of columns, or wide columns. The more data marshaling you save, the greater your speedup will be.

USING ORCFILE

Using ORCFile or converting existing data to ORCFile is simple. To use it just add STORED AS orc to the end of your create table statements like this:

CREATE TABLE mytable (
...
) STORED AS orc;

To convert existing data to ORCFile create a table with the same schema as the source table plus stored as orc, then you can use issue a query like:

INSERT INTO TABLE orctable SELECT * FROM oldtable;

Hive will handle all the details of conversion to ORCFile and you are free to delete the old table to free up loads of space.

When you create an ORC table there are a number of table properties you can use to further tune the way ORC works.

Key Default Notes
orc.compress ZLIB Compression to use in addition to columnar compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size 262,144 (= 256KiB) Number of bytes in each compression chunk
orc.stripe.size 268,435,456 (=256 MiB) Number of bytes in each stripe
orc.row.index.stride 10,000 Number of rows between index entries (must be >= 1,000)
orc.create.index true Whether to create inline indexes

For example let’s say you wanted to use snappy compression instead of zlib compression. Here’s how:

CREATE TABLE mytable (
...
) STORED AS orc tblproperties ("orc.compress"="SNAPPY");

TRY IT OUT

All these features are available in our HDP 2 Beta and we encourage you to download, try them out and give us your feedback.

ORCFILE IN HDP 2: BETTER COMPRESSION, BETTER PERFORMANCE的更多相关文章

  1. 译:ORCFILE IN HDP 2:更好的压缩,更高的性能

    原文地址: https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/ ORCFILE I ...

  2. MongoDB 3.0 WiredTiger Compression and Performance

    MongoDB3.0中的压缩选项 在MongoDB 3.0中,WiredTiger为集合提供三个压缩选项: 无压缩 Snappy(默认启用) – 很不错的压缩,有效利用资源 zlib(类似gzip) ...

  3. SolrPerformanceFactors--官方文档

    原文地址:http://wiki.apache.org/solr/SolrPerformanceFactors Contents Schema Design Considerations indexe ...

  4. 官方文档 恢复备份指南六 Configuring the RMAN Environment: Advanced Topics

    RMAN高级设置. 本章内容: Configuring Advanced Channel Options  高级通道选项 Configuring Advanced Backup Options 高级备 ...

  5. Linux中ext2文件系统的结构

    1.ext2产生的历史 最早的Linux内核是从MINIX系统过渡发展而来的.Linux最早的文件系统就是MINIX文件系统.MINIX文件系统几乎到处都是bug,采用的是16bit偏移量,最大容量为 ...

  6. oracle 表压缩技术

    压缩表是我们维护管理中常常会用到的.以下我们看都oracle给我们提供了哪些压缩方式. 文章摘自"Oracle® Database Administrator's Guide11g Rele ...

  7. mongodb压缩——snappy、zlib块压缩,btree索引前缀压缩

    MongoDB 3.0 WiredTiger Compression and Performance One of the most exciting developments over the li ...

  8. 3.4-3.6 Hive Storage Format

    一.file format ORCFile在HDP 2:更好的压缩,更好的性能: https://zh.hortonworks.com/blog/orcfile-in-hdp-2-better-com ...

  9. HIVE的几种优化

    5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER 今天看了一篇[文章] (http://zh.hortonworks.com/blog/5-ways-make-h ...

随机推荐

  1. redis实现高并发下的抢购/秒杀功能

    之前写过一篇文章,高并发的解决思路(点此进入查看),今天再次抽空整理下实际场景中的具体代码逻辑实现吧:抢购/秒杀是如今很常见的一个应用场景,那么高并发竞争下如何解决超抢(或超卖库存不足为负数的问题)呢 ...

  2. MyBatis学习笔记(二) Executor

    一.概述 当我们打开一个SqlSession的时候,我们就完成了操作数据库的第一步,那MyBatis是如何执行Sql的呢?其实MyBatis的增删改查都是通过Executor执行的,Executor和 ...

  3. Android Service解析

    Android Service是一个可以在后台执行长时间运行操作而不提供用户界面的应用组件,它分为两种工作状态,一种是启动状态,主要用于执行后台计算:另一种是绑定状态,主要用于其他组件和Service ...

  4. 异步是javascript的精髓

    最近做了一个智能家居的APP,目前纯JS代码已经4000多行,不包括任何引入的库.还在不断升级改造中...这个项目到处都是异步.大多数都是3-4层调用.给我的感觉就是异步当你习惯了,你会发现很爽.下面 ...

  5. C# 程序运行进度显示Lable

    public void test() { Thread.Sleep(); string vvv = ""; ; i < ;i++ ) { vvv = vvv +i.ToStr ...

  6. es6 语法 (set 和 map)

    { let list = new Set(); list.add(5); list.add(7); console.log('size', list, list.size); //{5, 7} 2 } ...

  7. TS学习随笔(二)->类型推论,联合类型

    这篇内容指南:        -----类型推论  -----联合类型 类型推论 第一篇中我们看了TS的基本使用和基本数据类型的使用,知道了变量在使用的时候都得加一个类型,那我们可不可以不加呢,这个嘛 ...

  8. 13张动图助你彻底看懂马尔科夫链、PCA和条件概率!

    13张动图助你彻底看懂马尔科夫链.PCA和条件概率! https://mp.weixin.qq.com/s/ll2EX_Vyl6HA4qX07NyJbA [ 导读 ] 马尔科夫链.主成分分析以及条件概 ...

  9. iOS----------关于UDID和UUID的一些理解

    一.UDID(Unique Device Identifier)  UDID是Unique Device Identifier的缩写,中文意思是设备唯一标识. 在很多需要限制一台设备一个账号的应用中经 ...

  10. PMS 修改禅道默认首页元素及展示

    修改禅道默认首页元素及展示 by:授客 QQ:1033553122 测试环境: 禅道项目管理软件ZenTaoPMS.9.5.1.win64 需求描述 如下,安装禅道后访问默认首页,展示如下,我们希望它 ...