译:ORCFILE IN HDP 2:更好的压缩,更高的性能
原文地址:
https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
ORCFILE IN HDP 2: BETTER COMPRESSION, BETTER PERFORMANCE
Carter Shanklin
ORCFILE IN HDP 2:更好的压缩,更高的性能
The upcoming Hive 0.12 is set to bring some great new advancements in the storage layer in the forms of higher compression and better query performance.
HIGHER COMPRESSION
ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding.
This focus on efficiency leads to some impressive compression ratios. This picture shows the sizes of the TPC-DS dataset at Scale 500 in various encodings. This dataset contains randomly generated data including strings, floating point and integer data.

We’ve already seen customers whose clusters are maxed out from a storage perspective moving to ORCFile as a way to free up space while being 100% compatible with existing jobs.
Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process can play along seamlessly. Hive 12 builds on these impressive compression ratios and delivers deep integration at the Hive and execution layers to accelerate queries, both from the point of view of dealing with larger datasets and lower latencies.
PREDICATE PUSHDOWN
SQL queries will generally have some number of WHERE conditions which can be used to easily eliminate rows from consideration. In older versions of Hive, rows are read out of the storage layer before being later eliminated by SQL processing. There’s a lot of wasteful overhead and Hive 12 optimizes this by allowing predicates to be pushed down and evaluated in the storage layer itself. It’s controlled by the setting hive.optimize.ppd=true.
This requires a reader that is smart enough to understand the predicates. Fortunately ORC has had the corresponding improvements to allow predicates to be pushed into it, and takes advantages of its inline indexes to deliver performance benefits.
For example if you have a SQL query like:
SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER.state = ‘CA’;
The ORCFile reader will now only return rows that actually match the WHERE predicates and skip customers residing in any other state. The more columns you read from the table, the more data marshaling you avoid and the greater the speedup.
A WORD ON ORCFILE INLINE INDEXES
Before we move to the next section we need to spend a moment talking about how ORCFile breaks rows into row groups and applies columnar compression and indexing within these row groups.

TURNING PREDICATE PUSHDOWN TO 11
ORC’s Predicate Pushdown will consult the Inline Indexes to try to identify when entire blocks can be skipped all at once. Some times your dataset will naturally facilitate this. For instance if your data comes as a time series with a monotonically increasing timestamp, when you put a where condition on this timestamp, ORC will be able to skip a lot of row groups.
In other instances you may need to give things a kick by sorting data. If a column is sorted, relevant records will get confined to one area on disk and the other pieces will be skipped very quickly.
Skipping works for number types and for string types. In both instances it’s done by recording a min and max value inside the inline index and determining if the lookup value falls outside that range.
Sorting can lead to very nice speedups. There is a trade-off in that you need to decide what columns to sort on in advance. The decision making process is somewhat similar to deciding what columns to index in traditional SQL systems. The best payback is when you have a column that is frequently used and accessed with very specific conditions and is used in a lot of queries. Remember that you can force Hive to sort on a column by using the SORT BY keyword when creating the table and setting hive.enforce.sorting to true before inserting into the table.
ORCFile is an important piece of our Stinger Initiative to improve Hive performance 100x. To show the impact we ran a modified TPC-DS Query 27 query with a modified data schema. Query 27 does a star schema join on a large fact table, accessing 4 separate dimension tables. In the modified schema, the state in which the sale is made is denormalized into the fact table and the resulting table is sorted by state. In this way, when the query scans the fact table, it can skip entire blocks of rows because the query filters based on the state. This results in some incremental speedup as you can see from the chart below.

This feature gives you the best bang for the buck when:
- You frequently filter a large fact table in a precise way on a column with moderate to large cardinality.
- You select a large number of columns, or wide columns. The more data marshaling you save, the greater your speedup will be.
USING ORCFILE
Using ORCFile or converting existing data to ORCFile is simple. To use it just add STORED AS orc to the end of your create table statements like this:
CREATE TABLE mytable (
...
) STORED AS orc;
To convert existing data to ORCFile create a table with the same schema as the source table plus stored as orc, then you can use issue a query like:
INSERT INTO TABLE orctable SELECT * FROM oldtable;
Hive will handle all the details of conversion to ORCFile and you are free to delete the old table to free up loads of space.
When you create an ORC table there are a number of table properties you can use to further tune the way ORC works.
| Key | Default | Notes |
orc.compress |
ZLIB |
Compression to use in addition to columnar compression (one of NONE, ZLIB, SNAPPY) |
orc.compress.size |
262,144 (= 256KiB) |
Number of bytes in each compression chunk |
| orc.stripe.size | 268,435,456 (=256 MiB) |
Number of bytes in each stripe |
orc.row.index.stride |
10,000 |
Number of rows between index entries (must be >= 1,000) |
orc.create.index |
true |
Whether to create inline indexes |
For example let’s say you wanted to use snappy compression instead of zlib compression. Here’s how:
CREATE TABLE mytable (
...
) STORED AS orc tblproperties ("orc.compress"="SNAPPY");
TRY IT OUT
All these features are available in our HDP 2 Beta and we encourage you to download, try them out and give us your feedback.
译:ORCFILE IN HDP 2:更好的压缩,更高的性能的更多相关文章
- ORCFILE IN HDP 2: BETTER COMPRESSION, BETTER PERFORMANCE
ORCFILE IN HDP 2: BETTER COMPRESSION, BETTER PERFORMANCE by Carter Shanklin The upcoming Hive 0.12 ...
- 价格更低、SLA 更强的全新 Azure SQL 数据库服务等级将于 9 月正式发布
继上周公告之后,很高兴向大家宣布更多好消息,作为我们更广泛的数据平台的一部分, 我们将在 Azure 上提供丰富的在线数据服务.9 月,我们将针对 Azure SQL 数据库推出新的服务等级.Azur ...
- Quality Over Quantity: 更少一些,更好一些_第1页_福布斯中文网
Quality Over Quantity: 更少一些,更好一些_第1页_福布斯中文网 Quality Over Quantity: 更少一些,更好一些 2013年04月09日 ...
- Clear Linux 为脚本语言提供更高的性能
导读 Clear Linux的领先性能不仅限于C/C++应用程序,而且PHP,R和Python等脚本语言也有很大的提升速度.在一篇新的博客文章中,英特尔的一位开发人员概述了他们对Python的一些性能 ...
- IntelliJ IDEA 2019.2最新解读:性能更好,体验更优,细节处理更完美!
idea 2019.2 准备 idea 2019.2正式版是在2019年7月24号发布的,本篇文章,我将根据官方博客以及自己的理解来进行说明,总体就是:性能更好,体验更优,细节处理更完美! 支持jdk ...
- vue3.0和2.0的区别,Vue-cli3.0于 8月11日正式发布,更快、更小、更易维护、更易于原生、让开发者更轻松
vue3.0和2.0的区别Vue-cli3.0于 8月11日正式发布,看了下评论,兼容性不是很好,命令有不少变化,不是特别的乐观vue3.0 的发布与 vue2.0 相比,优势主要体现在:更快.更小. ...
- Nvidia发布更快、功耗更低的新一代图形加速卡
导读 不出意外的,Nvidia在其举行的Supercomputing 19大会上公布了很多新闻,这些我们将稍后提到.但被忽略的一条或许是其中最有趣的:一张更快.功耗更低的新一代图形加速卡. 多名与会者 ...
- 玩转 .NET Core 3.0:逐浪CMS新版发布,建站更简单、网站更安全
2019年11月11日,在大家都忙于网上体会“双11 ”的热闹气氛的时候,逐浪CMS开发者团队正在做着新版本发布的最后工作.此次更新是基本于 .NET Core 3.0开发,也是全国首个基于 .NET ...
- 会议更流畅,表情更生动!视频生成编码 VS 国际最新 VVC 标准
阿里云视频云的标准与实现团队与香港城市大学联合开发了基于 AI 生成的人脸视频压缩体系,相比于 VVC 标准,两者质量相当时可以取得 40%-65% 的码率节省,旨在用最前沿的技术,普惠视频通话.视频 ...
随机推荐
- 性能是.NET Core的一个关键特性
关键要点1).NET Core是跨平台的,可运行在Windows.Linux.Mac OS X和更多平台上:与.NET相比,发布周期要短得多.大多数.NET Core都是通过NuGet软件包交付的,可 ...
- 一文理解JS的节流、防抖及使用场景
函数防抖(debounce):在事件被触发n秒后再执行回调,如果在这n秒内又被触发,则重新计时. 看一个
- AngularJS学习 之 安装
1. 安装好Node.js 2. 安装好Git 3. 安装好Yeoman 以管理员身份打开cmd 输入 npm install -g yo 回车即可开始安装Yeoman,具体的安装行为最好看官网的介绍 ...
- opencv3.2.0形态学滤波之开运算、闭运算
/* 一.开运算: (1)开运算,其实就是先腐蚀后膨胀的过程. (2)数学表达式:dst = open(src,element) = dilate(erode(src,element)) (3)作用: ...
- EventBus 3.0源码解析
现在网上讲解EventBus的文章大多数都是针对2.x版本的,比较老旧,本篇文章希望可以给大家在新版本上面带来帮助. EventBus 是专门为Android设计的用于订阅,发布总线的库,用到这个库的 ...
- 润乾报表新功能–导出excel支持锁定表头
在以往的报表设计中,锁定表头是会经常被用到的一个功能,这个功能不仅能使浏览的页面更加直观,信息对应的更加准确,而且也提高了报表的美观程度.但是,很多客户在将这样的报表导出excel时发现exce ...
- Android深入理解Context(二)Activity和Service的Context创建过程
前言 上一篇文章我们学习了Context关联类和Application Context的创建过程,这一篇我们接着来学习Activity和Service的Context创建过程.需要注意的是,本篇的知识 ...
- 从Azure上构建Windows应用程序映像
从Azure上构建windows应用程序映像同构建Linux应用程序映像总体流程比较类似,可以参考上图Linux映像的制作发布等流程,具体细节又有所差别. 具体步骤如下: 从Azure管理平台上申请W ...
- ubuntu-15.04-desktop-amd64想要安装KDE桌面,结果出现如下问题
The following packages have unmet dependencies: kubuntu-desktop : Depends: ark but it is not going t ...
- Oracle GoldenGate DDL 详细说明 使用手册(较早资料)
一. 概述 DDL 相关的参数包括:DDL.DDLERROR.DDLOPTIONS.DDLSUBST.DDLTABLE.GGSCHEMA. PURGEDDLHISTORY.PURGEMARKERHIS ...