Original article:

https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

ORCFILE IN HDP 2: BETTER COMPRESSION, BETTER PERFORMANCE

Carter Shanklin

The upcoming Hive 0.12 is set to bring significant advances in the storage layer, in the form of higher compression and better query performance.

HIGHER COMPRESSION

ORCFile was introduced in Hive 0.11 and offers excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding.
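To make two of these techniques concrete, here is a minimal Python sketch of run-length encoding and dictionary encoding. It only illustrates the ideas; it is not ORC's actual on-disk format, and the function names and sample data are made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with a small integer id into a sorted dictionary."""
    dictionary = sorted(set(strings))
    ids = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [ids[s] for s in strings]


# Hypothetical column of state codes with long runs of repeated values.
states = ["CA", "CA", "CA", "NY", "NY", "CA"]
runs = run_length_encode(states)
dictionary, codes = dictionary_encode(states)
```

Both encodings pay off when a column contains many repeated values, which is common once data is laid out column by column.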

This focus on efficiency leads to some impressive compression ratios. The accompanying chart shows the sizes of the TPC-DS dataset at Scale 500 in various encodings. This dataset contains randomly generated data including strings, floating-point and integer data.

[Figure: Columnar format arranges columns adjacent within the file for compression and fast access]

We have already seen customers whose clusters are maxed out on storage move to ORCFile as a way to free up space while remaining 100% compatible with existing jobs.

Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process can play along seamlessly. Hive 0.12 builds on these impressive compression ratios and delivers deep integration at the Hive and execution layers to accelerate queries, both for handling larger datasets and for achieving lower latencies.

PREDICATE PUSHDOWN

SQL queries generally have some number of WHERE conditions that can be used to eliminate rows from consideration. In older versions of Hive, rows are read out of the storage layer and only later eliminated by SQL processing, which incurs a lot of wasteful overhead. Hive 0.12 optimizes this by allowing predicates to be pushed down and evaluated in the storage layer itself. It is controlled by the setting hive.optimize.ppd=true.

This requires a file reader that is smart enough to understand the predicates. Fortunately, ORC has received the corresponding improvements to allow predicates to be pushed into it, and it takes advantage of its inline indexes to deliver performance benefits.

For example, if you have a SQL query like:

SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER.state = 'CA';

The ORCFile reader will now return only the rows that actually match the WHERE predicates and skip customers residing in any other state. The more columns you read from the table, the more data marshaling you avoid and the greater the speedup.
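The difference between filtering after the read and filtering inside the reader can be sketched in a few lines of Python. This is a toy model under stated assumptions: the two functions and the sample rows are hypothetical, and the real ORC reader is far more involved. It only counts how many rows each approach hands over to the query engine.

```python
def read_then_filter(rows, predicate):
    """Older behaviour: every row leaves the storage layer, then gets filtered."""
    materialized = list(rows)  # all rows are handed to the query engine
    matches = [row for row in materialized if predicate(row)]
    return matches, len(materialized)


def read_with_pushdown(rows, predicate):
    """Pushdown: the reader applies the predicate and hands over only matches."""
    matches = [row for row in rows if predicate(row)]
    return matches, len(matches)


# Hypothetical CUSTOMER rows.
customers = [
    {"name": "Ann", "state": "CA"},
    {"name": "Bob", "state": "NY"},
    {"name": "Cid", "state": "CA"},
    {"name": "Dee", "state": "TX"},
]


def in_california(row):
    return row["state"] == "CA"


old_matches, old_handed_over = read_then_filter(customers, in_california)
new_matches, new_handed_over = read_with_pushdown(customers, in_california)
```

Both approaches return the same matching rows; the pushdown version simply hands fewer rows across the boundary, and the saving grows with the width of each row.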

A WORD ON ORCFILE INLINE INDEXES

Before we move to the next section, we need to spend a moment on how ORCFile breaks rows into row groups and applies columnar compression and indexing within these row groups.

TURNING PREDICATE PUSHDOWN TO 11

ORC's predicate pushdown consults the inline indexes to try to identify when entire blocks can be skipped all at once. Sometimes your dataset will naturally facilitate this. For instance, if your data arrives as a time series with a monotonically increasing timestamp, then when you put a WHERE condition on this timestamp, ORC will be able to skip a lot of row groups.

In other instances you may need to give things a kick by sorting the data. If a column is sorted, relevant records are confined to one area on disk and the other pieces can be skipped very quickly.

Skipping works for number types and for string types. In both cases it is done by recording a min and max value inside the inline index and determining whether the lookup value falls outside that range.
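The min/max check described above can be sketched as follows. This is an illustrative Python model, not ORC's implementation: `build_index` and `count_equal` are hypothetical names, and the index here is just a list of (start, end, min, max) tuples, one per group of `stride` rows.

```python
def build_index(values, stride):
    """Record (start, end, min, max) for each group of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        index.append((start, start + len(group), min(group), max(group)))
    return index


def count_equal(values, index, target):
    """Count rows equal to `target`, skipping groups whose range excludes it."""
    count = 0
    groups_read = 0
    for start, end, lo, hi in index:
        if target < lo or target > hi:
            continue  # target cannot be in this group, so skip it entirely
        groups_read += 1
        count += sum(1 for v in values[start:end] if v == target)
    return count, groups_read


# A monotonically increasing "timestamp" column: 100 rows, 10 rows per group.
timestamps = list(range(100))
index = build_index(timestamps, stride=10)
count, groups_read = count_equal(timestamps, index, target=42)
```

With the increasing timestamps above, only one of the ten groups has to be read to answer the lookup. The same comparison works for strings, since they are ordered too.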

Sorting can lead to very nice speedups. The trade-off is that you need to decide in advance which columns to sort on. The decision-making process is somewhat similar to deciding which columns to index in traditional SQL systems. The best payback comes from a column that is used in a lot of queries and is accessed with very specific conditions. Remember that you can have Hive sort on a column by declaring it in a SORTED BY clause (part of CLUSTERED BY ... SORTED BY ... INTO n BUCKETS) when creating the table and setting hive.enforce.sorting to true before inserting into the table.
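A small Python sketch shows why sorting matters for min/max skipping. It is a toy model with made-up data: `groups_to_read` is a hypothetical helper that counts how many row groups have a [min, max] range that could contain the value being looked up.

```python
def groups_to_read(values, stride, target):
    """Number of row groups whose [min, max] range could contain `target`."""
    needed = 0
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        if min(group) <= target <= max(group):
            needed += 1
    return needed


# Twelve rows spread over three states, interleaved as they might arrive.
unsorted_states = ["CA", "NY", "TX"] * 4
sorted_states = sorted(unsorted_states)

unsorted_reads = groups_to_read(unsorted_states, stride=4, target="NY")
sorted_reads = groups_to_read(sorted_states, stride=4, target="NY")
```

In the interleaved layout every group spans CA to TX, so none can be ruled out; after sorting, the NY rows sit in a single group and the other two are skipped.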

ORCFile is an important piece of our Stinger Initiative to improve Hive performance 100x. To show the impact, we ran a modified TPC-DS Query 27 against a modified data schema. Query 27 does a star schema join on a large fact table, accessing 4 separate dimension tables. In the modified schema, the state in which the sale is made is denormalized into the fact table and the resulting table is sorted by state. This way, when the query scans the fact table, it can skip entire blocks of rows because the query filters on the state. This results in some incremental speedup, as the accompanying chart shows.

This feature gives you the best bang for the buck when:

1. You frequently filter a large fact table in a precise way on a column with moderate to large cardinality.
2. You select a large number of columns, or wide columns. The more data marshaling you save, the greater your speedup will be.

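[Translator's note on fact tables:

Fact tables and dimension tables are the two kinds of tables in a data warehouse. As far as storage goes they are no different; both are ordinary tables. The difference lies in what they hold: a fact table stores facts, that is, measurable and additive data such as quantities and amounts, while a dimension table stores descriptive data that describes the facts, such as region, sales representative or product. A star schema is one fact table joined to multiple dimension tables.

The basic characteristic of a star schema is that it consists of multiple dimension tables and a fact table. The dimension tables represent the dimensions of analysis, and the fact table holds the computed results across those dimensions. Because a star schema stores results preprocessed along its dimensions, multidimensional analysis is fast; if the analysis goes beyond the predefined dimensions, however, adding a dimension is difficult.]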

USING ORCFILE

Using ORCFile, or converting existing data to ORCFile, is simple. To use it, just add STORED AS orc to the end of your CREATE TABLE statements, like this:

CREATE TABLE mytable ( ... ) STORED AS orc;

To convert existing data to ORCFile, create a table with the same schema as the source table plus STORED AS orc, then issue a query like:

INSERT INTO TABLE orctable SELECT * FROM oldtable;

Hive will handle all the details of the conversion to ORCFile, and you are then free to delete the old table to free up loads of space.

When you create an ORC table, there are a number of table properties you can use to further tune the way ORC works.

Key	Default	Notes
`orc.compress`	`ZLIB`	Compression to use in addition to columnar compression (one of NONE, ZLIB, SNAPPY)
`orc.compress.size`	`262,144` (= 256 KiB)	Number of bytes in each compression chunk
`orc.stripe.size`	`268,435,456` (= 256 MiB)	Number of bytes in each stripe
`orc.row.index.stride`	`10,000`	Number of rows between index entries (must be >= 1,000)
`orc.create.index`	`true`	Whether to create inline indexes

For example, let's say you want to use Snappy compression instead of zlib compression. Here's how:

CREATE TABLE mytable ( ... ) STORED AS orc tblproperties ("orc.compress"="SNAPPY");

TRY IT OUT

All these features are available in our HDP 2 Beta, and we encourage you to download it, try them out and give us your feedback.
