[Hive - LanguageManual] Archiving for File Count Reduction
Archiving for File Count Reduction
Note: Archiving should be considered an advanced command due to the caveats involved.
Overview
Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption(消费) in the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are >50-100 million files. In such situations, it is advantageous(有利的) to have as few files as possible.
The use of Hadoop Archives is one approach(途径) to reducing the number of files in partitions. (减少分区里面的文件数量)Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR) so that a partition that may once have consisted of 100's of files can occupy just ~3 files (depending on settings). However, the trade-off(交易,权衡) is that queries may be slower due to the additional overhead in reading from the HAR. (但是读数据的时候可能会稍稍变慢)
Note that archiving does NOT compress the files – HAR is analogous to the Unix tar command.
Archiving 并非压缩文件,非常类似与Unix系统的tar命令 (按我的理解是:仅打包,不压缩)
tar -zcvf /tmp/etc.tar.gz /etc <==打包后,以 gzip 压缩
tar -jcvf /tmp/etc.tar.bz2 /etc <==打包后,以 bzip2 压缩
tar -zxvf /tmp/etc.tar.gz 解压
tar -jxvf /tmp/etc.tar.bz2 解压
Settings
There are 3 settings that should be configured before archiving is used. (Example values are shown.)
hive> set hive.archive.enabled=true;hive> set hive.archive.har.parentdir.settable=true;hive> set har.partfile.size=1099511627776; |
hive.archive.enabled controls whether archiving operations are enabled.
hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file will contain the directory structure dir2/file. In older versions of Hadoop (prior to 2011), this option was not available and therefore Hive must be configured to accommodate(适应) this limitation.
har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition/har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
Usage
Archive
Once the configuration values are set, a partition can be archived with the command:
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...) |
For example:
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12') |
Once the command is issued, a mapreduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate process.
Unarchive
The partition can be reverted back to its original files with the unarchive command:
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12') |
Cautions and Limitations 警告和限制
- In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other errors. Be sure that these patches are integrated into your version of Hadoop:
https://issues.apache.org/jira/browse/HADOOP-6591 (fixed in Hadoop 0.21.0)
https://issues.apache.org/jira/browse/MAPREDUCE-1548 (fixed in Hadoop 0.22.0)
https://issues.apache.org/jira/browse/MAPREDUCE-2143 (fixed in Hadoop 0.22.0)
https://issues.apache.org/jira/browse/MAPREDUCE-1752 (fixed in Hadoop 0.23.0)
- The HarFileSystem class still has a bug that has yet to be fixed:
https://issues.apache.org/jira/browse/MAPREDUCE-1877 (moved to https://issues.apache.org/jira/browse/HADOOP-10906 in 2014)
Hive comes with the HiveHarFileSystem class that addresses some of these issues, and is by default the value for fs.har.impl. Keep this in mind if you're rolling your own version of HarFileSystem:
- The default HiveHarFileSystem.getFileBlockLocations() has no locality. That means it may introduce higher network loads or reduced performance.
- Archived partitions cannot be overwritten with INSERT OVERWRITE. The partition must be unarchived first.
- If two processes attempt to archive the same partition at the same time, bad things could happen. (Need to implement concurrency support.)
Under the Hood
Internally, when a partition is archived, a HAR is created using the files from the partition's original location (such as /warehouse/table/ds=1). The parent directory of the partition is specified to be the same as the original location and the resulting archive is named 'data.har'. The archive is moved under the original directory (such as /warehouse/table/ds=1/data.har), and the partition's location is changed to point to the archive.
[Hive - LanguageManual] Archiving for File Count Reduction的更多相关文章
- Hive:org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /mydir is exceeded: quota=100000 file count=100001
集群中遇到了文件个数超出限制的错误: 0)昨天晚上spark 任务突然抛出了异常:org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: T ...
- [Hive - LanguageManual] Alter Table/Partition/Column
Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...
- [Hive - LanguageManual] DML: Load, Insert, Update, Delete
LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...
- [Hive - LanguageManual] GroupBy
Group By Syntax Simple Examples Select statement and group by clause Advanced Features Multi-Group-B ...
- [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)
Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...
- [Hive - LanguageManual ] Explain (待)
EXPLAIN Syntax EXPLAIN Syntax Hive provides an EXPLAIN command that shows the execution plan for a q ...
- [Hive - LanguageManual ] Windowing and Analytics Functions (待)
LanguageManual WindowingAndAnalytics Skip to end of metadata Added by Lefty Leverenz, last edi ...
- [Hive - LanguageManual] VirtualColumns
Virtual Columns Simple Examples Virtual Columns Hive 0.8.0 provides support for two virtual columns: ...
- Hive LanguageManual DDL
hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...
随机推荐
- NSDictionary 初始化
NSDictionary *dic1=[NSDictionary dictionaryWithObject:@"1" forKey:@"a"]; ...
- NC / Netcat - 反弹Shell
原理 实验环境: 攻击机:windows机器,IP:192.168.12.109 受害机:linux机器,IP:192.168.79.1 攻击机:设置本地监听端口2222 C:\netcat>n ...
- Buffer数据结构和new IO的Memory-mapped files
一.Buffer类 java.nio.Buffer这个类是用来干什么的?有怎样的结构? "Core Java"中是这样定义的“A buffer is array of values ...
- jps
jps(Java Virtual Machine Process Status Tool)是JDK 1.5提供的一个显示当前所有java进程pid的命令,简单实用,非常适合在linux/unix平台上 ...
- Android开发之获取状态栏高度、屏幕的宽和高
转自:http://blog.csdn.net/guolin_blog/article/details/16919859 获取状态栏的高度. private static int statusBarH ...
- abstract的method是否可同时是static,是否可同时是native,是否可同时是synchronized?
abstract的method不可以是static的,因为抽象的方法是要被子类实现的,而static与子类扯不上关系! native方法表示该方法要用另外一种依赖平台的编程语言实现的,不存在着被子类实 ...
- 在win7系统下使用TortoiseGit(乌龟git)简单操作Git@OSC
非常感谢OSC提供了这么好的一个国内的免费的git托管平台.这里简单说下TortoiseGit操作的流程.很傻瓜了 首先你要准备两个软件,分别是msysgit和tortoisegit,乌龟还可以在下载 ...
- acdream 1408 "Money, Money, Money" (水)
题意:给出一个自然数x,要求找两个自然数a和b,任意多个a和b的组合都不能等于x,且要可以组合成大于x的任何自然数.如果找不到就输出两个0.(输出任意一个满足条件的答案) 思路:x=偶数时,无法构成, ...
- Swift入门篇-Hello World
提示:如果您使用手机和平板电脑看到这篇文章,您请在WIFI的环境下阅读,里面有很多图片, 会浪费很多流量. 博主语文一直都不好(如有什么错别字,请您在下评论)望您谅解,没有上过什么学的 最近这2天主要 ...
- 省常中模拟 Test4
prime 数论 题意:分别求 1*n.2*n.3*n.... n*n 关于模 p 的逆元.p 是质数,n < p. 初步解法:暴力枚举.因为 a 关于模 p 的逆元 b 满足 ab mod p ...