[Hive - LanguageManual] Archiving for File Count Reduction
Note: Archiving should be considered an advanced command due to the caveats involved.
Overview
Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption in the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are more than 50-100 million files. In such situations, it is advantageous to have as few files as possible.
The use of Hadoop Archives is one approach to reducing the number of files in partitions. Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR) so that a partition that may once have consisted of hundreds of files can occupy just ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead in reading from the HAR.
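Note that archiving does NOT compress the files: HAR is analogous to the Unix tar command, which only bundles files together and does not compress them.
For comparison, tar compresses only when given a compression flag:
tar -zcvf /tmp/etc.tar.gz /etc     <== bundle, then compress with gzip
tar -jcvf /tmp/etc.tar.bz2 /etc    <== bundle, then compress with bzip2
tar -zxvf /tmp/etc.tar.gz          <== extract a gzip-compressed archive
tar -jxvf /tmp/etc.tar.bz2         <== extract a bzip2-compressed archive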
Settings
There are 3 settings that should be configured before archiving is used. (Example values are shown.)
hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;
hive.archive.enabled controls whether archiving operations are enabled.
hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file will contain the directory structure dir2/file. In older versions of Hadoop (prior to 2011), this option was not available and therefore Hive must be configured to accommodate this limitation.
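The effect of the parent directory on the stored structure can be sketched in a few lines of Python. This is only an illustration of the path arithmetic described above, not Hadoop's implementation; the function name path_inside_archive is hypothetical.

```python
import posixpath


def path_inside_archive(file_path: str, parent_dir: str) -> str:
    """Return the path stored in the archive, relative to the parent directory.

    Hypothetical helper: it models only the relationship described in the
    text and does not call Hadoop.
    """
    return posixpath.relpath(file_path, start=parent_dir)


# The example from the text: /dir1/dir2/file archived with /dir1 as parent.
print(path_inside_archive("/dir1/dir2/file", "/dir1"))  # prints dir2/file
```

With the partition's own directory as the parent, the same arithmetic leaves just the file names inside the archive.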
har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition/har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
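As a worked example of the formula above, the following Python sketch computes the number of part files for a given partition size. The function name and the partition sizes are made up for illustration; only the rounding rule and the example har.partfile.size value come from the text. A Hadoop archive also stores its index files next to the part files, which would be consistent with the "~3 files" figure in the overview, though that link is an assumption here.

```python
def har_part_file_count(partition_bytes: int, partfile_bytes: int) -> int:
    """Part files in the archive: size_of_partition / har.partfile.size, rounded up."""
    if partfile_bytes <= 0:
        raise ValueError("partfile_bytes must be positive")
    # Integer ceiling division, avoiding floating-point rounding on large sizes.
    return -(-partition_bytes // partfile_bytes)


TIB = 1024 ** 4  # 1099511627776, the example har.partfile.size value
GIB = 1024 ** 3

# Illustrative partition sizes (assumptions, not from the source).
print(har_part_file_count(300 * GIB, TIB))     # prints 1
print(har_part_file_count(5 * TIB // 2, TIB))  # 2.5 TiB partition, prints 3
```

A smaller har.partfile.size raises the part-file count, which is the trade-off against archiving time mentioned above.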
Usage
Archive
Once the configuration values are set, a partition can be archived with the command:
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
For example:
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
Once the command is issued, a MapReduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate progress.
Unarchive
The partition can be reverted back to its original files with the unarchive command:
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
Cautions and Limitations
- In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other errors. Be sure that these patches are integrated into your version of Hadoop:
https://issues.apache.org/jira/browse/HADOOP-6591 (fixed in Hadoop 0.21.0)
https://issues.apache.org/jira/browse/MAPREDUCE-1548 (fixed in Hadoop 0.22.0)
https://issues.apache.org/jira/browse/MAPREDUCE-2143 (fixed in Hadoop 0.22.0)
https://issues.apache.org/jira/browse/MAPREDUCE-1752 (fixed in Hadoop 0.23.0)
- The HarFileSystem class still has a bug that has yet to be fixed:
https://issues.apache.org/jira/browse/MAPREDUCE-1877 (moved to https://issues.apache.org/jira/browse/HADOOP-10906 in 2014)
Hive comes with the HiveHarFileSystem class that addresses some of these issues, and is by default the value for fs.har.impl. Keep this in mind if you're rolling your own version of HarFileSystem:
- The default HiveHarFileSystem.getFileBlockLocations() has no locality. That means it may introduce higher network loads or reduced performance.
- Archived partitions cannot be overwritten with INSERT OVERWRITE. The partition must be unarchived first.
- If two processes attempt to archive the same partition at the same time, bad things could happen. (Need to implement concurrency support.)
Under the Hood
Internally, when a partition is archived, a HAR is created using the files from the partition's original location (such as /warehouse/table/ds=1). The parent directory of the partition is specified to be the same as the original location and the resulting archive is named 'data.har'. The archive is moved under the original directory (such as /warehouse/table/ds=1/data.har), and the partition's location is changed to point to the archive.