Archiving for File Count Reduction

Note: Archiving should be considered an advanced command due to the caveats involved.

Overview

Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption(消费) in the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are >50-100 million files. In such situations, it is advantageous(有利的) to have as few files as possible.

The use of Hadoop Archives is one approach(途径) to reducing the number of files in partitions. (减少分区里面的文件数量)Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR) so that a partition that may once have consisted of 100's of files can occupy just ~3 files (depending on settings). However, the trade-off(交易,权衡) is that queries may be slower due to the additional overhead in reading from the HAR. (但是读数据的时候可能会稍稍变慢)

Note that archiving does NOT compress the files – HAR is analogous to the Unix tar command.

Archiving 并非压缩文件,非常类似与Unix系统的tar命令 (按我的理解是:仅打包,不压缩)

tar -zcvf /tmp/etc.tar.gz  /etc  <==打包后,以 gzip 压缩
tar -jcvf /tmp/etc.tar.bz2 /etc <==打包后,以 bzip2 压缩
tar -zxvf /tmp/etc.tar.gz 解压
tar -jxvf /tmp/etc.tar.bz2 解压

Settings

There are 3 settings that should be configured before archiving is used. (Example values are shown.)

hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;

hive.archive.enabled controls whether archiving operations are enabled.

hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file will contain the directory structure dir2/file. In older versions of Hadoop (prior to 2011), this option was not available and therefore Hive must be configured to accommodate(适应) this limitation.

har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition/har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.

Usage

Archive

Once the configuration values are set, a partition can be archived with the command:

ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)

For example:

ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')

Once the command is issued, a mapreduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate process.

Unarchive

The partition can be reverted back to its original files with the unarchive command:

ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')

Cautions and Limitations 警告和限制

  • In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other errors. Be sure that these patches are integrated into your version of Hadoop:

https://issues.apache.org/jira/browse/HADOOP-6591 (fixed in Hadoop 0.21.0)

https://issues.apache.org/jira/browse/MAPREDUCE-1548 (fixed in Hadoop 0.22.0)

https://issues.apache.org/jira/browse/MAPREDUCE-2143 (fixed in Hadoop 0.22.0)

https://issues.apache.org/jira/browse/MAPREDUCE-1752 (fixed in Hadoop 0.23.0)

  • The HarFileSystem class still has a bug that has yet to be fixed:

https://issues.apache.org/jira/browse/MAPREDUCE-1877 (moved to https://issues.apache.org/jira/browse/HADOOP-10906 in 2014)

Hive comes with the HiveHarFileSystem class that addresses some of these issues, and is by default the value for fs.har.impl. Keep this in mind if you're rolling your own version of HarFileSystem:

  • The default HiveHarFileSystem.getFileBlockLocations() has no locality. That means it may introduce higher network loads or reduced performance.
  • Archived partitions cannot be overwritten with INSERT OVERWRITE. The partition must be unarchived first.
  • If two processes attempt to archive the same partition at the same time, bad things could happen. (Need to implement concurrency support.)

Under the Hood

Internally, when a partition is archived, a HAR is created using the files from the partition's original location (such as /warehouse/table/ds=1). The parent directory of the partition is specified to be the same as the original location and the resulting archive is named 'data.har'. The archive is moved under the original directory (such as /warehouse/table/ds=1/data.har), and the partition's location is changed to point to the archive.

[Hive - LanguageManual] Archiving for File Count Reduction的更多相关文章

  1. Hive:org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /mydir is exceeded: quota=100000 file count=100001

    集群中遇到了文件个数超出限制的错误: 0)昨天晚上spark 任务突然抛出了异常:org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: T ...

  2. [Hive - LanguageManual] Alter Table/Partition/Column

    Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...

  3. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  4. [Hive - LanguageManual] GroupBy

    Group By Syntax Simple Examples Select statement and group by clause Advanced Features Multi-Group-B ...

  5. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  6. [Hive - LanguageManual ] Explain (待)

    EXPLAIN Syntax EXPLAIN Syntax Hive provides an EXPLAIN command that shows the execution plan for a q ...

  7. [Hive - LanguageManual ] Windowing and Analytics Functions (待)

    LanguageManual WindowingAndAnalytics     Skip to end of metadata   Added by Lefty Leverenz, last edi ...

  8. [Hive - LanguageManual] VirtualColumns

    Virtual Columns Simple Examples Virtual Columns Hive 0.8.0 provides support for two virtual columns: ...

  9. Hive LanguageManual DDL

    hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...

随机推荐

  1. Photoshop:模拟钢笔压力

    预期效果: 方法: 按B选择画笔工具,设置"钢笔压力" 路径描边:选择路径,右击路径,选择“描边路径” 或在路径面板,选择: 即可得到预期效果!

  2. MSSQLServer基础07(事务,存储过程,分页的存储过程,触发器)

    事务 事务:保证多个操作全部成功,否则全部失败,这处机制就是事务 思考:下了个订单,但是在保存详细信息时出错了,这样可以成功吗? 数据库中的事务:代码全都成功则提交,如果有某一条语句失败则回滚,整体失 ...

  3. 深入探索 Java 热部署

    在 Java 开发领域,热部署一直是一个难以解决的问题,目前的 Java 虚拟机只能实现方法体的修改热部署,对于整个类的结构修改,仍然需要重启虚拟机,对类重新加载才能完成更新操作.对于某些大型的应用来 ...

  4. OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务

    OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务   1.  OpenVAS基础知识 OpenVAS(Open Vulnerability Assessment Sys ...

  5. javascript中===与==

    == equality 等同,=== identity 恒等. ==, 两边值类型不同的时候,要先进行类型转换,再比较. ===,不做类型转换,类型不同的一定不等. 类型转换规则:Boolean> ...

  6. .net MVC APi调用

    常用的调用方法为Get/Post Get方法: 服务器 public string Get(int id) { return "value"; } 这个直接在网页就可以测试,用 h ...

  7. js、javascript正则表达式验证身份证号码

    function isCardNo(card) { // 身份证号码为15位或者18位,15位时全为数字,18位前17位为数字,最后一位是校验位,可能为数字或字符X var reg = /(^\d{1 ...

  8. window注册表

    打开注册表: 可以用快捷键 win + r  ,然后输入 Regedit 回车,会打开注册表. 注册表添加一个键值对到 操作如下: 1.先创建一个 .reg 后缀的文件. 2.文件内容如下: Wind ...

  9. Mysql 临时变量的 定义 和 赋值 Set 和 Into 赋值; Swith Mysql版本 Case When的用法

    一:临时变量的定义和赋值 DECLARE spot SMALLINT; -- 分隔符的位置 DECLARE tempId VARCHAR(64); -- 循环 需要用到的临时的Cid DECLARE ...

  10. UVA 11478 Halum(用bellman-ford解差分约束)

    对于一个有向带权图,进行一种操作(v,d),对以点v为终点的边的权值-d,对以点v为起点的边的权值+d.现在给出一个有向带权图,为能否经过一系列的(v,d)操作使图上的每一条边的权值为正,若能,求最小 ...