splittability

 
 

Skip to end of metadata

 

Go to start of metadata

 

Compressed Data Storage

Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage; both in terms of disk usage and query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause underutilization of your cluster's 'mapping' power.

【 A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.】

The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
 
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value for io.seqfile.compression.type determines how the compression is performed. Record compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.

LZO Compression

See LZO Compression for information about using LZO with Hive.

 

splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章

  1. 初次启动hive,解决 ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory问题

    >>提君博客原创  http://www.cnblogs.com/tijun/  << 刚刚安装好hive,进行第一次启动 提君博客原创 [hadoop@ltt1 bin]$ ...

  2. hadoop输入分片计算(Map Task个数的确定)

    作业从JobClient端的submitJobInternal()方法提交作业的同时,调用InputFormat接口的getSplits()方法来创建split.默认是使用InputFormat的子类 ...

  3. Hadoop 使用Combiner提高Map/Reduce程序效率

    众所周知,Hadoop框架使用Mapper将数据处理成一个<key,value>键值对,再网络节点间对其进行整理(shuffle),然后使用Reducer处理数据并进行最终输出. 在上述过 ...

  4. C#、JAVA操作Hadoop(HDFS、Map/Reduce)真实过程概述。组件、源码下载。无法解决:Response status code does not indicate success: 500。

    一.Hadoop环境配置概述 三台虚拟机,操作系统为:Ubuntu 16.04. Hadoop版本:2.7.2 NameNode:192.168.72.132 DataNode:192.168.72. ...

  5. 【hadoop】如何向map和reduce脚本传递参数,加载文件和目录

    本文主要讲解三个问题:       1 使用Java编写MapReduce程序时,如何向map.reduce函数传递参数.       2 使用Streaming编写MapReduce程序(C/C++ ...

  6. Hadoop 2.4.1 Map/Reduce小结【原创】

    看了下MapReduce的例子.再看了下Mapper和Reducer源码,理清了参数的意义,就o了. public class Mapper<KEYIN, VALUEIN, KEYOUT, VA ...

  7. hadoop 2.2.0 编译报错: [ERROR] class file for org.mortbay.component.AbstractLifeCycle not found

    [ERROR]  class file for org.mortbay.component.AbstractLifeCycle not found 错误堆栈如下: [ERROR] COMPILATIO ...

  8. hadoop用mutipleInputs实现map读取不同格式的文件

    mapmap读取不同格式的文件这个问题一直就有,之前的读取方式是在map里获取文件的名称,依照名称不同分不同的方式读取,比如以下的方式 //取文件名 InputSplit inputSplit = c ...

  9. Weekly Contest 78-------->811. Subdomain Visit Count (split string with space and hash map)

    A website domain like "discuss.leetcode.com" consists of various subdomains. At the top le ...

随机推荐

  1. 标准C程序设计七---103

    Linux应用             编程深入            语言编程 标准C程序设计七---经典C11程序设计    以下内容为阅读:    <标准C程序设计>(第7版) 作者 ...

  2. [Bzoj4942][Noi2017]整数(线段树)

    4942: [Noi2017]整数 Time Limit: 50 Sec  Memory Limit: 512 MBSubmit: 363  Solved: 237[Submit][Status][D ...

  3. [CQOI2018] 社交网络

    题目背景 当今社会,在社交网络上看朋友的消息已经成为许多人生活的一部分.通常,一个用户在社交网络上发布一条消息(例如微博.状态.Tweet等) 后,他的好友们也可以看见这条消息,并可能转发.转发的消息 ...

  4. BZOJ3295动态逆序对

    一道比较傻的CDQ分治 CDQ: 主要用于解决三位偏序的问题 #include<cstdio> #include<cctype> #include<algorithm&g ...

  5. xamarin android 获取根证书代码

    Java.Security.KeyStore keyStore = Java.Security.KeyStore.GetInstance("AndroidCAStore"); ke ...

  6. linux删除空行操作:awk、grep、tr、sed

    如下:如何删除空行 shen\nshen\\n sen seh sehe she she 真正删除空行,可以使用vim: 通过命令模式删除空行.vim在命令模式下(在vim里输入英文字符:进入命令模式 ...

  7. 这一篇里面有很多关于scala的list的操作的好的知识

    https://www.cnblogs.com/weilunhui/p/5658860.html 1.++[B]   在A元素后面追加B元素 1 2 3 4 5 6 7 8 9 10 11 12 13 ...

  8. iOS上如何让按钮(UIbutton)文本左对齐展示

    // button.titleLabel.textAlignment = NSTextAlignmentLeft; 这句无效 button.contentHorizontalAlignment = U ...

  9. 如何将MID音乐转换成MP3

    1 使用Direct MIDI to MP3 Converter这个软件,你可以从下面这个网站下载:http://www.hanzify.org/index.php?Go=Show::List& ...

  10. d3js 画布 概念

    HTML 5 提供两种强有力的“画布”:SVG 和 Canvas. SVG 有如下特点: SVG 绘制的是矢量图,因此对图像进行放大不会失真. 基于 XML,可以为每个元素添加 JavaScript ...