Splittability: a SequenceFile can be split by Hadoop and distributed across map jobs, whereas a GZIP file cannot be
- Created by Confluence Administrator, last modified by Lefty Leverenz on Sep 19, 2017
Compressed Data Storage
Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage, in terms of both disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression is detected automatically and the file is decompressed on the fly during query execution. For example:
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
The table 'raw' is stored as a TextFile, which is the default storage format. However, because the loaded file is gzip-compressed, Hadoop cannot split it into chunks/blocks and run multiple maps in parallel, so the whole file is read by a single map task. This can leave much of your cluster's mapping capacity unused.
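To confirm which storage format a table actually uses, the table metadata can be inspected. This is a minimal sketch that assumes the 'raw' table from the example above already exists; the exact layout of the output varies between Hive versions.

```sql
-- Print detailed table metadata, including the InputFormat and
-- OutputFormat classes. For a table created without a STORED AS
-- clause, the InputFormat row is expected to name a text input format.
DESCRIBE FORMATTED raw;
```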
The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
The value of io.seqfile.compression.type determines how the compression is performed. NONE leaves records uncompressed, RECORD compresses each value individually, and BLOCK buffers up 1MB (default) of records before compressing them together.
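As a sketch of how the compression mode might be chosen in a session, the statements below reuse the 'raw' and 'raw_sequence' tables from the example above. The codec line is an assumption and not part of the original example: mapred.output.compression.codec is the older Hadoop property name for the output codec, and newer Hadoop releases name it mapreduce.output.fileoutputformat.compress.codec.

```sql
-- Compress the files written by the INSERT below.
SET hive.exec.compress.output=true;

-- BLOCK buffers many records and compresses them together.
-- Switch to RECORD to compress each value on its own.
SET io.seqfile.compression.type=BLOCK;

-- Assumption: pick the codec explicitly instead of relying on the
-- cluster default. GzipCodec ships with Hadoop.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Rewrite the data from the TextFile table into the SequenceFile table.
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
```

Because the codec is applied inside the SequenceFile container rather than to the file as a whole, the output remains splittable even when a gzip codec is used.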
LZO Compression
See LZO Compression for information about using LZO with Hive.