splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.
splittability
- Created by Confluence Administrator, last modified by Lefty Leverenz on Sep 19, 2017
Compressed Data Storage
Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage; both in terms of disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; |
The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause underutilization of your cluster's 'mapping' power.
【 A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.】
The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';CREATE TABLE raw_sequence (line STRING) STORED AS SEQUENCEFILE;LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;SET hive.exec.compress.output=true;SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw; |
The value for io.seqfile.compression.type determines how the compression is performed. Record compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.
LZO Compression
See LZO Compression for information about using LZO with Hive.
splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章
- 初次启动hive,解决 ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory问题
>>提君博客原创 http://www.cnblogs.com/tijun/ << 刚刚安装好hive,进行第一次启动 提君博客原创 [hadoop@ltt1 bin]$ ...
- hadoop输入分片计算(Map Task个数的确定)
作业从JobClient端的submitJobInternal()方法提交作业的同时,调用InputFormat接口的getSplits()方法来创建split.默认是使用InputFormat的子类 ...
- Hadoop 使用Combiner提高Map/Reduce程序效率
众所周知,Hadoop框架使用Mapper将数据处理成一个<key,value>键值对,再网络节点间对其进行整理(shuffle),然后使用Reducer处理数据并进行最终输出. 在上述过 ...
- C#、JAVA操作Hadoop(HDFS、Map/Reduce)真实过程概述。组件、源码下载。无法解决:Response status code does not indicate success: 500。
一.Hadoop环境配置概述 三台虚拟机,操作系统为:Ubuntu 16.04. Hadoop版本:2.7.2 NameNode:192.168.72.132 DataNode:192.168.72. ...
- 【hadoop】如何向map和reduce脚本传递参数,加载文件和目录
本文主要讲解三个问题: 1 使用Java编写MapReduce程序时,如何向map.reduce函数传递参数. 2 使用Streaming编写MapReduce程序(C/C++ ...
- Hadoop 2.4.1 Map/Reduce小结【原创】
看了下MapReduce的例子.再看了下Mapper和Reducer源码,理清了参数的意义,就o了. public class Mapper<KEYIN, VALUEIN, KEYOUT, VA ...
- hadoop 2.2.0 编译报错: [ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found 错误堆栈如下: [ERROR] COMPILATIO ...
- hadoop用mutipleInputs实现map读取不同格式的文件
mapmap读取不同格式的文件这个问题一直就有,之前的读取方式是在map里获取文件的名称,依照名称不同分不同的方式读取,比如以下的方式 //取文件名 InputSplit inputSplit = c ...
- Weekly Contest 78-------->811. Subdomain Visit Count (split string with space and hash map)
A website domain like "discuss.leetcode.com" consists of various subdomains. At the top le ...
随机推荐
- 用node写的一个后台框架
server.js var http=require('http') var handleUrl=require('./handleUrl') var config = require('./conf ...
- CocoaPods | iOS详细使用说明
一:介绍 在iOS开发中,经常会使用到第三方库,[CocoaPods](https://github.com/CocoaPods/CocoaPods)可以用来方便的统一管理这些第三方库. 下面就和大家 ...
- HDU 5988最小网络流(浮点数)
题目链接:http://acm.split.hdu.edu.cn/showproblem.php?pid=5988 哇,以前的模版一直T,加了优先队列优化才擦边过. 建图很好建,概率乘法化成概率加法不 ...
- Codeforces Gym101502 B.Linear Algebra Test-STL(map)
B. Linear Algebra Test time limit per test 3.0 s memory limit per test 256 MB input standard input ...
- Codeforces Gym100735 H.Words from cubes-二分图最大匹配匈牙利
赛后补题,还是要经常回顾,以前学过的匈牙利都忘记了,“猪队友”又给我讲了一遍... 怎么感觉二分图的匈牙利算法东西好多啊,啊啊啊啊啊啊啊啊啊(吐血...) 先传送一个写的很好的博客,害怕智障找不到了. ...
- Maven的构建生命周期理解
以下引用官方的生命周期解释https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html: 一.构建生命 ...
- 【GitHub】给GitHub上的ReadMe.md文件中添加图片怎么做 、 gitHub创建文件夹
1.首先在github上的仓库上,创建一个空的文件夹,用于上传图片 上图看 要点击的按钮是创建新的文件,并不是创建新的文件夹,具体怎么?往下看 这个时候,下面的提交按钮才能提交 2.进入新创建的文件夹 ...
- WCF中常用的binding方式 z
WCF中常用的binding方式: BasicHttpBinding: 用于把 WCF 服务当作 ASMX Web 服务.用于兼容旧的Web ASMX 服务. WSHttpBinding: 比 Bas ...
- Canvas的效果操作及save()和restore()方法应用
平移.缩放.旋转等操作等于是,我在一个正的画布绘制好图,然后再把画布做旋转.平移.缩放等等的效果. 也就是说,我使用的X.Y坐标还是正常的坐标(没旋转.平移.缩放等之前的坐标). save()和res ...
- C++ 面试问题
一面 (1) 多态性都有哪些?(静态和动态,然后分别叙述了一下虚函数和函数重载) (2) 动态绑定怎么实现?(就是问了一下基类与派生类指针和引用的转换问题) (3) 类型转换有哪些?(四种类型转换,分 ...