splittability

CompressedStorage

Created by Confluence Administrator, last modified by Lefty Leverenz on Sep 19, 2017

Compressed Data Storage

Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage; both in terms of disk usage and query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:

CREATE TABLE raw (line STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause underutilization of your cluster's 'mapping' power.

【 A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.】

The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:

CREATE TABLE raw (line STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)

STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true;

SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)

INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value for io.seqfile.compression.type determines how the compression is performed. Record compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.

LZO Compression

See LZO Compression for information about using LZO with Hive.

splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章

初次启动hive,解决 ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory问题
>>提君博客原创 http://www.cnblogs.com/tijun/ << 刚刚安装好hive,进行第一次启动提君博客原创 [hadoop@ltt1 bin]$ ...
hadoop输入分片计算(Map Task个数的确定)
作业从JobClient端的submitJobInternal()方法提交作业的同时,调用InputFormat接口的getSplits()方法来创建split.默认是使用InputFormat的子类 ...
Hadoop 使用Combiner提高Map/Reduce程序效率
众所周知,Hadoop框架使用Mapper将数据处理成一个<key,value>键值对,再网络节点间对其进行整理(shuffle),然后使用Reducer处理数据并进行最终输出. 在上述过 ...
C#、JAVA操作Hadoop（HDFS、Map/Reduce）真实过程概述。组件、源码下载。无法解决：Response status code does not indicate success: 500。
一.Hadoop环境配置概述三台虚拟机,操作系统为:Ubuntu 16.04. Hadoop版本:2.7.2 NameNode:192.168.72.132 DataNode:192.168.72. ...
【hadoop】如何向map和reduce脚本传递参数,加载文件和目录
本文主要讲解三个问题: 1 使用Java编写MapReduce程序时,如何向map.reduce函数传递参数. 2 使用Streaming编写MapReduce程序(C/C++ ...
Hadoop 2.4.1 Map/Reduce小结【原创】
看了下MapReduce的例子.再看了下Mapper和Reducer源码,理清了参数的意义,就o了. public class Mapper<KEYIN, VALUEIN, KEYOUT, VA ...
hadoop 2.2.0 编译报错: [ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found 错误堆栈如下: [ERROR] COMPILATIO ...
hadoop用mutipleInputs实现map读取不同格式的文件
mapmap读取不同格式的文件这个问题一直就有,之前的读取方式是在map里获取文件的名称,依照名称不同分不同的方式读取,比如以下的方式 //取文件名 InputSplit inputSplit = c ...
Weekly Contest 78-------->811. Subdomain Visit Count (split string with space and hash map)
A website domain like "discuss.leetcode.com" consists of various subdomains. At the top le ...

随机推荐

转 PV操作简单理解
传送门 PV操作简单理解进程通常分为就绪.运行和阻塞三个工作状态.三种状态在某些条件下可以转换,三者之间的转换关系如下: 进程三个状态之间的转换就是靠PV操作来控制的.PV操作主要就是P操作.V操作 ...
VirtualBox 與 Vmware 差異
VirtualBox 4.3.36_Ubuntu r105129 與 VMware® Workstation 12 Player 12.5.2 build-4638234, 分別在各自的 Ubunt ...
腾讯IM的那些坑
项目中接入腾讯IM,在这里记录下,以便大家解决问题时少走弯路 1.首先讲一下IM返回对象的问题: /** * 消息工厂方法 */ public static Message getMessage(TI ...
通过jQuery Ajax提交表单数据时同时上传附件
1.使用场景:需要使用ajax提交表单,但是提交的表单里含有附件上传 2.代码实现方式:  <form method="post" ...
codeforces gym 100825 D Rings
这题果然就是个暴力题.... 看每个T的四个方向,有'.',或者在边界上就填1 不然就填四个方向上最小的那个数再加1 然而写wa了几发,有点蠢... #include <bits/stdc++. ...
vue v-on:click传递动态参数
最近项目中要为一个循环列表动态传送当前点击列的数据,查了很久资料也没有一个完美的解决方案, 新手只能用vue的事件处理器与jquery的选择器做了一个不伦不类的方案,居然也能解决这个问题,作此记录留待 ...
[原创][FPGA][IP-Core]altlvds_tx & altlvds_rx
1. 概述 Alter公司的QuartusII软件提供了LVDS发送和接收的IP核供我们使用,其在本质上可以理解为并行-串行数据的转换器.其在官方文档(见附件)上也这样说过.其中的应用场景有告诉AD/ ...
squirrelsql安装
官网下载安装,第一次安装mac上,失败,后续重启mac看下.重启完后,还是起不来,估计和某些环境冲突,或者缺少环境使用squirrelsql如何连接hive? http://lxw1234.com/ ...
百科知识 epub文件如何打开
.epub 简介 EPub是一个自由的开放标准,属于一种可以"自动重新编排"的内容:也就是文字内容可以根据阅读设备的特性,以最适于阅读的方式显示.EPub档案内部使用了XHTML或 ...
Odoo(OpenERP)开发实践：通过XML-RPC接口訪问Odoo数据库
Odoo(OpenERP)server支持通过XML-RPC接口訪问.操作数据库,基于此可实现与其它系统的交互与集成. 本文是使用Java通过XMLRPC接口操作Odoo数据库的简单演示样例.本例引用 ...

splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.

CompressedStorage

Compressed Data Storage

LZO Compression

splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章

随机推荐

热门专题