splittability

 
 

Skip to end of metadata

 

Go to start of metadata

 

Compressed Data Storage

Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage; both in terms of disk usage and query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause underutilization of your cluster's 'mapping' power.

【 A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.】

The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
 
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value for io.seqfile.compression.type determines how the compression is performed. Record compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.

LZO Compression

See LZO Compression for information about using LZO with Hive.

 

splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章

  1. 初次启动hive,解决 ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory问题

    >>提君博客原创  http://www.cnblogs.com/tijun/  << 刚刚安装好hive,进行第一次启动 提君博客原创 [hadoop@ltt1 bin]$ ...

  2. hadoop输入分片计算(Map Task个数的确定)

    作业从JobClient端的submitJobInternal()方法提交作业的同时,调用InputFormat接口的getSplits()方法来创建split.默认是使用InputFormat的子类 ...

  3. Hadoop 使用Combiner提高Map/Reduce程序效率

    众所周知,Hadoop框架使用Mapper将数据处理成一个<key,value>键值对,再网络节点间对其进行整理(shuffle),然后使用Reducer处理数据并进行最终输出. 在上述过 ...

  4. C#、JAVA操作Hadoop(HDFS、Map/Reduce)真实过程概述。组件、源码下载。无法解决:Response status code does not indicate success: 500。

    一.Hadoop环境配置概述 三台虚拟机,操作系统为:Ubuntu 16.04. Hadoop版本:2.7.2 NameNode:192.168.72.132 DataNode:192.168.72. ...

  5. 【hadoop】如何向map和reduce脚本传递参数,加载文件和目录

    本文主要讲解三个问题:       1 使用Java编写MapReduce程序时,如何向map.reduce函数传递参数.       2 使用Streaming编写MapReduce程序(C/C++ ...

  6. Hadoop 2.4.1 Map/Reduce小结【原创】

    看了下MapReduce的例子.再看了下Mapper和Reducer源码,理清了参数的意义,就o了. public class Mapper<KEYIN, VALUEIN, KEYOUT, VA ...

  7. hadoop 2.2.0 编译报错: [ERROR] class file for org.mortbay.component.AbstractLifeCycle not found

    [ERROR]  class file for org.mortbay.component.AbstractLifeCycle not found 错误堆栈如下: [ERROR] COMPILATIO ...

  8. hadoop用mutipleInputs实现map读取不同格式的文件

    mapmap读取不同格式的文件这个问题一直就有,之前的读取方式是在map里获取文件的名称,依照名称不同分不同的方式读取,比如以下的方式 //取文件名 InputSplit inputSplit = c ...

  9. Weekly Contest 78-------->811. Subdomain Visit Count (split string with space and hash map)

    A website domain like "discuss.leetcode.com" consists of various subdomains. At the top le ...

随机推荐

  1. SEO总结(一)

  2. vue.js源码学习分享(九)

    /* */ var arrayKeys = Object.getOwnPropertyNames(arrayMethods);//获取arrayMethods的属性名称 /** * By defaul ...

  3. 反汇编->C++引用与指针

    先看一段最简单代码 #include<iostream> #include<stdlib.h> using namespace std; int main() { int te ...

  4. 中国正式发放5G牌照 详细对比中美两国5G实力

    今天,中国5G商用走进新里程:工信部向中国电信.中国移动.中国联通.中国广电发放5G商用牌照,中国也成为继韩国.美国.瑞士.英国后,第五个正式商用5G的国家. 按照之前的规划,中国原定于2020年开启 ...

  5. 洛谷——P2853 [USACO06DEC]牛的野餐Cow Picnic

    P2853 [USACO06DEC]牛的野餐Cow Picnic 题目描述 The cows are having a picnic! Each of Farmer John's K (1 ≤ K ≤ ...

  6. redis 延迟消息

    1.查询下redis 是否打开了键空间通知功能 发现打开了,如果没有打开可以在执行下 我们可以看到参数设置 2.订阅下键空间或者事件通知 订阅键空间:subscribe __keyspace@0__: ...

  7. oracle 查看各表空间剩余量

    1.查看所有表空间大小.剩余量: select dbf.tablespace_name,dbf.totalspace "总量(M)",dbf.totalblocks as 总块数, ...

  8. go 协程与主线程强占运行

    最近在学习了Go 语言 ,  正好学习到了 协程这一块 ,遇到了困惑的地方.这个是go语言官方文档 . 在我的理解当中是,协程只能在主线程释放时间片后才会经过系统调度来运行协程,其实正确的也确实是这样 ...

  9. K-L变换

    K-L变换( Karhunen-Loeve Transform)是建立在统计特性基础上的一种变换,有的文献也称为霍特林(Hotelling)变换,因他在1933年最先给出将离散信号变换成一串不相关系数 ...

  10. eclipse从svn检出项目

    在eclipse的project explorer 右键->import->svn->从svn检出项目,然后填写资源库的位置,完成,然后一直next. 直到项目检出完成后,选择项目, ...