Splittability

 
 

Compressed Data Storage

Keeping data compressed in Hive tables can, in some cases, perform better than uncompressed storage, both in disk usage and in query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

The table 'raw' is stored as a TextFile, which is the default storage format. However, in this case Hadoop cannot split the gzipped file into chunks/blocks and run multiple maps in parallel, which can leave your cluster's map capacity underutilized.


The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
 
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value of io.seqfile.compression.type determines how the compression is performed. RECORD compresses each value individually, while BLOCK buffers up to 1 MB (the default) of records before compressing them together.
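If you also want to choose which codec compresses the SequenceFile, you can set the output compression codec before the INSERT. The following is a minimal sketch, not part of the original example; it assumes the classic Hadoop property name mapred.output.compression.codec (newer releases use mapreduce.output.fileoutputformat.compress.codec) and the built-in Gzip codec class.

-- Hypothetical session: pick the codec explicitly, then write and verify.
-- Property names vary by Hadoop version; these are the older, widely used ones.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

-- Reading back needs no special settings; Hive decompresses transparently.
SELECT COUNT(*) FROM raw_sequence;

Because the data is written as a block-compressed SequenceFile rather than a single .gz text file, later queries against raw_sequence can be split across multiple map tasks.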

LZO Compression

See the LZO Compression page for information about using LZO with Hive.

 
