The SequenceFile format
A SequenceFile is a flat file format that Hadoop designed for storing key-value pairs in binary form. A number of solutions for storing small files in HDFS have been built on top of it; their basic idea is to merge many small files into one large file while building an index over the positions of the small files. Such solutions also involve another Hadoop file format, the MapFile. Note that a SequenceFile does not guarantee that its key-value records are stored in any particular key order, and it does not support appending to an existing file.
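The merging idea behind those small-file solutions can be illustrated with a short pure-JDK sketch. This is not the Hadoop SequenceFile/MapFile API; the class and method names below are made up for illustration. Small "files" are appended to one container, and an index maps each name to its (offset, length):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: merge several small "files" into one container
// while recording each one's offset and length in an index, which is
// the idea the MapFile-based small-file solutions build on.
public class SmallFileMerger {
  public static class Entry {
    public final int offset;
    public final int length;
    public Entry(int offset, int length) { this.offset = offset; this.length = length; }
  }

  private final ByteArrayOutputStream container = new ByteArrayOutputStream();
  private final Map<String, Entry> index = new LinkedHashMap<>();

  // Append one small file's bytes and index its position in the container.
  public void add(String name, byte[] data) throws IOException {
    index.put(name, new Entry(container.size(), data.length));
    container.write(data);
  }

  // Look a small file back up through the index.
  public byte[] get(String name) {
    Entry e = index.get(name);
    if (e == null) return null;
    byte[] all = container.toByteArray();
    byte[] out = new byte[e.length];
    System.arraycopy(all, e.offset, out, 0, e.length);
    return out;
  }
}
```

A real solution would persist the container in HDFS and keep the index in a MapFile rather than in memory, but the offset/length bookkeeping is the same.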
In a SequenceFile, each key-value pair is treated as one record (Record). Built on this per-record view, a SequenceFile supports three compression types (SequenceFile.CompressionType):
NONE: records are not compressed;
RECORD: only the value of each record is compressed;
BLOCK: all records in a block are compressed together.
For these three compression types, Hadoop provides three corresponding Writer classes:
SequenceFile.Writer: writes key-value pairs (Records) without compressing anything.
public static class Writer implements java.io.Closeable {
  ...
  // Initialize the Writer
  void init(Path name, Configuration conf, FSDataOutputStream out,
            Class keyClass, Class valClass, boolean compress,
            CompressionCodec codec, Metadata metadata) throws IOException {
    this.conf = conf;
    this.out = out;
    this.keyClass = keyClass;
    this.valClass = valClass;
    this.compress = compress;
    this.codec = codec;
    this.metadata = metadata;
    // Create the uncompressed object serializers
    SerializationFactory serializationFactory = new SerializationFactory(conf);
    this.keySerializer = serializationFactory.getSerializer(keyClass);
    this.keySerializer.open(buffer);
    this.uncompressedValSerializer = serializationFactory.getSerializer(valClass);
    this.uncompressedValSerializer.open(buffer);
    // Create the compressing value serializer when a codec is given
    if (this.codec != null) {
      ReflectionUtils.setConf(this.codec, this.conf);
      this.compressor = CodecPool.getCompressor(this.codec);
      this.deflateFilter = this.codec.createOutputStream(buffer, compressor);
      this.deflateOut = new DataOutputStream(new BufferedOutputStream(deflateFilter));
      this.compressedValSerializer = serializationFactory.getSerializer(valClass);
      this.compressedValSerializer.open(deflateOut);
    }
  }

  // Append a record (the key and value objects still need to be serialized)
  public synchronized void append(Object key, Object val) throws IOException {
    if (key.getClass() != keyClass)
      throw new IOException("wrong key class: " + key.getClass().getName() + " is not " + keyClass);
    if (val.getClass() != valClass)
      throw new IOException("wrong value class: " + val.getClass().getName() + " is not " + valClass);
    buffer.reset();
    // Serialize the key (turn it into a byte array) into the buffer
    keySerializer.serialize(key);
    int keyLength = buffer.getLength();
    if (keyLength < 0)
      throw new IOException("negative length keys not allowed: " + key);
    // For this Writer, compress is set to false at initialization
    if (compress) {
      deflateFilter.resetState();
      compressedValSerializer.serialize(val);
      deflateOut.flush();
      deflateFilter.finish();
    } else {
      // Serialize the value (uncompressed) into the buffer
      uncompressedValSerializer.serialize(val);
    }
    // Write the record to the file stream
    checkAndWriteSync();                      // sync
    out.writeInt(buffer.getLength());         // total record length
    out.writeInt(keyLength);                  // key portion length
    out.write(buffer.getData(), 0, buffer.getLength()); // data
  }

  // Append a record whose key and value are already raw bytes
  public synchronized void appendRaw(byte[] keyData, int keyOffset, int keyLength,
                                     ValueBytes val) throws IOException {
    if (keyLength < 0)
      throw new IOException("negative length keys not allowed: " + keyLength);
    int valLength = val.getSize();
    checkAndWriteSync();
    // Write the key-value pair directly to the file stream
    out.writeInt(keyLength + valLength);      // total record length
    out.writeInt(keyLength);                  // key portion length
    out.write(keyData, keyOffset, keyLength); // key
    val.writeUncompressedBytes(out);          // value
  }
  ...
}
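The record layout that append() and appendRaw() emit can be reproduced with a short pure-JDK sketch. This is illustrative only: it uses plain java.io instead of the Hadoop classes, and it omits the file header and sync markers. Each record is a 4-byte total record length, a 4-byte key length, then the key bytes followed by the value bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch of the uncompressed record framing:
// [total record length][key length][key bytes][value bytes]
public class RecordFraming {
  public static byte[] encode(byte[] key, byte[] value) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeInt(key.length + value.length); // total record length
    out.writeInt(key.length);                // key portion length
    out.write(key);                          // key data
    out.write(value);                        // value data
    return bos.toByteArray();
  }

  public static byte[][] decode(byte[] record) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
    int totalLen = in.readInt();             // key length + value length
    int keyLen = in.readInt();
    byte[] key = new byte[keyLen];
    in.readFully(key);
    byte[] value = new byte[totalLen - keyLen]; // the rest is the value
    in.readFully(value);
    return new byte[][] { key, value };
  }
}
```

Because the key length is stored alongside the total length, a reader can skip straight over a record, or read only keys, without deserializing values.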
SequenceFile.RecordCompressWriter: compresses only the value of each key-value pair (Record) as it is written.
static class RecordCompressWriter extends Writer {
  ...
  public synchronized void append(Object key, Object val) throws IOException {
    if (key.getClass() != keyClass)
      throw new IOException("wrong key class: " + key.getClass().getName() + " is not " + keyClass);
    if (val.getClass() != valClass)
      throw new IOException("wrong value class: " + val.getClass().getName() + " is not " + valClass);
    buffer.reset();
    // Serialize the key (turn it into a byte array) into the buffer
    keySerializer.serialize(key);
    int keyLength = buffer.getLength();
    if (keyLength < 0)
      throw new IOException("negative length keys not allowed: " + key);
    // Serialize and compress the value, appending it to the buffer
    deflateFilter.resetState();
    compressedValSerializer.serialize(val);
    deflateOut.flush();
    deflateFilter.finish();
    // Write the record to the file stream
    checkAndWriteSync();                      // sync
    out.writeInt(buffer.getLength());         // total record length
    out.writeInt(keyLength);                  // key portion length
    out.write(buffer.getData(), 0, buffer.getLength()); // data
  }

  /** Append a raw record (the value bytes are already compressed). */
  public synchronized void appendRaw(byte[] keyData, int keyOffset,
                                     int keyLength, ValueBytes val) throws IOException {
    if (keyLength < 0)
      throw new IOException("negative length keys not allowed: " + keyLength);
    int valLength = val.getSize();
    checkAndWriteSync();                      // sync
    out.writeInt(keyLength + valLength);      // total record length
    out.writeInt(keyLength);                  // key portion length
    out.write(keyData, keyOffset, keyLength); // 'key' data
    val.writeCompressedBytes(out);            // 'value' data
  }
  ...
} // RecordCompressionWriter
SequenceFile.BlockCompressWriter: compresses a batch of key-value pairs (Records) together into one block as they are written.
static class BlockCompressWriter extends Writer {
  ...
  void init(int compressionBlockSize) throws IOException {
    this.compressionBlockSize = compressionBlockSize;
    keySerializer.close();
    keySerializer.open(keyBuffer);
    uncompressedValSerializer.close();
    uncompressedValSerializer.open(valBuffer);
  }

  /** Workhorse to check and write out compressed data/lengths */
  private synchronized void writeBuffer(DataOutputBuffer uncompressedDataBuffer) throws IOException {
    deflateFilter.resetState();
    buffer.reset();
    deflateOut.write(uncompressedDataBuffer.getData(), 0, uncompressedDataBuffer.getLength());
    deflateOut.flush();
    deflateFilter.finish();
    WritableUtils.writeVInt(out, buffer.getLength());
    out.write(buffer.getData(), 0, buffer.getLength());
  }

  /** Compress and flush contents to dfs */
  public synchronized void sync() throws IOException {
    if (noBufferedRecords > 0) {
      super.sync();
      // No. of records
      WritableUtils.writeVInt(out, noBufferedRecords);
      // Write 'keys' and lengths
      writeBuffer(keyLenBuffer);
      writeBuffer(keyBuffer);
      // Write 'values' and lengths
      writeBuffer(valLenBuffer);
      writeBuffer(valBuffer);
      // Flush the file-stream
      out.flush();
      // Reset internal states
      keyLenBuffer.reset();
      keyBuffer.reset();
      valLenBuffer.reset();
      valBuffer.reset();
      noBufferedRecords = 0;
    }
  }

  // Append a record (the key and value objects still need to be serialized)
  public synchronized void append(Object key, Object val) throws IOException {
    if (key.getClass() != keyClass)
      throw new IOException("wrong key class: " + key + " is not " + keyClass);
    if (val.getClass() != valClass)
      throw new IOException("wrong value class: " + val + " is not " + valClass);
    // Serialize the key (to a byte array, uncompressed) into keyBuffer
    int oldKeyLength = keyBuffer.getLength();
    keySerializer.serialize(key);
    int keyLength = keyBuffer.getLength() - oldKeyLength;
    if (keyLength < 0)
      throw new IOException("negative length keys not allowed: " + key);
    WritableUtils.writeVInt(keyLenBuffer, keyLength);
    // Serialize the value (to a byte array, uncompressed) into valBuffer
    int oldValLength = valBuffer.getLength();
    uncompressedValSerializer.serialize(val);
    int valLength = valBuffer.getLength() - oldValLength;
    WritableUtils.writeVInt(valLenBuffer, valLength);
    // Added another key/value pair
    ++noBufferedRecords;
    // Compress and flush?
    int currentBlockSize = keyBuffer.getLength() + valBuffer.getLength();
    // The block is full: compress it as a whole and write it to the file stream
    if (currentBlockSize >= compressionBlockSize) {
      sync();
    }
  }

  /** Append a raw record (key and value are already raw bytes). */
  public synchronized void appendRaw(byte[] keyData, int keyOffset, int keyLength,
                                     ValueBytes val) throws IOException {
    if (keyLength < 0)
      throw new IOException("negative length keys not allowed");
    int valLength = val.getSize();
    // Save key/value data in relevant buffers
    WritableUtils.writeVInt(keyLenBuffer, keyLength);
    keyBuffer.write(keyData, keyOffset, keyLength);
    WritableUtils.writeVInt(valLenBuffer, valLength);
    val.writeUncompressedBytes(valBuffer);
    // Added another key/value pair
    ++noBufferedRecords;
    // Compress and flush?
    int currentBlockSize = keyBuffer.getLength() + valBuffer.getLength();
    if (currentBlockSize >= compressionBlockSize) {
      sync();
    }
  }
  ...
} // BlockCompressionWriter
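The buffer-until-full-then-compress behavior of BlockCompressWriter can be sketched in pure JDK, with java.util.zip standing in for a Hadoop CompressionCodec. The class name and the single combined buffer below are illustrative; the real writer keeps separate key/value/length buffers and writes VInt lengths, which this sketch omits:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Illustrative sketch: buffer records until the accumulated size crosses
// a threshold, then compress the whole buffer in one shot, mirroring the
// currentBlockSize >= compressionBlockSize check in append().
public class BlockBufferSketch {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final int blockSize;
  private int bufferedRecords = 0;

  public BlockBufferSketch(int blockSize) { this.blockSize = blockSize; }

  public void append(byte[] record) throws IOException {
    buffer.write(record);
    bufferedRecords++;
    // Compress and flush once the block threshold is reached
    if (buffer.size() >= blockSize) flushBlock();
  }

  // Compress the whole buffered block at once and reset the buffer,
  // like BlockCompressWriter.sync() does via writeBuffer().
  public void flushBlock() throws IOException {
    if (bufferedRecords == 0) return;
    DeflaterOutputStream deflate = new DeflaterOutputStream(out);
    deflate.write(buffer.toByteArray());
    deflate.finish();
    buffer.reset();
    bufferedRecords = 0;
  }

  public byte[] compressedBytes() { return out.toByteArray(); }
}
```

Compressing many records together is why BLOCK mode usually achieves better ratios than RECORD mode: the compressor sees redundancy across records, not just within one value.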
In the source, the block size compressionBlockSize defaults to 1000000 (bytes) and can also be set through the configuration parameter io.seqfile.compress.blocksize.
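For example, the threshold could be raised in the Hadoop configuration; the property name io.seqfile.compress.blocksize comes from the source above, while the 4000000 value and the choice of file are only an illustration:

```xml
<!-- e.g. in the job or site configuration: buffer ~4 MB before compressing a block -->
<property>
  <name>io.seqfile.compress.blocksize</name>
  <value>4000000</value>
</property>
```

A larger block generally improves the compression ratio at the cost of more memory per open writer and coarser-grained seeks on read.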
Corresponding to these three compression types, there are three on-disk SequenceFile formats:
1). Uncompressed SequenceFile
2). Record-Compressed SequenceFile
3). Block-Compressed SequenceFile