Hadoop I/O: File-Based Data Structures in Detail [Storage Strategies for Row-Oriented and Column-Oriented Data]
Its two representative formats are SequenceFile and MapFile.
SequenceFile
Writing a SequenceFile
1) Create a Configuration
2) Get the FileSystem
3) Create the output Path
4) Call SequenceFile.createWriter to obtain a SequenceFile.Writer
5) Call SequenceFile.Writer.append to append records to the file
6) Close the stream
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength returns the current position in the file
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
%hadoop SequenceFileWriteDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
...
[1976] 60 One, two, buckle my shoe
[2021] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
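The createWriter(fs, conf, path, keyClass, valueClass) overload used above is deprecated in Hadoop 2 and later in favor of an option-based overload. A minimal sketch of the equivalent call with that API (the file name numbers.seq is carried over from the example):

// Sketch: option-based writer creation (Hadoop 2+); equivalent to the
// deprecated createWriter(fs, conf, path, keyClass, valueClass) call above.
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path("numbers.seq")),   // output file
    SequenceFile.Writer.keyClass(IntWritable.class),     // key type
    SequenceFile.Writer.valueClass(Text.class));         // value type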
Reading a SequenceFile
1) Create a Configuration
2) Get the FileSystem
3) Create the file Path
4) new a SequenceFile.Reader to read the file
5) Obtain the keyClass and valueClass from the reader
6) Close the stream
Code:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // discover the key and value types from the reader itself
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                // records that sit on a sync point are marked with *
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
The two ReflectionUtils.newInstance calls obtain the key and value types from the reader itself. This means the program can process a SequenceFile containing any key and value types, as long as they implement Writable.
%hadoop SequenceFileReadDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
...
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
seek() must be given the position of a record boundary; seeking to any other position causes the next read to fail. sync(), by contrast, positions the reader at the next sync point after the given position:

reader.seek(359);
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));

reader.seek(360);
reader.next(key, value); // fails with IOException: 360 is not a record boundary

reader.sync(360);
assertThat(reader.getPosition(), is(2021L)); // 2021 is the next sync point
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(59));
Displaying a SequenceFile with the command-line interface
The hadoop fs -text command recognizes a SequenceFile by its magic number and prints its keys and values as text:
%hadoop fs -text numbers.seq | head
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
97 Seven, eight, lay them straight
96 Nine, ten, a big fat hen
95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen
Sorting and merging SequenceFiles
The MapReduce sort example program can sort (and merge) SequenceFiles:
%hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    sort -r 1 \
    -inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
    -outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
    -outKey org.apache.hadoop.io.IntWritable \
    -outValue org.apache.hadoop.io.Text \
    numbers.seq sorted
%hadoop fs -text sorted/part-r-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe
The SequenceFile format
A SequenceFile consists of a header followed by one or more records. The header holds, among other things, the key and value class names, compression details, user-defined metadata, and a sync marker; copies of the sync marker are written at intervals between records (as seen at position 2021 above), so a reader can resynchronize with record boundaries from any position in the file.
SequenceFile compression
A SequenceFile supports three compression types: NONE (the default), RECORD (each value is compressed on its own), and BLOCK (multiple records are compressed together, which usually compresses better). The compression type and codec are recorded in the header, so readers need no extra configuration.
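As a sketch of how the compression type is chosen at write time, building on the option-based API shown earlier (the file name block.seq and the codec choice are illustrative):

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.DefaultCodec;

// Sketch: create a block-compressed SequenceFile writer.
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path("block.seq")),
    SequenceFile.Writer.keyClass(IntWritable.class),
    SequenceFile.Writer.valueClass(Text.class),
    // BLOCK compresses batches of records together and is usually more
    // compact than RECORD, which compresses each value individually.
    SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()));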
MapFile
Writing a MapFile
A MapFile is a sorted SequenceFile with an index that permits lookups by key, so entries must be appended in key order:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // A MapFile key must be WritableComparable and its value Writable:
        // building the key index requires comparing keys.
        IntWritable key = new IntWritable();
        Text value = new Text();
        MapFile.Writer writer = null;
        try {
            writer = new MapFile.Writer(conf, fs, uri,
                key.getClass(), value.getClass());
            for (int i = 0; i < 1024; i++) {
                key.set(i + 1);
                value.set(DATA[i % DATA.length]);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
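A MapFile is in fact a directory containing two SequenceFiles: data, which holds the sorted key-value records, and index, which holds a subset of the keys (every 128th key by default) with their positions in data. Listing the output of the program above should show both files (the listing below is a sketch, not captured output):

%hadoop fs -ls numbers.map/
numbers.map/data
numbers.map/index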
Reading a MapFile
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

public class MapFileReadDemo {

    public static void main(String[] args) throws IOException {
        String uri = "/numbers.map"; // the MapFile directory written above
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        MapFile.Reader reader = new MapFile.Reader(fs, path.toString(), conf);
        WritableComparable key = (WritableComparable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
            ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.println("key = " + key);
            System.out.println("value = " + value);
        }
        IOUtils.closeStream(reader);
    }
}
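Sequential iteration is not the main point of a MapFile, though; random access by key is. A minimal lookup sketch against the reader above (the key 496 is an arbitrary illustration), using MapFile.Reader.get(), whose implementation is examined next:

// Sketch: random-access lookup. get() fills in the passed value object
// and returns it, or returns null if the key is not present.
Text value = new Text();
Writable entry = reader.get(new IntWritable(496), value);
if (entry != null) {
    System.out.println("value for 496 = " + value);
}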
The relevant parts of the MapFile.Reader source show how get() locates a key via seekInternal():

/** Return the value for the named key, or null if none exists. */
public synchronized Writable get(WritableComparable key, Writable val)
    throws IOException {
  if (seek(key)) {              // locate the key
    data.getCurrentValue(val);  // read its value
    return val;
  } else {
    return null;
  }
}

/** Positions the reader at the named key, or if none such exists, at the
 * first entry after the named key.  Returns true iff the named key exists
 * in this map.
 */
public synchronized boolean seek(WritableComparable key) throws IOException {
  return seekInternal(key) == 0;
}

// The one-argument seekInternal(key) overload delegates to seekInternal(key, false).
private synchronized int seekInternal(WritableComparable key,
                                      final boolean before)
    throws IOException {
  readIndex();                            // make sure the index file has been read

  if (seekIndex != -1                     // seeked before
      && seekIndex + 1 < count
      && comparator.compare(key, keys[seekIndex + 1]) < 0 // before next indexed
      && comparator.compare(key, nextKey) >= 0) {         // but after last seeked
    // do nothing
  } else {
    seekIndex = binarySearch(key);
    if (seekIndex < 0)                    // decode the insertion point
      seekIndex = -seekIndex - 2;

    if (seekIndex == -1)                  // belongs before first entry
      seekPosition = firstPosition;       // use beginning of file
    else
      seekPosition = positions[seekIndex]; // else use the indexed position
  }
  data.seek(seekPosition);                // position the data file, then scan forward

  if (nextKey == null)
    nextKey = comparator.newKey();

  // If we're looking for the key before, we need to keep track
  // of the position we got the current key as well as the position
  // of the key before it.
  long prevPosition = -1;
  long curPosition = seekPosition;

  while (data.next(nextKey)) {            // read the next key into nextKey
    int c = comparator.compare(key, nextKey); // compare the search key with it
    if (c <= 0) {                         // at or beyond desired
      if (before && c != 0) {             // the exact key was not found
        if (prevPosition == -1) {
          // We're on the first record of this index block
          // and we've already passed the search key. Therefore
          // we must be at the beginning of the file, so seek
          // to the beginning of this block and return c
          data.seek(curPosition);         // rewind to this block's start
        } else {
          // We have a previous record to back up to
          data.seek(prevPosition);        // back up to the previous record
          data.next(nextKey);
          // now that we've rewound, the search key must be greater than this key
          return 1;
        }
      }
      return c;
    }
    if (before) {                         // remember positions as we advance
      prevPosition = curPosition;
      curPosition = data.getPosition();
    }
  }
  return 1;
}
/**
 * Get the 'value' corresponding to the last read 'key'.
 * @param val : The 'value' to be read.
 * @throws IOException
 */
public synchronized void getCurrentValue(Writable val)
    throws IOException {
  if (val instanceof Configurable) {
    ((Configurable) val).setConf(this.conf);
  }

  // Position stream to 'current' value
  seekToCurrentValue();

  if (!blockCompressed) {
    val.readFields(valIn);

    if (valIn.read() > 0) {
      LOG.info("available bytes: " + valIn.available());
      throw new IOException(val + " read " + (valBuffer.getPosition() - keyLength)
                            + " bytes, should read " +
                            (valBuffer.getLength() - keyLength));
    }
  } else {
    // Get the value
    int valLength = WritableUtils.readVInt(valLenIn);
    val.readFields(valIn);

    // Read another compressed 'value'
    --noBufferedValues;

    // Sanity check
    if ((valLength < 0) && LOG.isDebugEnabled()) {
      LOG.debug(val + " is a zero-length value");
    }
  }
}
The fraction of keys held in the in-memory index is controlled by the io.map.index.skip property. A value of 0 (the default) means no index keys are skipped; a value of 1 means every other key is skipped (so half the keys end up in the index); a value of 2 means two of every three keys are skipped (so a third of the keys remain in the index), and so on. Larger skip values save memory but cost lookup time, since on average more entries must be scanned on disk.
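A sketch of tuning this from client code (the value 7 is an arbitrary illustration; fs is assumed to be an open FileSystem):

Configuration conf = new Configuration();
// Keep one of every 8 index keys in memory (skip 7 of every 8),
// shrinking the in-memory index at the cost of longer scans on disk.
conf.setInt("io.map.index.skip", 7);
MapFile.Reader reader = new MapFile.Reader(fs, "/numbers.map", conf);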
Converting a SequenceFile to a MapFile
A MapFile can be viewed as an indexed, sorted SequenceFile, so it is natural to convert a sorted SequenceFile into one. This is done with the static MapFile.fix() method, which rebuilds the MapFile's index:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;

public class MapFileFixer {

    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        String mapUri = args[0];

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
        Path map = new Path(mapUri);
        Path mapData = new Path(map, MapFile.DATA_FILE_NAME);

        // Get key and value types from the data sequence file
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
        Class keyClass = reader.getKeyClass();
        Class valueClass = reader.getValueClass();
        reader.close();

        // Create the MapFile index file
        long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
        System.out.printf("Created MapFile %s with %d entries\n", map, entries);
    }
}
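Assuming the sorted output produced by the sort job earlier, a typical usage sketch is to rearrange that output into MapFile layout and then run the fixer (the commands are illustrative, not captured output):

%hadoop fs -mv sorted numbers.map
%hadoop fs -mv numbers.map/part-r-00000 numbers.map/data
%hadoop MapFileFixer numbers.map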
Other file formats and row-oriented versus column-oriented structures
Avro data files are, like SequenceFiles and MapFiles, designed for large-scale sequential data processing: compact and splittable. But they can also be used from different programming languages. Data stored in an Avro data file is described by a schema, rather than by compiled Java code (as with SequenceFile, where the types are hard-coded in Java), so Avro is not Java-centric. Avro data files are widely supported across the Hadoop ecosystem, and they are a good default choice for binary storage and transfer.
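A minimal sketch of what "described by a schema" means in practice, using the Avro Java API (the inline Pair schema and the file name pairs.avro are illustrative):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteSketch {

    public static void main(String[] args) throws Exception {
        // The record structure lives in the schema, not in compiled classes.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
            + "{\"name\":\"key\",\"type\":\"int\"},"
            + "{\"name\":\"value\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("key", 1);
        rec.put("value", "One, two, buckle my shoe");

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
            writer.create(schema, new File("pairs.avro")); // schema travels with the file
            writer.append(rec);
        }
    }
}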
SequenceFiles, MapFiles, and Avro data files are all row-oriented formats: the values for each row are stored contiguously in the file.
A column-oriented storage layout allows columns that a query does not access to be skipped. A row store, the model used by relational engines such as Oracle and MySQL, cannot do this: to get any part of a row, the whole row must be fetched. Suppose we want only the second column of a logical table. If the table is stored in a SequenceFile, which is row-oriented, the whole row (one record in the file) is loaded into memory even though only the second column is actually read. We can deserialize just the fields we need and save some processing time, but the cost of reading every byte of the row from disk cannot be avoided.
In a column-oriented layout, only the data for the second column needs to be read into memory. In general, column-oriented formats work better when a query touches a small number of a table's columns; conversely, row-oriented formats work better when many columns of each row are needed at once. Which to choose depends on the access pattern.
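The trade-off can be illustrated with a toy sketch (plain Java, no Hadoop types): laid out row by row, extracting one column still visits every row's data; laid out column by column, one column is a single contiguous range.

public class LayoutSketch {

    public static void main(String[] args) {
        // One logical table, two physical layouts.
        int[][] rows    = { {1, 10, 100}, {2, 20, 200}, {3, 30, 300} };
        int[][] columns = { {1, 2, 3}, {10, 20, 30}, {100, 200, 300} };

        // Row-oriented: collecting column 1 touches every row record.
        for (int[] row : rows) {
            System.out.println(row[1]); // the rest of the row was fetched too
        }

        // Column-oriented: column 1 is stored contiguously and read alone.
        for (int v : columns[1]) {
            System.out.println(v);
        }
    }
}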
Column-oriented formats also need more memory for reading and writing, since the writer buffers a batch of rows before reorganizing it by column, so the file cannot simply be flushed at an arbitrary record boundary. A row-oriented format (such as a SequenceFile or an Avro data file), by contrast, can be read up to its last sync point after an I/O error. It is for this reason that Flume uses row-oriented formats.
The first column-oriented file format in Hadoop was Hive's RCFile, short for Record Columnar File. It has since been superseded by Hive's ORCFile and by Parquet. Parquet is a general-purpose column-oriented format based on Google's Dremel, with broad support across Hadoop components. The Avro project also has a column-oriented format, called Trevni.