Hadoop I/O File-Based Data Structures in Detail [Storage Strategies for Row- and Column-Oriented Data]
The two representative file-based data structures are SequenceFile and MapFile.
SequenceFile
Writing a SequenceFile
1) Create a Configuration
2) Get the FileSystem
3) Create the output Path
4) Call SequenceFile.createWriter to obtain a SequenceFile.Writer
5) Call SequenceFile.Writer.append to append key-value records
6) Close the stream
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        // getLength() returns the current position in the file
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
%hadoop SequenceFileWriteDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
...
[1976] 60 One, two, buckle my shoe
[2021] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
Reading a SequenceFile
1) Create a Configuration
2) Get the FileSystem
3) Create the input Path
4) Construct a SequenceFile.Reader for reading
5) Obtain the keyClass and valueClass from the reader
6) Close the stream
Code:
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      // Discover the key and value types from the reader itself
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {
        // Mark records that immediately follow a sync point with "*"
        String syncSeen = reader.syncSeen() ? "*" : "";
        System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
        position = reader.getPosition(); // beginning of next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
The key lines are:

Writable key = (Writable)
    ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable)
    ReflectionUtils.newInstance(reader.getValueClass(), conf);

These two lines discover the key and value types from the reader itself, so the program can read a SequenceFile with any key and value types, as long as they implement Writable.
%hadoop SequenceFileReadDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
...
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
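There are two ways to position a reader within a SequenceFile. seek() jumps to a byte offset, which must be a record boundary or the next read fails; sync(position) instead advances to the next sync point at or after position, from which reading can safely resume: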
reader.seek(359);
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));

reader.seek(360);
reader.next(key, value); // fails with IOException

reader.sync(360);
assertThat(reader.getPosition(), is(2021L));
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(59));
Displaying a SequenceFile with the command-line interface
The hadoop fs -text command displays a sequence file in textual form (it inspects the file's magic number to detect the format):
%hadoop fs -text numbers.seq | head
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
97 Seven, eight, lay them straight
96 Nine, ten, a big fat hen
95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen
Sorting and merging SequenceFiles
One way to sort or merge sequence files is with MapReduce, using the bundled sort example:
%hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
sort -r 1 \
-inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq sorted
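The -r 1 option requests a single reducer, so the result is one sorted output file: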
%hadoop fs -text sorted/part-r-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe
The SequenceFile format
A sequence file consists of a header followed by one or more records. The header holds the magic number SEQ, a version byte, the key and value class names, compression details, user metadata, and a sync marker; sync markers are then written every few records, which is what lets a reader resynchronize with record boundaries from an arbitrary position.
SequenceFile compression:
There are three options: no compression, record compression (each value compressed on its own), and block compression (multiple records compressed together, which usually achieves a better ratio).
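A minimal sketch of creating a block-compressed writer with the default codec (the class name CompressedSequenceFileWriteDemo and the single appended record are illustrative, not from the original):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class CompressedSequenceFileWriteDemo {
  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    SequenceFile.Writer writer = null;
    try {
      // CompressionType.BLOCK compresses batches of records together,
      // which usually compresses better than per-record compression
      writer = SequenceFile.createWriter(fs, conf, path,
          IntWritable.class, Text.class, CompressionType.BLOCK);
      writer.append(new IntWritable(1), new Text("One, two, buckle my shoe"));
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}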
MapFile
A MapFile is a sorted SequenceFile with an index that permits lookups by key. Keys must be appended in order, otherwise the writer throws an IOException.
Writing a MapFile
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // A MapFile's keys are WritableComparable and its values Writable;
    // building an index on the keys requires that they can be compared
    IntWritable key = new IntWritable();
    Text value = new Text();
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(conf, fs, uri,
          key.getClass(), value.getClass());
      for (int i = 0; i < 1024; i++) {
        key.set(i + 1);
        value.set(DATA[i % DATA.length]);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
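Running this program produces a MapFile directory containing two sequence files, data and index. By default the index holds every 128th key together with the byte offset of its record in data, which is what makes random lookups possible.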
Reading a MapFile
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

public class MapFileReadDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/numbers.map");
    FileSystem fs = FileSystem.get(URI.create(path.toString()), conf);
    MapFile.Reader reader = new MapFile.Reader(fs, path.toString(), conf);
    // Discover the key and value types from the reader itself
    WritableComparable key = (WritableComparable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);
    // Iterate over all entries in key order
    while (reader.next(key, value)) {
      System.out.println("key = " + key);
      System.out.println("value = " + value);
    }
    IOUtils.closeStream(reader);
  }
}
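Besides sequential iteration, a MapFile supports random lookups by key through get(). A minimal sketch, assuming numbers.map was written by MapFileWriteDemo above (the class name MapFileLookupDemo and the key 496 are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileLookupDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "numbers.map", conf);
    try {
      Text value = new Text();
      // get() seeks close to the key via the in-memory index,
      // then scans forward in the data file
      Writable entry = reader.get(new IntWritable(496), value);
      if (entry != null) {
        System.out.println(value); // "One, two, buckle my shoe"
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}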
Under the hood, get() uses the in-memory index to seek close to the key, then scans forward in the data file. The relevant MapFile.Reader source (comments translated):

/** Return the value for the named key, or null if none exists. */
public synchronized Writable get(WritableComparable key, Writable val)
    throws IOException {
  if (seek(key)) {               // position the reader at the key
    data.getCurrentValue(val);   // read the corresponding value
    return val;
  } else
    return null;
}

/** Positions the reader at the named key, or if none such exists, at the
 * first entry after the named key.  Returns true iff the named key exists
 * in this map. */
public synchronized boolean seek(WritableComparable key) throws IOException {
  return seekInternal(key) == 0;
}

// One-argument overload (elided in the original excerpt) that delegates
// to the two-argument form with before = false
private synchronized int seekInternal(WritableComparable key)
    throws IOException {
  return seekInternal(key, false);
}

private synchronized int seekInternal(WritableComparable key,
                                      final boolean before)
    throws IOException {
  readIndex();                             // make sure the index file is read

  if (seekIndex != -1                                   // seeked before
      && seekIndex+1 < count
      && comparator.compare(key, keys[seekIndex+1]) < 0 // before next indexed
      && comparator.compare(key, nextKey) >= 0) {       // but after last seeked
    // do nothing
  } else {
    seekIndex = binarySearch(key);
    if (seekIndex < 0)                     // decode insertion point
      seekIndex = -seekIndex-2;

    if (seekIndex == -1)                   // belongs before first entry
      seekPosition = firstPosition;        // use beginning of file
    else
      seekPosition = positions[seekIndex]; // else use the indexed position
  }
  data.seek(seekPosition);  // position the data file, then scan forward

  if (nextKey == null)
    nextKey = comparator.newKey();

  // If we're looking for the key before, we need to keep track
  // of the position we got the current key as well as the position
  // of the key before it.
  long prevPosition = -1;
  long curPosition = seekPosition;

  while (data.next(nextKey)) {             // read the next key into nextKey
    int c = comparator.compare(key, nextKey);
    if (c <= 0) {                          // at or beyond desired
      if (before && c != 0) {              // the exact key was not found
        if (prevPosition == -1) {
          // We're on the first record of this index block
          // and we've already passed the search key. Therefore
          // we must be at the beginning of the file, so seek
          // to the beginning of this block and return c
          data.seek(curPosition);
        } else {
          // We have a previous record to back up to
          data.seek(prevPosition);
          data.next(nextKey);
          // now that we've rewound, the search key must be greater than this key
          return 1;
        }
      }
      return c;
    }
    if (before) {                          // remember the scan positions
      prevPosition = curPosition;
      curPosition = data.getPosition();
    }
  }
  return 1;
}
/**
* Get the 'value' corresponding to the last read 'key'.
* @param val : The 'value' to be read.
* @throws IOException
*/
public synchronized void getCurrentValue(Writable val)
throws IOException {
if (val instanceof Configurable) {
((Configurable) val).setConf(this.conf);
}
// Position stream to 'current' value
seekToCurrentValue();
if (!blockCompressed) {
val.readFields(valIn);
if (valIn.read() > 0) {
LOG.info("available bytes: " + valIn.available());
throw new IOException(val+" read "+(valBuffer.getPosition()-keyLength)
+ " bytes, should read " +
(valBuffer.getLength()-keyLength));
}
} else {
// Get the value
int valLength = WritableUtils.readVInt(valLenIn);
val.readFields(valIn);
// Read another compressed 'value'
--noBufferedValues;
// Sanity check
if ((valLength < 0) && LOG.isDebugEnabled()) {
LOG.debug(val + " is a zero-length value");
}
}
}
The io.map.index.skip property controls how many index keys are skipped when the reader loads the index into memory. The default value of 0 means no index keys are skipped; a value of 1 means every other key is skipped (so half the keys end up in the index); a value of 2 means reading one key and skipping the next two (so a third of the keys end up in the index), and so on. Larger skip values save memory but cost lookup time, since on average more entries have to be scanned on disk.
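A minimal sketch of tuning this trade-off at read time, reusing the numbers.map file from the writer example (the snippet assumes the property name described above):

Configuration conf = new Configuration();
// Load only every other index key: this halves the index's memory
// footprint, but lookups scan more of the data file on average.
conf.setInt("io.map.index.skip", 1);
FileSystem fs = FileSystem.get(conf);
MapFile.Reader reader = new MapFile.Reader(fs, "numbers.map", conf);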
Converting a SequenceFile to a MapFile
First sort the sequence file and move the sorted output into place as the data file inside the MapFile directory, then call the static method MapFile.fix(), which rebuilds the MapFile's index:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;

public class MapFileFixer {

  @SuppressWarnings("unchecked")
  public static void main(String[] args) throws Exception {
    String mapUri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
    Path map = new Path(mapUri);
    Path mapData = new Path(map, MapFile.DATA_FILE_NAME);

    // Get key and value types from the data sequence file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
    Class keyClass = reader.getKeyClass();
    Class valueClass = reader.getValueClass();
    reader.close();

    // Create the MapFile index file
    long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
    System.out.printf("Created MapFile %s with %d entries\n", map, entries);
  }
}
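MapFile.fix() only works if the entries in data are already sorted by key (for example, the output of the sort job above, renamed to data inside the MapFile directory). A hypothetical invocation: hadoop MapFileFixer numbers.map.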
Other file formats and column-oriented data structures
Avro data files are, like sequence files and map files, designed for large-scale data processing: compact and splittable. But where a sequence file's structure is defined only by Java code, the data in an Avro file is described by a schema, so Avro files are not Java-centric and can be used from many programming languages. Avro data files are widely supported across the Hadoop ecosystem, which makes them a good default choice for binary storage and interchange.
Sequence files, map files, and Avro data files are all row-oriented formats, meaning that the values for each row are stored contiguously in the file.
A column-oriented layout allows the columns that a query does not access to be skipped, which row-oriented stores (Oracle, MySQL, and the like) cannot do: to get any data you must fetch the whole row. Suppose a logical table is stored in a sequence file and we only want the second column. Because a sequence file is row-oriented, each entire row (one record in the file) is loaded into memory even though only the second column is actually needed. Lazy deserialization can limit the work to just the fields we touch, which saves some processing time, but it cannot avoid the cost of reading every row's bytes from disk.
With a column-oriented layout, only the data of the second column needs to be read into memory. In general, column-oriented formats are more efficient when a query touches a small number of a table's columns; conversely, row-oriented storage works better when many columns of a single row are needed at once. Which to choose depends on the access pattern.
Row-oriented formats such as sequence files and Avro data files can be synced while being written, so after an I/O failure the file can still be read up to the last sync point. It is for this reason that Flume uses row-oriented formats.
The first column-oriented file format in Hadoop was Hive's RCFile (short for Record Columnar File). It has since been superseded by Hive's ORCFile and by Parquet. Parquet is a general-purpose column-oriented format based on Google's Dremel, with broad support across the Hadoop ecosystem. The Avro project also has a column-oriented format called Trevni.