(1)以怎样的方式从分片中读取一条记录,每读取一条记录都会调用RecordReader类;

(2)系统默认的RecordReader是LineRecordReader,如TextInputFormat;而SequenceFileInputFormat的RecordReader是SequenceFileRecordReader;
(3)LineRecordReader是用每行的偏移量作为map的key,每行的内容作为map的value;
(4)应用场景:自定义读取每一条记录的方式;自定义读入key的类型,如希望读取的key是文件的路径或名字而不是该行在文件中的偏移量。
 
自定义RecordReader:
(1)继承抽象类RecordReader,实现RecordReader的一个实例;
(2)实现自定义InputFormat类,重写InputFormat中createRecordReader()方法,返回值是自定义的RecordReader实例;
(3)配置job.setInputFormatClass()设置自定义的InputFormat实例;
 
源码见org.apache.mapreduce.lib.input.TextInputFormat类;
 
RecordReader例子:
应用场景:
数据:
1
2
3
4
5
6
7
......
要求:分别计算奇数行与偶数行数据之和
奇数行综合:10+30+50+70=160
偶数行综合:20+40+60=120
 
新建项目TestRecordReader,包com.recordreader,
源代码MyMapper.java:
package com.recordreader;
 
import java.io.IOException;
 
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
 
public class MyMapper extends Mapper {
 
@Override
protected void map(LongWritable key, Text value,Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
context.write(key, value);
}
 
}
 
 
源代码MyPartitioner.java:
package com.recordreader;
 
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
 
public class MyPartitioner extends Partitioner {
 
@Override
public int getPartition(LongWritable key, Text value, int numPartitions) {
// TODO Auto-generated method stub
if(key.get() % 2 == 0){
key.set(1);
return 1;
}
else {
key.set(0);
return 0;
}
}
 
}
 
源代码MyReducer.java:
package com.recordreader;
 
import java.io.IOException;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
 
public class MyReducer extends Reducer {
 
@Override
protected void reduce(LongWritable key, Iterable value,Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
int sum = 0;
for(Text val: value){
sum += Integer.parseInt(val.toString());
}
Text write_key = new Text();
IntWritable write_value = new IntWritable();
if(key.get() == 0)
write_key.set("odd:");
else 
write_key.set("even:");
write_value.set(sum);
context.write(write_key, write_value);
}
 
}
 
源代码MyRecordReader.java:
package com.recordreader;
 
import java.io.IOException;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
 
public class MyRecordReader extends RecordReader {
private long start;
private long end;
private long pos;
private FSDataInputStream fin = null;
private LongWritable key = null;
private Text value = null;
private LineReader reader = null;
@Override
public void close() throws IOException {
// TODO Auto-generated method stub
fin.close();
}
 
@Override
public LongWritable getCurrentKey() throws IOException,
InterruptedException {
// TODO Auto-generated method stub
return key;
}
 
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return value;
}
 
@Override
public float getProgress() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return 0;
}
 
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
FileSplit fileSplit = (FileSplit)inputSplit;
start = fileSplit.getStart();
end = start + fileSplit.getLength();
Configuration conf = context.getConfiguration();
Path path = fileSplit.getPath();
FileSystem fs = path.getFileSystem(conf);
fin = fs.open(path);
fin.seek(start);
reader = new LineReader(fin);
pos = 1;
}
 
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
// TODO Auto-generated method stub
if(key == null)
key = new LongWritable();
key.set(pos);
if(value == null)
value = new Text();
if(reader.readLine(value) == 0)
return false;
pos++;
return true;
}
 
}
 
源代码MyFileInputFormat.java:
package com.recordreader;
 
import java.io.IOException;
 
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 
public class MyFileInputFormat extends FileInputFormat {
 
@Override
public RecordReader createRecordReader(InputSplit arg0,
TaskAttemptContext arg1) throws IOException, InterruptedException {
// TODO Auto-generated method stub
return new MyRecordReader();
}
 
@Override
protected boolean isSplitable(JobContext context, Path filename) {
// TODO Auto-generated method stub
return false;
}
 
}
 
源代码TestRecordReader.java:
package com.recordreader;
 
import java.io.IOException;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
 
 
 
public class TestRecordReader {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException{
  Configuration conf = new Configuration();
   String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
   if (otherArgs.length != 2) {
     System.err.println("Usage: wordcount ");
     System.exit(2);
   }
   Job job = new Job(conf, "word count");
   job.setJarByClass(TestRecordReader.class);
   job.setMapperClass(MyMapper.class);
   
   job.setReducerClass(MyReducer.class);
   job.setPartitionerClass(MyPartitioner.class);
   job.setNumReduceTasks(2);
   job.setInputFormatClass(MyFileInputFormat.class);
   
   
   
   FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
   FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
   System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

MapReduce 重要组件——Recordreader组件 [转]的更多相关文章

  1. MapReduce 重要组件——Recordreader组件

    (1)以怎样的方式从分片中读取一条记录,每读取一条记录都会调用RecordReader类: (2)系统默认的RecordReader是LineRecordReader,如TextInputFormat ...

  2. Vue.js学习 Item11 – 组件与组件间的通信

    什么是组件? 组件(Component)是 Vue.js 最强大的功能之一.组件可以扩展 HTML 元素,封装可重用的代码.在较高层面上,组件是自定义元素,Vue.js 的编译器为它添加特殊功能.在有 ...

  3. Vue中父子组件通讯——组件todolist

    一.todolist功能开发 <div id="root"> <div> <input type="text" v-model=& ...

  4. $Django Rest Framework-认证组件,权限组件 知识点回顾choices,on_delete

    一 小知识点回顾 #orm class UserInfo (models.Model): id = models.AutoField (primary_key=True) name = models. ...

  5. Vuejs——(12)组件——动态组件

    版权声明:出处http://blog.csdn.net/qq20004604   目录(?)[+]   本篇资料来于官方文档: http://cn.vuejs.org/guide/components ...

  6. python 全栈开发,Day78(Django组件-forms组件)

    一.Django组件-forms组件 forms组件 django中的Form组件有以下几个功能: 生成HTML标签 验证用户数据(显示错误信息) HTML Form提交保留上次提交数据 初始化页面显 ...

  7. slot是标签的内容扩展,也就是说你用slot就可以在自定义组件时传递给组件内容,组件接收内容并输出

    html 父页面<div id="app"> <register> <span slot="name">{{message. ...

  8. 040——VUE中组件之组件间的数据参props的使用实例操作

    <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8&quo ...

  9. vue02—— 动画、组件、组件之间的数据通信

    一.vue中使用动画 文档:https://cn.vuejs.org/v2/guide/transitions.html 1. Vue 中的过渡动画 <!DOCTYPE html> < ...

随机推荐

  1. FLASH CC 2015 CANVAS (七)总结

    FLASH CC 2015 CANVAS (一至七)确切来说是自己在摸索学习过程中而写.所以定为“开荒教程”. 去年年底转战H5,半年中一直非常忙也不敢用CC来做项目,担心有BUG或者无法实现需求,所 ...

  2. HDU5730 FFT+CDQ分治

    题意:dp[n] = ∑ ( dp[n-i]*a[i] )+a[n], ( 1 <= i < n) cdq分治. 计算出dp[l ~ mid]后,dp[l ~ mid]与a[1 ~ r-l ...

  3. hiho_1054_滑动解锁

    题目大意 智能手机九点屏幕滑动解锁,如果给出某些连接线段,求出经过所有给出线段的合法的滑动解锁手势的总数.题目链接: 滑动解锁 题目分析 首先,尝试求解没有给定线段情况下,所有合法的路径的总数.可以使 ...

  4. phalcon: Windows 下 Phalcon dev-tools 配置 和 Phpstorm中配置Phalcon 代码提示, phalcon tools的使用

    准备: phalcon-devtools包 下载地址: https://github.com/phalcon/phalcon-devtools 解压到wampserver的www目录 (xampp 用 ...

  5. SAP MM模块之批次管理

    1.Batch的定义:Batch is a quantity any drug produced during a given cycle of manufacture. The essence of ...

  6. AOP 之 6.1 AOP基础(拾陆)

    6.1.1  AOP是什么 考虑这样一个问题:需要对系统中的某些业务做日志记录,比如支付系统中的支付业务需要记录支付相关日志,对于支付系统可能相当复杂,比如可能有自己的支付系统,也可能引入第三方支付平 ...

  7. iOS动态部署方案

    转载: iOS动态部署方案 前言 这里讨论的动态部署方案,就是指通过不发版的方式,将新的内容.新的业务流程部署进已发布的App.因为苹果的审核周期比较长,而且苹果的限制比较多,业界在这里也没有特别多的 ...

  8. Python 练习 12

    #!/usr/bin/python # -*- coding: UTF-8 -*- year = int(raw_input('year:\n')) month = int(raw_input('mo ...

  9. python 练习 6

    #!/usr/bin/python # -*- coding: utf-8 -*- from collections import deque from math import log10 def p ...

  10. DSP28377S - ADC学习编程笔记

    DSP28377S -  ADC学习编程笔记 彭会锋 2016-08-04  20:19:52 1 ADC类型导致的配置区别 F28377S的ADC类型是Type 4类型,我的理解是不同类型的ADC采 ...