转自:http://www.cnblogs.com/xiaolong1032/p/4529534.html

一:自定义实现InputFormat

*数据源来自于内存
*1.InputFormat是用于处理各种数据源的,下面是实现InputFormat,数据源是来自于内存.
*1.1 在程序的job.setInputFormatClass(MyselfmemoryInputFormat.class);
*1.2 实现InputFormat,extends InputFormat< , >,实现其中的两个方法,分别是getSplits(..),createRecordReader(..).
*1.3 getSplits(..)返回的是一个java.util.List<T>,List中的每个元素是InputSplit.每个InputSplit对应一个mappper任务.
*1.4 InputSplit是对原始海量数据源的划分,因为我们处理的是海量数据,不划分不行.InputSplit数据的大小完全是我们自己来定的.本例中是在内存中产生数据,然后封装到InputSplit.
*1.5 InputSplit封装的是hadoop数据类型,实现Writable接口.
*1.6 RecordReader读取每个InputSplit中的数据.解析成一个个<k,v>,供map处理.
*1.7 RecordReader有4个核心方法,分别是initalize(..).nextKeyValue(),getCurrentKey(),getCurrentValue().
*1.8 initalize重要性在于是拿到InputSplit和定义临时变量.
*1.9 nexKeyValue(..)该方法的每次调用,可以获得key和value值.
*1.10 当nextKeyValue(..)调用后,紧接着调用getCurrentKey(),getCurrentValue().
*       mapper方法中的run方法调用.

public class MyselInputFormatApp {
private static final String OUT_PATH = "hdfs://hadoop1:9000/out";// 输出路径,reduce作业输出的结果是一个目录
public static void main(String[] args) {
Configuration conf = new Configuration();// 配置对象
try {
FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
fileSystem.delete(new Path(OUT_PATH), true);
Job job = new Job(conf, WordCountApp.class.getSimpleName());// jobName:作业名称
job.setJarByClass(WordCountApp.class); job.setInputFormatClass(MyselfMemoryInputFormat.class);
job.setMapperClass(MyMapper.class);// 指定自定义map类
job.setMapOutputKeyClass(Text.class);// 指定map输出key的类型
job.setMapOutputValueClass(LongWritable.class);// 指定map输出value的类型
job.setReducerClass(MyReducer.class);// 指定自定义Reduce类
job.setOutputKeyClass(Text.class);// 设置Reduce输出key的类型
job.setOutputValueClass(LongWritable.class);// 设置Reduce输出的value类型
FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));// Reduce输出完之后,就会产生一个最终的输出,指定最终输出的位置
job.waitForCompletion(true);// 提交给jobTracker并等待结束
} catch (Exception e) {
e.printStackTrace();
}
} public static class MyMapper extends
Mapper<NullWritable, Text, Text, LongWritable> {
@Override
protected void map(NullWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] splited = line.split("\t");
for (String word : splited) {
context.write(new Text(word), new LongWritable(1));// 把每个单词出现的次数1写出去.
}
}
} public static class MyReducer extends
Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
long count = 0L;
for (LongWritable times : values) {
count += times.get();
}
context.write(key, new LongWritable(count));
}
} /**
* 从内存中产生数据,然后解析成一个个的键值对
*
*/
public static class MyselfMemoryInputFormat extends InputFormat<NullWritable,Text>{ @Override
public List<InputSplit> getSplits(JobContext context)
throws IOException, InterruptedException {
ArrayList<InputSplit> result = new ArrayList<InputSplit>();
result.add(new MemoryInputSplit());
result.add(new MemoryInputSplit());
result.add(new MemoryInputSplit());
return result;
} @Override
public RecordReader<NullWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
return new MemoryRecordReader();
}
} public static class MemoryInputSplit extends InputSplit implements Writable{
int SIZE = 10;
//java中的数组在hadoop中不被支持,所以这里使用hadoop的数组
//在hadoop中使用的是这种数据结构,不能使用java中的数组表示.
ArrayWritable arrayWritable = new ArrayWritable(Text.class);
/**
* 先创建一个java数组类型,然后转化为hadoop的数据类型.
* @throws FileNotFoundException
*/
public MemoryInputSplit() throws FileNotFoundException {
//一个inputSplit供一个map使用,map函数如果要被调用多次的话,意味着InputSplit必须解析出多个键值对
Text[] array = new Text[SIZE];
Random random = new Random();
for(int i=0;i<SIZE;i++){
int nextInt = random.nextInt(999999);
Text text = new Text("Text"+nextInt);
array[i] = text ;
} // FileInputStream fs = new FileInputStream(new File("\\etc\\profile"));//从文件中读取
// 将流中的数据解析出来放到数据结构中.
arrayWritable.set(array);
}
@Override
public long getLength() throws IOException, InterruptedException {
return SIZE;
}
@Override
public String[] getLocations() throws IOException,
InterruptedException {
return new String[]{};
}
public ArrayWritable getValues() {
return arrayWritable;
}
@Override
public void write(DataOutput out) throws IOException {
arrayWritable.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
arrayWritable.readFields(in);
}
} public static class MemoryRecordReader extends RecordReader<NullWritable, Text>{
private Writable[] values = null ;
private Text value = null ;
private int i = 0;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
MemoryInputSplit inputSplit = (MemoryInputSplit)split;
ArrayWritable writables = inputSplit.getValues();
this.values = writables.get();
this.i = 0 ;
} @Override
public boolean nextKeyValue() throws IOException,
InterruptedException {
if(i >= values.length){
return false ;
}
if(null == this.value){
value = new Text();
}
value.set((Text)values[i]);
i++ ;
return true;
} @Override
public NullWritable getCurrentKey() throws IOException,
InterruptedException {
return NullWritable.get();
} @Override
public Text getCurrentValue() throws IOException,
InterruptedException {
return value;
} @Override
public float getProgress() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return 0;
} /**
* 程序结束的时候,关闭
*/
@Override
public void close() throws IOException {
} } }

二:自定义实现OutputFormat

常见的输出类型:TextInputFormat:默认输出格式,key和value中间用tab隔开.
        DBOutputFormat:写出到数据库的.
        SequenceFileFormat:将key,value以Sequence格式输出的.
        SequenceFileAsOutputFormat:SequenceFile以原始二进制的格式输出.
        MapFileOutputFormat:将key和value写入MapFile中.由于MapFile中key是有序的,所以写入的时候必须保证记录是按key值顺序入的.
        MultipleOutputFormat:多文件的一个输出.默认情况下一个reducer产生一个输出,但是有些时候我们想一个reducer产生多个输出,MultipleOutputFormat和MultipleOutputs就可以实现这个功能.
          MultipleOutputFormat:可以自定义输出文件的名称.
          继承MultipleOutputFormat 需要实现
            getBaseRecordWriter():
            generateFileNameForKeyvalue():根据键值确定文件名.

/**
*自定义输出OutputFormat:用于处理各种输出目的地的.
*1.OutputFormat需要写出的键值对是来自于Reducer类.是通过RecordWriter获得的.
*2.RecordWriter(..)中write只有key和value,写到那里去哪?这要通过单独传入输出流来处理.write方法就是把k,v写入到outputStream中的.
*3.RecordWriter类是位于OutputFormat中的.因此,我们自定义OutputFormat必须继承OutputFormat类.那么流对象就必须在getRecordWriter(..)中获得.
*/
public class MySelfOutputFormatApp {
private static final String INPUT_PATH = "hdfs://hadoop1:9000/abd/hello";// 输入路径
private static final String OUT_PATH = "hdfs://hadoop1:9000/out";// 输出路径,reduce作业输出的结果是一个目录
private static final String OUT_FIE_NAME = "/abc";
public static void main(String[] args) {
Configuration conf = new Configuration();
try {
FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
fileSystem.delete(new Path(OUT_PATH), true);
Job job = new Job(conf, WordCountApp.class.getSimpleName());
job.setJarByClass(WordCountApp.class);
FileInputFormat.setInputPaths(job, INPUT_PATH);
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setOutputFormatClass(MySelfTextOutputFormat.class);
job.waitForCompletion(true);
} catch (Exception e) {
e.printStackTrace();
}
} public static class MyMapper extends
Mapper<LongWritable, Text, Text, LongWritable> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] splited = line.split("\t");
for (String word : splited) {
context.write(new Text(word), new LongWritable(1));// 把每个单词出现的次数1写出去.
}
}
} public static class MyReducer extends
Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
long count = 0L;
for (LongWritable times : values) {
count += times.get();
}
context.write(key, new LongWritable(count));
}
}
/**
*自定义输出类型
*/
public static class MySelfTextOutputFormat extends OutputFormat<Text,LongWritable>{
FSDataOutputStream outputStream = null ;
@Override
public RecordWriter<Text, LongWritable> getRecordWriter(
TaskAttemptContext context) throws IOException,
InterruptedException {
try {
FileSystem fileSystem = FileSystem.get(new URI(MySelfOutputFormatApp.OUT_PATH), context.getConfiguration());
//指定的是输出文件的路径
String opath = MySelfOutputFormatApp.OUT_PATH+OUT_FIE_NAME;
outputStream = fileSystem.create(new Path(opath));
} catch (URISyntaxException e) {
e.printStackTrace();
}
return new MySelfRecordWriter(outputStream);
} @Override
public void checkOutputSpecs(JobContext context) throws IOException,
InterruptedException {
} /**
* OutputCommitter:在作业初始化的时候创建一些临时的输出目录,作业的输出目录,管理作业和任务的临时文件的.
* 作业运行过程中,会产生很多的Task,Task在处理的时候也会产生很多的输出.也会创建这个输出目录.
* 当我们的Task或者是作业都运行完成之后,输出目录由OutputCommitter给删了.所以程序在运行结束之后,我们根本看不见任何额外的输出.
* 在程序运行中会产生很多的临时文件,临时文件全交给OutputCommitter处理,真正的输出是RecordWriter(..),我们只需要关注最后的输出就可以了.中间的临时文件就是程序运行时产生的.
*/
@Override
public OutputCommitter getOutputCommitter(TaskAttemptContext context)
throws IOException, InterruptedException {
//提交任务的输出,包括初始化路径,包括在作业完成的时候清理作业,删除临时目录,包括作业和任务的临时目录.
//作业的输出路径应该是一个路径
return new FileOutputCommitter(new Path(MySelfOutputFormatApp.OUT_PATH), context);
}
}
public static class MySelfRecordWriter extends RecordWriter<Text, LongWritable>{
FSDataOutputStream outputStream = null ;
public MySelfRecordWriter(FSDataOutputStream outputStream) {
this.outputStream = outputStream ;
}
@Override
public void write(Text key, LongWritable value) throws IOException,
InterruptedException {
this.outputStream.writeBytes(key.toString());
this.outputStream.writeBytes("\t");
this.outputStream.writeLong(value.get());
}
@Override
public void close(TaskAttemptContext context) throws IOException,
InterruptedException {
this.outputStream.close();
}
}
}

三:输出到多个文件目录中去

/**
*输出到多个文件目录中去
*使用旧api
*/
public class MyMultipleOutputFormatApp {
private static final String INPUT_PATH = "hdfs://hadoop1:9000/abd";// 输入路径
private static final String OUT_PATH = "hdfs://hadoop1:9000/out";// 输出路径,reduce作业输出的结果是一个目录
public static void main(String[] args) {
Configuration conf = new Configuration();
try {
FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
fileSystem.delete(new Path(OUT_PATH), true);
JobConf job = new JobConf(conf, WordCountApp.class);
job.setJarByClass(WordCountApp.class);
FileInputFormat.setInputPaths(job, INPUT_PATH);
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setOutputFormat(MyMutipleFilesTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
JobClient.runJob(job);
} catch (Exception e) {
e.printStackTrace();
}
} public static class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> { @Override
public void map(LongWritable key, Text value,
OutputCollector<Text, LongWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
String[] splited = line.split("\t");
for (String word : splited) {
output.collect(new Text(word), new LongWritable(1));
}
}
} public static class MyReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, LongWritable> {
@Override
public void reduce(Text key, Iterator<LongWritable> values,
OutputCollector<Text, LongWritable> output, Reporter reporter)
throws IOException {
long count = 0L ;
while(values.hasNext()){
LongWritable times = values.next();
count += times.get();
}
output.collect(key, new LongWritable(count));
}
}
public static class MyMutipleFilesTextOutputFormat extends MultipleOutputFormat<Text,LongWritable>{ @Override
protected org.apache.hadoop.mapred.RecordWriter<Text, LongWritable> getBaseRecordWriter(
FileSystem fs, JobConf job, String name, Progressable progress)
throws IOException {
TextOutputFormat<Text, LongWritable> textOutputFormat = new TextOutputFormat<Text,LongWritable>();
return textOutputFormat.getRecordWriter(fs, job, name, progress);
} @Override
protected String generateFileNameForKeyValue(Text key,
LongWritable value, String name) {
String keyString = key.toString();
if(keyString.startsWith("hello")){
return "hello";
}else{
//输出的文件名就是k3的值
return keyString ;
}
} }
}

四:hadoop1.x api写单词计数的例子

/**
*hadoop1.x
*使用旧api写单词计数的例子
*/
public class WordCountApp {
private static final String INPUT_PATH = "hdfs://hadoop1:9000/abd/hello";
private static final String OUT_PATH = "hdfs://hadoop1:9000/out";
public static void main(String[] args) {
try {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.delete(new Path(OUT_PATH),true);
JobConf job = new JobConf(conf, WordCountApp.class);
job.setJarByClass(WordCountApp.class); FileInputFormat.setInputPaths(job, INPUT_PATH);
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class); job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
JobClient.runJob(job);
} catch (IOException e) {
e.printStackTrace();
}
}
public static class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable>{ @Override
public void map(LongWritable key, Text value,
OutputCollector<Text, LongWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
String[] splited = line.split("\t");
for (String word : splited) {
output.collect(new Text(word), new LongWritable(1L));
}
}
} public static class MyReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, LongWritable>{ @Override
public void reduce(Text key, Iterator<LongWritable> values,
OutputCollector<Text, LongWritable> output, Reporter reporter)
throws IOException {
long times = 0L ;
while (values.hasNext()) {
LongWritable longWritable = (LongWritable) values.next();
times += longWritable.get();
}
output.collect(key, new LongWritable(times));
} } }

五:运行时接收命令行参数

/**
*运行时会接收一些命令行的参数
*Tool接口:支持命令行的参数
*命令行执行:
* hadoop jar jar.jar cmd.WordCountApp hdfs://hadoop1:9000/abd/hello hdfs://hadoop1:9000/out
*/
public class WordCountApp extends Configured implements Tool {
private static String INPUT_PATH = null;// 输入路径
private static String OUT_PATH = null;// 输出路径,reduce作业输出的结果是一个目录
@Override
public int run(String[] args) throws Exception {
INPUT_PATH = args[0];
OUT_PATH = args[1];
Configuration conf = getConf();// 配置对象
try {
FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
fileSystem.delete(new Path(OUT_PATH), true);
Job job = new Job(conf, WordCountApp.class.getSimpleName());// jobName:作业名称
job.setJarByClass(WordCountApp.class);
FileInputFormat.setInputPaths(job, INPUT_PATH);// 指定数据的输入
job.setMapperClass(MyMapper.class);// 指定自定义map类
job.setMapOutputKeyClass(Text.class);// 指定map输出key的类型
job.setMapOutputValueClass(LongWritable.class);// 指定map输出value的类型
job.setReducerClass(MyReducer.class);// 指定自定义Reduce类
job.setOutputKeyClass(Text.class);// 设置Reduce输出key的类型
job.setOutputValueClass(LongWritable.class);// 设置Reduce输出的value类型
FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));// Reduce输出完之后,就会产生一个最终的输出,指定最终输出的位置
job.waitForCompletion(true);// 提交给jobTracker并等待结束
} catch (Exception e) {
e.printStackTrace();
}
return 0;
}
public static void main(String[] args) {
try {
ToolRunner.run(new Configuration(), new WordCountApp(),args);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class MyMapper extends
Mapper<LongWritable, Text, Text, LongWritable> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] splited = line.split("\t");
for (String word : splited) {
context.write(new Text(word), new LongWritable(1));// 把每个单词出现的次数1写出去.
}
}
} public static class MyReducer extends
Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
long count = 0L;
for (LongWritable times : values) {
count += times.get();
}
context.write(key, new LongWritable(count));
}
}
}

【转自】自定义InputFormat、OutputFormat的更多相关文章

  1. [Hadoop] - 自定义Mapreduce InputFormat&OutputFormat

    在MR程序的开发过程中,经常会遇到输入数据不是HDFS或者数据输出目的地不是HDFS的,MapReduce的设计已经考虑到这种情况,它为我们提供了两个组建,只需要我们自定义适合的InputFormat ...

  2. 自定义InputFormat和OutputFormat案例

    一.自定义InputFormat InputFormat是输入流,在前面的例子中使用的是文件输入输出流FileInputFormat和FileOutputFormat,而FileInputFormat ...

  3. MapReduce自定义InputFormat和OutputFormat

    一.自定义InputFormat 需求:将多个小文件合并为SequenceFile(存储了多个小文件) 存储格式:文件路径+文件的内容 c:/a.txt I love Beijing c:/b.txt ...

  4. 自定义inputformat和outputformat

    1. 自定义inputFormat 1.1 需求 无论hdfs还是mapreduce,对于小文件都有损效率,实践中,又难免面临处理大量小文件的场景,此时,就需要有相应解决方案 1.2 分析 小文件的优 ...

  5. 【Hadoop离线基础总结】MapReduce自定义InputFormat和OutputFormat案例

    MapReduce自定义InputFormat和OutputFormat案例 自定义InputFormat 合并小文件 需求 无论hdfs还是mapreduce,存放小文件会占用元数据信息,白白浪费内 ...

  6. InputFormat,OutputFormat,InputSplit,RecordRead(一些常见面试题),使用yum安装64位Mysql

    列举出hadoop常用的一些InputFormat InputFormat是用来对我们的输入数据进行格式化的.TextInputFormat是默认的. InputFormat有哪些类型? DBInpu ...

  7. MapReduce自定义InputFormat,RecordReader

    MapReduce默认的InputFormat是TextInputFormat,且key是偏移量,value是文本,自定义InputFormat需要实现FileInputFormat,并重写creat ...

  8. commoncrawl 源码库是用于 Hadoop 的自定义 InputFormat 配送实现

    commoncrawl 源码库是用于 Hadoop 的自定义 InputFormat 配送实现. Common Crawl 提供一个示例程序 BasicArcFileReaderSample.java ...

  9. Hadoop案例(六)小文件处理(自定义InputFormat)

    小文件处理(自定义InputFormat) 1.需求分析 无论hdfs还是mapreduce,对于小文件都有损效率,实践中,又难免面临处理大量小文件的场景,此时,就需要有相应解决方案.将多个小文件合并 ...

随机推荐

  1. 关于SkinTextBox中SkinTxt的事件使用

    SkinTxt的事件中的事件不能像一般的事件那样在属性中直接双击生成,它里面的事件都需要进行一定的转换间接使用,具体用法如下: 1.手动编写事件需要做的操作的方法 public void txtCod ...

  2. ZOJ 4081 Little Sub and Pascal's Triangle 题解

    ZOJ 4081 Little Sub and Pascal's Triangle 题解 题意 求杨辉三角第n行(从1开始计数)有几个奇数. 考察的其实是杨辉--帕斯卡三角的性质,或者说Gould's ...

  3. 859. Kruskal算法求最小生成树(模板)

    给定一个n个点m条边的无向图,图中可能存在重边和自环,边权可能为负数. 求最小生成树的树边权重之和,如果最小生成树不存在则输出impossible. 给定一张边带权的无向图G=(V, E),其中V表示 ...

  4. 国内下载Git的连接地址

    Git国内下载: https://github.com/waylau/git-for-win  

  5. Public Key Retrieval is not allowed

    链接MySQL数据库报错: 数据库连接url中添加对应属性的支持.allowPublicKeyRetrieval=true&useSSL=false url: jdbc:mysql://loc ...

  6. 一步步教你如何在Ubuntu虚拟机中安装QEMU并模拟模拟arm 开发环境(一)uImage u-boot(转)

    初次接触qemu是因为工作的需要,有时候下了班,可能需要在家研究一些东西,因为博主用到arm环境,这时候博主比较小气,不愿花钱买开发板,当然博主在这里给大家的建议是,如果要真正学懂arm构架的相关知识 ...

  7. 解决pjax重复加载js导致事件重复绑定的问题

    个人博客 地址:http://www.wenhaofan.com/article/20180925232057 1.所有js统一在pjax容器外引入 在pjax容器外引入的js只会被引入一次,所以不会 ...

  8. C语言面试题22. 链表中倒数第k个节点

    要求:输入一个链表,输出该链表中倒数第k个节点.为了符合大多数人的习惯,本题从1开始计数,即链表的尾节点是倒数第1个节点.例如,一个链表有6个节点,从头节点开始,它们的值依次是1.2.3.4.5.6. ...

  9. vue使用kkfileview文件预览功能

    1.环境要求: java 1.8+ 2.部署运行: 本机以及虚拟机上运行: 1.从https://gitee.com/kekingcn/file-online-preview/releases地址下载 ...

  10. Codeforces Round #602 (Div. 2, based on Technocup 2020 Elimination Round 3) B Box

    #include<bits/stdc++.h> using namespace std; ]; ]; int main() { int total; cin>>total; w ...