Clever uses of the Hadoop Mapper class

1. The Mapper class

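The Mapper class has four methods:

(1) protected void setup(Context context)

(2) protected void map(KEYIN key, VALUEIN value, Context context)

(3) protected void cleanup(Context context)

(4) public void run(Context context)

setup() is typically used for one-time initialization, such as loading global files or opening a database connection. cleanup() handles wrap-up work, such as closing files or emitting key/value pairs after all map() calls have finished. map() needs no further introduction.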

The core of the default Mapper.run() method looks like this:

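public void run(Context context) throws IOException, InterruptedException
{
    // One-time initialization for this task
    setup(context);
    // One map() call per input key/value pair
    while (context.nextKeyValue())
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    // One-time wrap-up for this task
    cleanup(context);
}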

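As the code shows, setup() runs first, then the map() processing, and finally cleanup() for the wrap-up work. Note that the framework invokes setup() and cleanup() as callbacks exactly once per task, whereas map() is invoked once per record.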
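
The call order described above can be traced with a small plain-Java sketch. It does not use Hadoop at all: TracingTask and its methods are hypothetical stand-ins that only mimic the shape of run(), so that the once/many/once pattern is visible without a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LifecycleDemo {

    // Hypothetical stand-in for a Mapper: not Hadoop's class, only the same call order.
    static class TracingTask {
        final List<String> calls = new ArrayList<>();

        void setup() { calls.add("setup"); }

        void map(String record) { calls.add("map:" + record); }

        void cleanup() { calls.add("cleanup"); }

        // Mirrors the shape of run(): setup once, map per record, cleanup once.
        void run(List<String> records) {
            setup();
            for (String record : records) {
                map(record);
            }
            cleanup();
        }
    }

    // Runs one task over the given records and returns the recorded call order.
    static List<String> trace(List<String> records) {
        TracingTask task = new TracingTask();
        task.run(records);
        return task.calls;
    }

    public static void main(String[] args) {
        // prints [setup, map:a, map:b, map:c, cleanup]
        System.out.println(trace(Arrays.asList("a", "b", "c")));
    }
}
```

Any state initialized in setup() is therefore available to every map() call, and anything accumulated across map() calls can be emitted in cleanup().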

2. Using setup()

Loading a blacklist in setup() is enough to make the classic WordCount filter out blacklisted words. The full code is as follows:

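import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    // Read as a local file; this assumes blacklist.dat is present in each
    // task's working directory (the original does not say how it gets there).
    static private String blacklistFileName = "blacklist.dat";

    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Set<String> blacklist;

        // Called once per map task: load the blacklist, one word per line.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            blacklist = new TreeSet<String>();
            try (BufferedReader bufferedReader =
                     new BufferedReader(new FileReader(blacklistFileName))) {
                String str;
                while ((str = bufferedReader.readLine()) != null) {
                    blacklist.add(str);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Called once per input line: emit (word, 1) for every non-blacklisted word.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                if (blacklist.contains(word.toString())) {
                    continue;
                }
                context.write(word, one);
            }
        }
    }

    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum the counts for one word.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Deprecated in Hadoop 2 in favor of Job.getInstance(conf).
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}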
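
The filtering step inside map() can be tried out without a cluster. The following is only a sketch in plain Java: BlacklistFilterDemo and keptWords are hypothetical names, and the blacklist is passed in directly instead of being read from blacklist.dat in setup().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class BlacklistFilterDemo {

    // Same tokenizing and filtering as the map() above, but the surviving
    // words are collected in a list instead of being written to a Context.
    static List<String> keptWords(String line, Set<String> blacklist) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            if (!blacklist.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<>(Arrays.asList("the", "and"));
        // prints [cat, dog]
        System.out.println(keptWords("the cat and the dog", blacklist));
    }
}
```

The match is exact and case-sensitive, so "The" would pass a blacklist that only contains "the"; whether to normalize case before the lookup is a design choice the original code leaves open.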

3. Using cleanup()

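The simplest way to find a maximum is a single pass over the file, but real data sets are too large for that to be practical on one machine. In the conventional MapReduce approach, map() emits every record to the reducer, and the reducer computes the maximum. That is clearly not optimal. With a divide-and-conquer approach, there is no need to send all map output to the reducer: each map task can compute its own maximum first and send only that value, which reduces the amount of data transferred.

When should that value be written out? map() is called once for every key/value pair, so with a large input it is called a very large number of times, and writing the value from inside map() would be unwise. The best moment is when the Mapper task finishes. Since cleanup() is called when a Mapper or Reducer task ends, that is where the value can be written. With this in mind, here is the program code: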

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopKApp {
    static final String INPUT_PATH = "hdfs://hadoop:9000/input2";
    static final String OUT_PATH = "hdfs://hadoop:9000/out2";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        final Path outPath = new Path(OUT_PATH);
        // Remove a previous output directory so the job can be rerun.
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }
        // Deprecated in Hadoop 2 in favor of Job.getInstance(conf, name).
        final Job job = new Job(conf, TopKApp.class.getSimpleName());
        job.setJarByClass(TopKApp.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, outPath);
        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, NullWritable> {
        // Largest value seen so far by this map task.
        long max = Long.MIN_VALUE;

        // Each input line is expected to hold one integer; nothing is emitted here.
        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws java.io.IOException, InterruptedException {
            final long temp = Long.parseLong(v1.toString());
            if (temp > max) {
                max = temp;
            }
        }

        // Called once when the map task ends: emit this task's maximum only.
        @Override
        protected void cleanup(Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new LongWritable(max), NullWritable.get());
        }
    }

    static class MyReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        // Largest value seen so far; a global maximum only if the job runs a single reduce task.
        long max = Long.MIN_VALUE;

        // Each key is the maximum of one map task.
        @Override
        protected void reduce(LongWritable k2, Iterable<NullWritable> arg1, Context arg2)
                throws java.io.IOException, InterruptedException {
            final long temp = k2.get();
            if (temp > max) {
                max = temp;
            }
        }

        // Called once when the reduce task ends: emit the overall maximum.
        @Override
        protected void cleanup(Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new LongWritable(max), NullWritable.get());
        }
    }
}
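
The saving from emitting one value per map task can be illustrated with a plain-Java simulation. This is a sketch only: LocalMaxDemo, localMax and globalMax are hypothetical names, and each array stands in for the records of one input split.

```java
import java.util.Arrays;
import java.util.List;

public class LocalMaxDemo {

    // One simulated map task: scan the whole split, then return a single
    // value, which is what cleanup() writes in the job above.
    static long localMax(long[] split) {
        long max = Long.MIN_VALUE;
        for (long value : split) {
            if (value > max) {
                max = value;
            }
        }
        return max;
    }

    // The simulated reduce side: it sees one candidate per split,
    // not one value per record.
    static long globalMax(List<long[]> splits) {
        long max = Long.MIN_VALUE;
        for (long[] split : splits) {
            long candidate = localMax(split);
            if (candidate > max) {
                max = candidate;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<long[]> splits = Arrays.asList(
                new long[] {3, 9, 4},
                new long[] {12, 7},
                new long[] {5});
        // 6 records in total, but only 3 candidates reach the reduce side.
        System.out.println(globalMax(splits)); // prints 12
    }
}
```

The same pattern works for any aggregate that can be merged from partial results, such as a minimum, a sum or a count.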
