MapReduce (Part 4)

1. The shuffle process

2. The roles of setup(), map(), and cleanup() in the Mapper.

I. The shuffle process

https://blog.csdn.net/techchan/article/details/53405519

Here is a diagram of the process (see the article linked above). In short: on the map side, each map task writes its output to a memory buffer, where records are partitioned by reducer and sorted by key before being spilled to disk; the spill files are then merged into a single sorted, partitioned file. On the reduce side, each reduce task fetches its partition from every map task, merge-sorts the fetched segments, and feeds the grouped key/value pairs to reduce().
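
In code, the main hooks a job has into the shuffle are the partitioner, the combiner, and the sort/grouping comparators. A minimal sketch of the corresponding Job settings (ShuffleKnobs is a made-up class name, and the combiner/comparator classes are stock Hadoop classes chosen purely for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ShuffleKnobs {
    public static void configure(Job job) {
        job.setNumReduceTasks(3);                               // number of partitions / reduce tasks
        job.setPartitionerClass(HashPartitioner.class);         // which reducer each key goes to (the default)
        job.setCombinerClass(IntSumReducer.class);              // map-side pre-aggregation during spills and merges
        job.setSortComparatorClass(Text.Comparator.class);      // how keys are ordered in the shuffle
        job.setGroupingComparatorClass(Text.Comparator.class);  // how keys are grouped into one reduce() call
    }
}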

II. The roles of setup(), map(), and cleanup() in the Mapper

  • setup(): called by the MapReduce framework exactly once per map task, before any input is processed. It is the place to do one-time initialization of variables or resources. If that initialization were done inside map(), it would be repeated for every input line the Mapper parses, which is redundant and hurts performance.
  • run(): drives the whole task: it calls setup() once, then calls map() for every input key/value pair, and finally calls cleanup() (see the sketch after this list).
  • cleanup(): called by the framework exactly once, after the map task has processed all of its input. It is the place to release variables or resources. If that release were done inside map(), resources would be freed after every line and re-initialized before the next one, which is again redundant and inefficient.
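
The relationship between the three methods is easiest to see in Mapper.run(), which the framework invokes once per map task. A sketch that paraphrases its default implementation (LifecycleMapper is a made-up name; the exact source differs slightly across Hadoop versions):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Paraphrase of Mapper.run(): setup() once, map() per record, cleanup() once at the end.
public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);                       // one-time initialization
        try {
            while (context.nextKeyValue()) {  // iterate over the split's records
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);                 // one-time release, even if a record fails
        }
    }
}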

Code test: the role of setup()

package com.huhu.day04;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * WordCount over English text, except that words listed in the cached
 * filter file (e.g. "i" and "have") are not counted.
 *
 * @author huhu_k
 */
public class TestCleanUpEffect extends ToolRunner implements Tool {

    private Configuration conf;

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Path[] localCacheFiles;
        // words to exclude from the count, loaded from the distributed cache
        private HashSet<String> keyWord;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // runs once per map task: load the filter file(s) before any map() call
            Configuration conf = context.getConfiguration();
            localCacheFiles = DistributedCache.getLocalCacheFiles(conf);
            keyWord = new HashSet<>();
            for (Path p : localCacheFiles) {
                BufferedReader br = new BufferedReader(new FileReader(p.toString()));
                String word = "";
                while ((word = br.readLine()) != null) {
                    String[] str = word.split(" ");
                    for (String s : str) {
                        keyWord.add(s);
                    }
                }
                br.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split(" ");
            for (String str : line) {
                // only count words that are NOT in the filter set
                if (!keyWord.contains(str)) {
                    context.write(new Text(str), new IntWritable(1));
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // nothing to release in this example
        }
    }

    public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        TestCleanUpEffect t = new TestCleanUpEffect();
        Configuration conf = t.getConf();
        String[] other = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (other.length != 2) {
            System.err.println("Usage: TestCleanUpEffect <in> <out>");
            System.exit(2);
        }
        int run = ToolRunner.run(conf, t, args);
        System.exit(run);
    }

    @Override
    public Configuration getConf() {
        if (conf != null) {
            return conf;
        }
        return new Configuration();
    }

    @Override
    public void setConf(Configuration arg0) {
        this.conf = arg0;
    }

    @Override
    public int run(String[] other) throws Exception {
        Configuration con = getConf();
        // ship the filter file to every map task via the distributed cache
        DistributedCache.addCacheFile(new URI("hdfs://ry-hadoop1:8020/in/advice.txt"), con);

        Job job = Job.getInstance(con);
        job.setJarByClass(TestCleanUpEffect.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(other[0]));
        FileOutputFormat.setOutputPath(job, new Path(other[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}

Here I load a second file, advice.txt, in setup() and use it as a filter: when the WordCount job runs, any word that appears in advice.txt is skipped and not counted. The two data files (the input text and advice.txt) are:

The result:
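
A side note on the cache API used above: DistributedCache is deprecated in newer Hadoop releases, and the same thing can be done through the Job/JobContext methods. A minimal sketch under that assumption (CacheFileMapper is a made-up name; in the driver the file would be registered with job.addCacheFile(new URI("hdfs://ry-hadoop1:8020/in/advice.txt")), and cached files are normally symlinked into the task's working directory under their base name):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final HashSet<String> keyWord = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // URIs that were registered in the driver with job.addCacheFile(...)
        URI[] cacheFiles = context.getCacheFiles();
        // cached files are normally symlinked into the task's working directory
        // under their base name, so they can be opened as local files
        String fileName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String s : line.split(" ")) {
                    keyWord.add(s);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String str : value.toString().split(" ")) {
            if (!keyWord.contains(str)) {
                context.write(new Text(str), new IntWritable(1));
            }
        }
    }
}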

Code test: the role of cleanup() in the Mapper

package com.huhu.day04;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestCleanUpEffect extends ToolRunner implements Tool {

    private Configuration conf;

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // in-mapper aggregation: accumulate counts across all map() calls
        private Map<String, Integer> map = new HashMap<String, Integer>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split(" ");
            for (String s : line) {
                if (map.containsKey(s)) {
                    map.put(s, map.get(s) + 1);
                } else {
                    map.put(s, 1);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // runs once after the last map() call: emit the aggregated counts
            for (Map.Entry<String, Integer> m : map.entrySet()) {
                context.write(new Text(m.getKey()), new IntWritable(m.getValue()));
            }
        }
    }

    public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // nothing to initialize in this example
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // emit each map-side partial count as-is
            // (summing the values here would give one global count per word)
            for (IntWritable v : values) {
                context.write(key, new IntWritable(v.get()));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // nothing to release in this example
        }
    }

    public static void main(String[] args) throws Exception {
        TestCleanUpEffect t = new TestCleanUpEffect();
        Configuration conf = t.getConf();
        String[] other = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (other.length != 2) {
            System.err.println("Usage: TestCleanUpEffect <in> <out>");
            System.exit(2);
        }
        int run = ToolRunner.run(conf, t, args);
        System.exit(run);
    }

    @Override
    public Configuration getConf() {
        if (conf != null) {
            return conf;
        }
        return new Configuration();
    }

    @Override
    public void setConf(Configuration arg0) {
        this.conf = arg0;
    }

    @Override
    public int run(String[] other) throws Exception {
        Configuration con = getConf();

        Job job = Job.getInstance(con);
        job.setJarByClass(TestCleanUpEffect.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // default partitioner
        job.setPartitionerClass(HashPartitioner.class);

        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(other[0]));
        FileOutputFormat.setOutputPath(job, new Path(other[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}

Here the counts are accumulated inside the Mapper to lighten the load on the Reducer, and the aggregated results are emitted from the Mapper's cleanup() method.

The result:
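
This in-mapper combining pattern trades Mapper memory (the HashMap holds every distinct word of the split) for less shuffle traffic. The framework's built-in alternative is a Combiner, which pre-aggregates map output during spills without keeping state in the Mapper. A minimal sketch (IntSumCombiner is a made-up name; summing works whether the Mapper emits a 1 per word occurrence or per-split partial counts):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer that runs on map output before the shuffle;
// registering it with job.setCombinerClass(IntSumCombiner.class) gives
// map-side pre-aggregation without keeping a HashMap in the Mapper.
public class IntSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}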

Print every child's grandparents (paternal grandfather, maternal grandfather, paternal grandmother, maternal grandmother). Let's look at the data first:

package com.huhu.day04;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Generational join: from child-parent records, derive each child's
 * grandparents (grandfathers and grandmothers on both sides).
 *
 * @author huhu_k
 */
public class ProgenyCount extends ToolRunner implements Tool {

    private Configuration conf;

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split(" ");
            // skip the header line and any malformed records
            if (line.length == 2 && !value.toString().contains("child")) {
                String childname = line[0];
                String parentname = line[1];
                // keyed by the child: "t1" records carry this person's parent
                context.write(new Text(childname), new Text("t1:" + childname + ":" + parentname));
                // keyed by the parent: "t2" records carry this person's child
                context.write(new Text(parentname), new Text("t2:" + childname + ":" + parentname));
            }
        }
    }

    public static class MyReduce extends Reducer<Text, Text, Text, Text> {

        boolean flag = true;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // nothing to initialize in this example
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // write a header line once per reduce task
            if (flag) {
                context.write(new Text("child1"), new Text("parent1"));
                flag = false;
            }

            // for the current person (key): collect their children and their parents
            List<String> child = new ArrayList<>();
            List<String> parent = new ArrayList<>();
            for (Text v : values) {
                String line = v.toString();
                System.out.println(line + "**"); // debug: raw tagged value
                if (line.contains("t1")) {
                    // key is the child in this record, so field 2 is one of key's parents
                    parent.add(line.split(":")[2]);
                    System.err.println(line.split(":")[2]); // debug: parent of key
                } else if (line.contains("t2")) {
                    // key is the parent in this record, so field 1 is one of key's children
                    System.out.println(line.split(":")[1]); // debug: child of key
                    child.add(line.split(":")[1]);
                }
            }
            // cross product: every child of key paired with every parent of key
            // is a grandchild-grandparent pair
            for (String c : child) {
                for (String p : parent) {
                    context.write(new Text(c), new Text(p));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ProgenyCount t = new ProgenyCount();
        Configuration conf = t.getConf();
        String[] other = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (other.length != 2) {
            System.err.println("Usage: ProgenyCount <in> <out>");
        }
        int run = ToolRunner.run(conf, t, args);
        System.exit(run);
    }

    @Override
    public Configuration getConf() {
        if (conf != null) {
            return conf;
        }
        return new Configuration();
    }

    @Override
    public void setConf(Configuration arg0) {
        this.conf = arg0;
    }

    @Override
    public int run(String[] other) throws Exception {
        Configuration con = getConf();

        Job job = Job.getInstance(con);
        job.setJarByClass(ProgenyCount.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // input and output paths are hard-coded here rather than taken from the arguments
        FileInputFormat.addInputPath(job, new Path("hdfs://ry-hadoop1:8020/in/child.txt"));
        Path path = new Path("hdfs://ry-hadoop1:8020/out/mr");
        FileSystem fs = FileSystem.get(getConf());
        if (fs.exists(path)) {
            // remove the output directory if it already exists
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
