MapReduce (Part 4)

1. The shuffle process

2. The roles of setup, map, and cleanup in a Mapper

I. The shuffle process

https://blog.csdn.net/techchan/article/details/53405519

[Figure: diagram of the shuffle process]

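II. The roles of setup, map, and cleanup in a Mapper

setup(): the MapReduce framework calls this method exactly once per map task, before any call to map(). Use it for one-time initialization of variables or resources. If that initialization were placed in map() instead, the Mapper would repeat it for every input line it parses, which is redundant and makes the job slow.
map(): called once for each input key/value pair; this is where a record is turned into output key/value pairs. Mapper.run() drives the task: it calls setup(), then map() for every record, then cleanup().
cleanup(): the MapReduce framework calls this method exactly once per map task, after the last call to map(). Use it to release variables or resources. If the release were placed in map() instead, the Mapper would free the resources after every line and have to initialize them again before the next one, which again repeats work and makes the job slow.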
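
The call order described above can be illustrated without a cluster. The following is a minimal sketch in plain Java: MiniMapper and LifecycleDemo are hypothetical stand-ins written for this illustration, not Hadoop classes, and they only mimic the order in which Mapper.run() invokes the three methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's Mapper (not the real class).
abstract class MiniMapper {
    protected void setup() {}
    protected abstract void map(String record);
    protected void cleanup() {}

    // Same call order as a map task: setup once, map per record, cleanup once.
    final void run(List<String> records) {
        setup();
        try {
            for (String record : records) {
                map(record);
            }
        } finally {
            cleanup();
        }
    }
}

class LifecycleDemo {
    // Returns the sequence of lifecycle calls made for the given records.
    static List<String> trace(List<String> records) {
        final List<String> calls = new ArrayList<>();
        MiniMapper mapper = new MiniMapper() {
            @Override protected void setup() { calls.add("setup"); }
            @Override protected void map(String record) { calls.add("map:" + record); }
            @Override protected void cleanup() { calls.add("cleanup"); }
        };
        mapper.run(records);
        return calls;
    }

    public static void main(String[] args) {
        // prints [setup, map:a, map:b, map:c, cleanup]
        System.out.println(trace(Arrays.asList("a", "b", "c")));
    }
}
```

However many records the task reads, "setup" and "cleanup" each appear once in the trace, which is why one-time initialization and release belong there and not in map().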

Code test: the role of setup

package com.huhu.day04;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Word count over an English text that does not count the words listed in
 * the cached file, for example "i" and "have".
 *
 * @author huhu_k
 */
public class TestCleanUpEffect extends Configured implements Tool {

	public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

		private static final IntWritable ONE = new IntWritable(1);

		// Words that are excluded from the count
		private final Set<String> keyWord = new HashSet<>();
		private final Text word = new Text();

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// Runs once per map task: load the cached word list into memory.
			// DistributedCache is deprecated in newer Hadoop releases; it is kept as in the original.
			Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
			if (localCacheFiles == null) {
				return;
			}
			for (Path p : localCacheFiles) {
				try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
					String line;
					while ((line = br.readLine()) != null) {
						for (String s : line.split(" ")) {
							keyWord.add(s);
						}
					}
				}
			}
		}

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			for (String str : value.toString().split(" ")) {
				// Emit each word exactly once, unless it is in the excluded set
				if (!str.isEmpty() && !keyWord.contains(str)) {
					word.set(str);
					context.write(word, ONE);
				}
			}
		}

		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
			// Nothing to release in this example
		}
	}

	public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable v : values) {
				sum += v.get();
			}
			context.write(key, new IntWritable(sum));
		}
	}

	public static void main(String[] args) throws Exception {
		// ToolRunner parses the generic Hadoop options and passes the remaining arguments to run()
		System.exit(ToolRunner.run(new Configuration(), new TestCleanUpEffect(), args));
	}

	@Override
	public int run(String[] other) throws Exception {
		if (other.length != 2) {
			System.err.println("Usage: TestCleanUpEffect <input path> <output path>");
			return 2;
		}
		Configuration con = getConf();
		// Register the word list before the Job copies the configuration
		DistributedCache.addCacheFile(new URI("hdfs://ry-hadoop1:8020/in/advice.txt"), con);
		Job job = Job.getInstance(con);
		job.setJarByClass(TestCleanUpEffect.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(other[0]));
		FileOutputFormat.setOutputPath(job, new Path(other[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}
}
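
The filtering step of this job comes down to an exact-match lookup in a set that was filled once. The sketch below shows only that step in plain Java; StopWordFilterDemo and the sample sentence are made up for illustration and are not part of the job above. Note that a substring test such as String.contains would behave differently: with "i" in the set it would also drop "this".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilterDemo {
    // Keeps every token of the line that is not in the excluded set (exact match).
    static List<String> filter(String line, Set<String> excluded) {
        List<String> kept = new ArrayList<>();
        for (String token : line.split(" ")) {
            if (!token.isEmpty() && !excluded.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // In the real job this set would be filled once, in setup()
        Set<String> excluded = new HashSet<>(Arrays.asList("i", "have"));
        // prints [a, dream]
        System.out.println(filter("i have a dream", excluded));
    }
}
```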

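In setup() I load a second file, advice.txt, as a filter: when the word count runs, any word that appears in advice.txt is skipped and not counted. My input data:

Result:

Testing the role of cleanup in the Mapper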

package com.huhu.day04;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestCleanUpEffect extends Configured implements Tool {

	public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

		// Per-task word counts, accumulated across all calls to map()
		private Map<String, Integer> map = new HashMap<String, Integer>();

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			// Only update the in-memory counts; nothing is written here
			for (String s : value.toString().split(" ")) {
				map.merge(s, 1, Integer::sum);
			}
		}

		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
			// Runs once after the last map() call: emit one record per distinct word
			for (Map.Entry<String, Integer> m : map.entrySet()) {
				context.write(new Text(m.getKey()), new IntWritable(m.getValue()));
			}
		}
	}

	public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			// Each map task contributes one partial count per word, so they still have to be summed
			int sum = 0;
			for (IntWritable v : values) {
				sum += v.get();
			}
			context.write(key, new IntWritable(sum));
		}
	}

	public static void main(String[] args) throws Exception {
		// ToolRunner parses the generic Hadoop options and passes the remaining arguments to run()
		System.exit(ToolRunner.run(new Configuration(), new TestCleanUpEffect(), args));
	}

	@Override
	public int run(String[] other) throws Exception {
		if (other.length != 2) {
			System.err.println("Usage: TestCleanUpEffect <input path> <output path>");
			return 2;
		}
		Job job = Job.getInstance(getConf());
		job.setJarByClass(TestCleanUpEffect.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		// Must match the value type the Mapper actually writes
		job.setMapOutputValueClass(IntWritable.class);
		// Default partitioner
		job.setPartitionerClass(HashPartitioner.class);
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(other[0]));
		FileOutputFormat.setOutputPath(job, new Path(other[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}
}

Here the counting is done on the map side to reduce the load on the reducer, and the results are emitted from the Mapper's cleanup method.
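
To make the effect concrete, here is a minimal sketch of the same idea in plain Java. InMapperCombiningDemo is a hypothetical class written for this illustration, not a Hadoop type: map() only updates an in-memory table, and cleanup() hands back what would be emitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InMapperCombiningDemo {
    // Partial counts kept for the lifetime of one (simulated) map task
    private final Map<String, Integer> counts = new TreeMap<>();

    // Called once per input line: only updates the in-memory table.
    void map(String line) {
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once at the end: these entries are the records that would be emitted.
    Map<String, Integer> cleanup() {
        return counts;
    }

    // Drives one simulated task over the given lines.
    static Map<String, Integer> run(List<String> lines) {
        InMapperCombiningDemo task = new InMapperCombiningDemo();
        for (String line : lines) {
            task.map(line);
        }
        return task.cleanup();
    }

    public static void main(String[] args) {
        // Five input words, but only two records leave the task: prints {a=3, b=2}
        System.out.println(run(Arrays.asList("a b a", "b a")));
    }
}
```

The trade-off is memory: the table holds every distinct key the task has seen, so this approach suits inputs with a moderate number of distinct words.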

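Result

Print each child's grandparents (paternal grandfather, maternal grandfather, paternal grandmother, maternal grandmother). First, a look at the data: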

package com.huhu.day04;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Generation computation: joins the child/parent table with itself to link
 * each child to its grandparents.
 *
 * @author huhu_k
 */
public class ProgenyCount extends Configured implements Tool {

	public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] line = value.toString().split(" ");
			// Skip malformed lines and the header line, which contains "child"
			if (line.length != 2 || value.toString().contains("child")) {
				return;
			}
			String childname = line[0];
			String parentname = line[1];
			// t1: keyed by the child, so the key's parent travels with it
			context.write(new Text(childname), new Text("t1:" + childname + ":" + parentname));
			// t2: keyed by the parent, so the key's child travels with it
			context.write(new Text(parentname), new Text("t2:" + childname + ":" + parentname));
		}
	}

	public static class MyReduce extends Reducer<Text, Text, Text, Text> {

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// Runs once per reduce task: write the header line
			context.write(new Text("child1"), new Text("parent1"));
		}

		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			List<String> child = new ArrayList<>();
			List<String> parent = new ArrayList<>();
			for (Text v : values) {
				String[] parts = v.toString().split(":");
				if ("t1".equals(parts[0])) {
					// The key is the child in this record: collect its parent
					parent.add(parts[2]);
				} else if ("t2".equals(parts[0])) {
					// The key is the parent in this record: collect its child
					child.add(parts[1]);
				}
			}
			// Every child of the key paired with every parent of the key
			// is a grandchild/grandparent pair
			for (String c : child) {
				for (String p : parent) {
					context.write(new Text(c), new Text(p));
				}
			}
		}
	}

	public static void main(String[] args) throws Exception {
		System.exit(ToolRunner.run(new Configuration(), new ProgenyCount(), args));
	}

	@Override
	public int run(String[] other) throws Exception {
		Configuration con = getConf();
		Job job = Job.getInstance(con);
		job.setJarByClass(ProgenyCount.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		// Default partitioner
		// job.setPartitionerClass(HashPartitioner.class);
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		FileInputFormat.addInputPath(job, new Path("hdfs://ry-hadoop1:8020/in/child.txt"));
		Path path = new Path("hdfs://ry-hadoop1:8020/out/mr");
		// Resolve the file system from the path, then remove any previous output
		FileSystem fs = path.getFileSystem(con);
		if (fs.exists(path)) {
			fs.delete(path, true);
		}
		FileOutputFormat.setOutputPath(job, path);
		return job.waitForCompletion(true) ? 0 : 1;
	}
}
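
The join in this job may be easier to follow outside MapReduce. The sketch below reproduces the same idea in plain Java under stated assumptions: GrandparentJoinDemo and its nested Group class are hypothetical helpers, the names Tom, Lucy, Mary, and Ben are made-up sample data, and a TreeMap stands in for the grouping that the shuffle performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GrandparentJoinDemo {
    // Everything the "reducer" sees for one person (one key)
    static final class Group {
        final List<String> children = new ArrayList<>();
        final List<String> parents = new ArrayList<>();
    }

    // pairs: each element is {child, parent}. Returns "grandchild grandparent" lines.
    static List<String> join(List<String[]> pairs) {
        Map<String, Group> groups = new TreeMap<>();
        for (String[] pair : pairs) {
            // Like the t1 record: under the child's key, remember the parent
            groups.computeIfAbsent(pair[0], k -> new Group()).parents.add(pair[1]);
            // Like the t2 record: under the parent's key, remember the child
            groups.computeIfAbsent(pair[1], k -> new Group()).children.add(pair[0]);
        }
        List<String> out = new ArrayList<>();
        for (Group g : groups.values()) {
            // Children of the key crossed with parents of the key
            for (String child : g.children) {
                for (String parent : g.parents) {
                    out.add(child + " " + parent);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"Tom", "Lucy"},
                new String[] {"Lucy", "Mary"},
                new String[] {"Lucy", "Ben"});
        // prints [Tom Mary, Tom Ben]
        System.out.println(join(pairs));
    }
}
```

Only the key "Lucy" has both a child and parents, so it is the only group that produces output; keys with children but no parents (or the reverse) contribute nothing, just as in the reducer.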
