Implementing a Join with a Map-Side Join and the Distributed Cache

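In the previous article on join algorithms in MapReduce, we noted that a join can be implemented on either the map side or the reduce side, but that article only gave a reduce-side example. This article shows how to implement a join on the map side with the help of the distributed cache.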

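1. Introduction

We join a channel type dataset with a set-top box (STB) user dataset and count the number of viewers for each day, each channel category and each minute.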

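2. Datasets

The channel type dataset is the file channelType.csv. Each line holds a channel name and its category code separated by a tab, for example:

世界地理	4

The STB user dataset is the output of the earlier hands-on project "08. Counting invalid user records in TV set-top box data and writing the valid data in compressed format". Each record has the form STB number@date@channel name@start time@end time, for example:

01050908200002327@2012-09-17@CCTV-1 综合@02:21:03@02:21:06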
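To make the join itself concrete before any Hadoop code appears, here is a minimal in-memory sketch. It is not part of the original project: the class `ChannelJoinSketch`, its method names and the ASCII sample values (`Discovery`, `stb01`) are made up for illustration. It only assumes the two record layouts described above, and the fallback category "0" for channels missing from the table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: joins STB records with the channel type table in memory.
final class ChannelJoinSketch {

    // Builds channel name -> category from tab-separated lines, skipping malformed ones.
    static Map<String, String> loadChannelTypes(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
                table.put(fields[0], fields[1]);
            }
        }
        return table;
    }

    // Replaces the channel name of one record with its category ("0" when unknown).
    // Returns null when the record does not have all five fields.
    static String joinRecord(String record, Map<String, String> table) {
        String[] f = record.split("@");
        if (f.length < 5) {
            return null;
        }
        String category = table.getOrDefault(f[2], "0");
        return f[0] + "@" + f[1] + "@" + category + "@" + f[3] + "@" + f[4];
    }

    public static void main(String[] args) {
        // Made-up sample data in the two layouts described above.
        Map<String, String> table = loadChannelTypes(Arrays.asList("Discovery\t4", "News\t2"));
        System.out.println(joinRecord("stb01@2012-09-17@Discovery@02:21:03@02:21:06", table));
        System.out.println(joinRecord("stb01@2012-09-17@Unknown@02:21:03@02:21:06", table));
    }
}
```

Because the channel table is small, every map task can hold it in memory like this, so the join needs no shuffle of the large dataset.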

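3. Analysis

To meet the requirement, the job is built in the following steps:

1. Write a Mapper class that joins the user data with the channel type data and parses each record into key = channel category + date + minute and value = STB number, then emits the result.

2. Write a Combiner class that merges the Mapper output once before it is passed to the Reducer.

3. Write a Reducer class that counts the viewers and uses the MultipleOutputs class to write the per-minute viewer counts to a different file path for each day.

4. Write the driver method run, which configures and runs the MapReduce job.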
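The counting in steps 1 to 3 is a distinct count: an STB that appears several times under the same key must be counted once. The sketch below mimics this in plain Java, without Hadoop. The class `DistinctViewerSketch` and its methods are hypothetical. It illustrates why merging partial sets in a combiner does not change the final count: set union is associative and idempotent.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the distinct-viewer count with a combiner-like step.
final class DistinctViewerSketch {

    // "Map" + "combine" for one input split: group {key, stbNum} pairs
    // into key -> set of STB numbers, so duplicates collapse early.
    static Map<String, Set<String>> combine(String[][] pairs) {
        Map<String, Set<String>> partial = new HashMap<>();
        for (String[] p : pairs) {
            partial.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
        }
        return partial;
    }

    // "Reduce": union the partial sets per key and report each set's size.
    static Map<String, Integer> reduce(List<Map<String, Set<String>>> partials) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Map<String, Set<String>> partial : partials) {
            for (Map.Entry<String, Set<String>> e : partial.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : merged.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up pairs: key = category@date@minute, value = STB number.
        Map<String, Set<String>> split1 = combine(new String[][] {
            {"0@2012-09-17@02:21", "stbA"}, {"0@2012-09-17@02:21", "stbA"}, {"0@2012-09-17@02:21", "stbB"}});
        Map<String, Set<String>> split2 = combine(new String[][] {
            {"0@2012-09-17@02:21", "stbB"}, {"0@2012-09-17@02:22", "stbC"}});
        System.out.println(reduce(Arrays.asList(split1, split2)));
    }
}
```

In the MapReduce version the sets are carried as MapWritable objects whose keys are the STB numbers, and the size of the merged map plays the role of the set size.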

4. Implementation

1. Write the Mapper and Reducer

package com.buaa;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Hashtable;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @ProjectName CountViewers
 * @PackageName com.buaa
 * @ClassName CountViews
 * @Description Uses a map-side join to count the viewers per day, per channel category and per minute, and writes each day's result to a separate file
 * @Author 刘吉超
 * @Date 2016-06-01 16:12:08
 */
@SuppressWarnings("deprecation")
public class CountViews extends Configured implements Tool {

	/*
	 * Parses the TV user data and joins it with the cached channel type table
	 */
	public static class ViewsMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
		// Lookup table shared by all map() calls: channel name -> channel category
		private Hashtable<String, String> table = new Hashtable<String, String>();

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// Local paths of the files distributed through the cache
			Path[] localPaths = context.getLocalCacheFiles();
			if (localPaths == null || localPaths.length == 0) {
				throw new FileNotFoundException("Distributed cache file not found.");
			}
			// Get the local FileSystem instance
			FileSystem fs = FileSystem.getLocal(context.getConfiguration());
			// Open the cached file; try-with-resources closes the stream and the reader
			try (FSDataInputStream in = fs.open(new Path(localPaths[0].toString()));
					BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
				String infoAddr = null;
				// Read the file line by line
				while (null != (infoAddr = br.readLine())) {
					// Split each line into the array records
					String[] records = infoAddr.split("\t");
					// Skip malformed lines
					if (records.length < 2) {
						continue;
					}
					/*
					 * records[0] is the channel name, records[1] is the channel category
					 * 世界地理	4
					 */
					table.put(records[0], records[1]);
				}
			}
		}

		@Override
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			/*
			 * Record format: STB number + "@" + date + "@" + channel name + "@" + start time + "@" + end time
			 * 01050908200002327@2012-09-17@CCTV-1 综合@02:21:03@02:21:06
			 */
			String[] records = value.toString().split("@");
			// Skip records that do not have all five fields
			// (split drops trailing empty fields, so a missing end time shortens the array)
			if (records.length < 5) {
				return;
			}
			// STB number
			String stbNum = records[0];
			// Date
			String date = records[1];
			// Channel name
			String sn = records[2];
			// Start time
			String s = records[3];
			// End time
			String e = records[4];
			// Return immediately if the start time or the end time is empty
			if (StringUtils.isEmpty(s) || StringUtils.isEmpty(e)) {
				return;
			}
			// Compute the list of minutes covered by this record's start and end time
			List<String> list = ParseTime.getTimeSplit(s, e);
			if (list == null) {
				return;
			}
			// Channel category; "0" when the channel is not in the lookup table
			String channelType = StringUtils.defaultString(table.get(sn), "0");
			// Emit one record per minute
			for (String min : list) {
				MapWritable avgnumMap = new MapWritable();
				avgnumMap.put(new Text(stbNum), new Text());
				/*
				 * 0@2012-09-17@02:59
				 */
				context.write(new Text(channelType + "@" + date + "@" + min), avgnumMap);
			}
		}
	}

	/*
	 * Combiner: merges the Mapper output
	 */
	public static class ViewsCombiner extends Reducer<Text, MapWritable, Text, MapWritable> {
		@Override
		protected void reduce(Text key, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException {
			MapWritable avgnumMap = new MapWritable();
			for (MapWritable val : values) {
				// Identical STB numbers collapse into a single map entry
				avgnumMap.putAll(val);
			}
			context.write(key, avgnumMap);
		}
	}

	/*
	 * Counts the viewers per channel category and per minute, then writes the result to a different file path for each date
	 */
	public static class ViewsReduce extends Reducer<Text, MapWritable, Text, Text> {
		// Multiple-output writer
		private MultipleOutputs<Text, Text> mos;

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			mos = new MultipleOutputs<Text, Text>(context);
		}

		@Override
		protected void reduce(Text key, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException {
			// Data format: key = channelType@date@min, value = map(stbNum)
			String[] kv = key.toString().split("@");
			// Channel category
			String channelType = kv[0];
			// Date
			String date = kv[1];
			// Minute
			String min = kv[2];
			MapWritable avgnumMap = new MapWritable();
			for (MapWritable m : values) {
				avgnumMap.putAll(m);
			}
			// The map size is the number of distinct STBs; write it to a file named after the date
			mos.write(new Text(channelType), new Text(min + "\t" + avgnumMap.size()), date.replaceAll("-", ""));
		}

		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
			mos.close();
		}
	}

	@Override
	public int run(String[] arg) throws Exception {
		// Use the configuration supplied by ToolRunner instead of creating a new one
		Configuration conf = getConf();
		// Delete the output path if it already exists
		Path mypath = new Path(arg[1]);
		FileSystem hdfs = mypath.getFileSystem(conf);
		if (hdfs.isDirectory(mypath)) {
			hdfs.delete(mypath, true);
		}
		Job job = Job.getInstance(conf, "CountViews");
		// Set the main class
		job.setJarByClass(CountViews.class);
		// Input paths: one directory per day under arg[0]
		FileInputFormat.addInputPaths(job, arg[0]+"20120917,"+arg[0]+"20120918,"+arg[0]+
				"20120919,"+arg[0]+"20120920,"+arg[0]+"20120921,"+arg[0]+"20120922,"+arg[0]+"20120923");
		// Output path
		FileOutputFormat.setOutputPath(job, new Path(arg[1]));
		// Use LazyOutputFormat so that no empty part-r-00000 file is created
		LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
		// Mapper
		job.setMapperClass(ViewsMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(MapWritable.class);
		// Combiner
		job.setCombinerClass(ViewsCombiner.class);
		// Reducer
		job.setReducerClass(ViewsReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		// Register the distributed cache file
		job.addCacheFile(new URI(arg[2]));
		// Submit the job
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		// Hard-coded arguments: input base directory, output directory, channel type file
		String[] arg = {
				"hdfs://hadoop1:9000/buaa/tv/out/",
				"hdfs://hadoop1:9000/buaa/ctype/",
				"hdfs://hadoop1:9000/buaa/channel/channelType.csv"
			};
		int ec = ToolRunner.run(new Configuration(), new CountViews(), arg);
		System.exit(ec);
	}
}
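The driver lists the seven daily input directories by hand. Since FileInputFormat.addInputPaths takes a comma-separated string, that string could instead be generated from a date range. The helper below is only a sketch of that idea. The class `InputPathsSketch` and the method `inputPaths` are hypothetical names, not part of the original project, and the code assumes the directories are named yyyyMMdd under a common base path, as in the listing above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.StringJoiner;

// Hypothetical helper: builds the comma-separated input path list for a date range.
final class InputPathsSketch {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Returns base + yyyyMMdd for every day from first to last (both inclusive),
    // joined by commas. Returns an empty string when first is after last.
    static String inputPaths(String base, LocalDate first, LocalDate last) {
        StringJoiner joined = new StringJoiner(",");
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            joined.add(base + d.format(DAY));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:9000/buaa/tv/out/";
        System.out.println(inputPaths(base, LocalDate.of(2012, 9, 17), LocalDate.of(2012, 9, 23)));
    }
}
```

The resulting string could then be passed as the second argument of FileInputFormat.addInputPaths in place of the concatenation.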

2. Extract the minutes between the start time and the end time

package com.buaa;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

/**
 * @ProjectName CountViewers
 * @PackageName com.buaa
 * @ClassName ParseTime
 * @Description Splits a start~end time range into the minutes it covers
 * @Author 刘吉超
 * @Date 2016-06-01 16:11:10
 */

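public class ParseTime {
	/**
	 * Extracts the minutes between start and end, both inclusive
	 *
	 * @param start start time, formatted HH:mm:ss
	 * @param end end time, formatted HH:mm:ss
	 * @return list of minutes formatted HH:mm, or null if either time cannot be parsed
	 */
	public static List<String> getTimeSplit(String start, String end) {
		List<String> list = new ArrayList<String>();
		// formatDate renders a minute, parseDate reads the input times
		SimpleDateFormat formatDate = new SimpleDateFormat("HH:mm");
		SimpleDateFormat parseDate = new SimpleDateFormat("HH:mm:ss");
		/*
		 * Start time format: 02:21:03
		 */
		Calendar startCalendar = Calendar.getInstance();
		/*
		 * End time format: 02:21:06
		 */
		Calendar endCalendar = Calendar.getInstance();
		try {
			startCalendar.setTime(parseDate.parse(start));
			endCalendar.setTime(parseDate.parse(end));
		} catch (ParseException e1) {
			return null;
		}
		// Drop the seconds of the start time; otherwise the end minute is skipped
		// when the start has a larger seconds value (e.g. 02:21:50 ~ 02:22:10)
		startCalendar.set(Calendar.SECOND, 0);
		while (startCalendar.compareTo(endCalendar) <= 0) {
			list.add(formatDate.format(startCalendar.getTime()));
			startCalendar.add(Calendar.MINUTE, 1);
		}
		return list;
	}

	public static void main(String[] args) {
		String start = "12:59:24";
		String end = "13:03:45";
		List<String> list1 = getTimeSplit(start, end);
		// getTimeSplit returns null when a time cannot be parsed
		if (list1 == null) {
			return;
		}
		for (String st1 : list1) {
			System.out.println(st1);
		}
	}
}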
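For comparison, the same minute expansion can be sketched with the java.time API (Java 8 or later). This is an alternative, not the author's code: the class `MinuteSplitSketch` and the method `minutesBetween` are hypothetical. Unlike getTimeSplit, the sketch returns an empty list rather than null for unparseable input, and it truncates both times to the minute so the end minute is always included.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical java.time variant of the minute expansion.
final class MinuteSplitSketch {

    private static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("HH:mm");

    // Returns every minute (HH:mm) from start to end, both inclusive.
    // Inputs are expected as HH:mm:ss; unparseable input or end < start gives an empty list.
    static List<String> minutesBetween(String start, String end) {
        LocalTime from;
        LocalTime to;
        try {
            from = LocalTime.parse(start).truncatedTo(ChronoUnit.MINUTES);
            to = LocalTime.parse(end).truncatedTo(ChronoUnit.MINUTES);
        } catch (DateTimeParseException e) {
            return Collections.emptyList();
        }
        // Count the steps up front: LocalTime wraps around at midnight,
        // so a loop of the form "while not after end" would never stop at 23:59.
        long steps = ChronoUnit.MINUTES.between(from, to);
        List<String> minutes = new ArrayList<>();
        for (long i = 0; i <= steps; i++) {
            minutes.add(from.plusMinutes(i).format(MINUTE));
        }
        return minutes;
    }

    public static void main(String[] args) {
        for (String minute : minutesBetween("12:59:24", "13:03:45")) {
            System.out.println(minute);
        }
    }
}
```

For the sample range 12:59:24 to 13:03:45 this yields the five minutes 12:59 through 13:03.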

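5. Results

The job writes one set of files per day under the output directory, named after the date (for example 20120917-r-00000). Each line holds the channel category, the minute and the number of viewers, separated by tabs.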

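If you are interested in the topics covered in this blog, please keep following my later posts. I am 刘超★ljc.

Copyright of this article belongs to the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a prominent link to the original article must be given on the page; otherwise the author reserves the right to pursue legal liability.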
