Preface

This chapter covers how to use Hadoop's distributed cache, which lets you share data that many tasks need across all the nodes of a cluster.

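Preparation

Dataset: ufo, 60,000 records. The dataset consists of a series of UFO sighting records with the fields listed below; the fields in each record are separated by tabs. See http://www.cnblogs.com/cafebabe-yun/p/8679994.html

Sighting date: when the UFO sighting took place
Recorded date: when the sighting was reported
Location: where the sighting took place
Shape: the shape of the UFO
Duration: how long the sighting lasted
Description: a brief description of the sighting

Example: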

19950915 19950915 Redmond, WA 6 min. Young man w/ 2 co-workers witness tiny, distinctly white round disc drifting slowly toward NE. Flew in dir. 90 deg. to winds.
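As a rough illustration of the record layout above, the following sketch splits one tab-separated line into its six fields. The class name `UfoRecordDemo`, the shortened description, and the Shape value `disk` are assumptions made for the example, not values taken from the dataset.

```java
import java.util.Arrays;
import java.util.List;

public class UfoRecordDemo {
    // Field names in the order listed above.
    static final List<String> FIELD_NAMES = Arrays.asList(
            "Sighting date", "Recorded date", "Location",
            "Shape", "Duration", "Description");

    // Split one record on tab characters.
    static String[] parse(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        // Hypothetical record: "disk" is an assumed Shape value.
        String line = "19950915\t19950915\tRedmond, WA\tdisk\t6 min.\t"
                + "Young man w/ 2 co-workers witness tiny white disc.";
        String[] fields = parse(line);
        int n = Math.min(fields.length, FIELD_NAMES.size());
        for (int i = 0; i < n; i++) {
            System.out.println(FIELD_NAMES.get(i) + ": " + fields[i]);
        }
    }
}
```

The Location field sits at index 2, which is the index the mapper later in this chapter reads.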

Data to share: the mapping from state abbreviations to full state names

Data:

AL      Alabama
AK      Alaska
AZ      Arizona
AR      Arkansas
CA      California
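The job code later in this chapter splits each line of this file on a tab, so the file has to be saved with a tab between the abbreviation and the full name. As a hedged sketch, the parser below is a little more tolerant: it accepts any run of whitespace as the separator and skips blank lines. The class name `StateMapDemo` and this tolerant parsing are assumptions for illustration, not the approach used by the job itself.

```java
import java.util.HashMap;
import java.util.Map;

public class StateMapDemo {
    // Build an abbreviation -> full name map from the file content.
    static Map<String, String> parse(String content) {
        Map<String, String> states = new HashMap<String, String>();
        for (String line : content.split("\n")) {
            // Limit 2 keeps multi-word names such as "New Hampshire" intact.
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) {
                states.put(parts[0], parts[1]);
            }
        }
        return states;
    }

    public static void main(String[] args) {
        String content = "AL      Alabama\n\nAK\tAlaska\nNH  New Hampshire\n";
        Map<String, String> states = parse(content);
        System.out.println(states.get("AL"));
        System.out.println(states.get("NH"));
    }
}
```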

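Introduction to Distributed Cache

Purpose: with the distributed cache, common read-only files that map and reduce tasks need can be made available on every node in the cluster.

Using Distributed Cache

Task: use the shared data to replace state abbreviations with full state names

Save the shared data listed above as states.txt
Upload states.txt to Hadoop (HDFS)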

hadoop dfs -put states.txt states.txt

Write UFORecordValidationMapper.java

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

// First mapper in the chain: passes through only well-formed records.
public class UFORecordValidationMapper extends MapReduceBase implements Mapper<LongWritable, Text, LongWritable, Text> {

    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        if (validate(line)) {
            output.collect(key, value);
        }
    }

    // A record is valid when it splits into exactly six tab-separated fields.
    private boolean validate(String str) {
        String[] parts = str.split("\t");
        return parts.length == 6;
    }
}
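One detail of the validation above is worth seeing in isolation: Java's `String.split` with the default limit drops trailing empty strings, so a record whose last field (the description) is empty splits into five parts and is rejected. The sketch below contrasts the default behavior with `split("\t", -1)`, which keeps trailing empty fields. The class name `ValidateDemo` and the sample records are hypothetical.

```java
public class ValidateDemo {
    // Same check as the mapper: default split drops trailing empty fields.
    static boolean validateDefault(String line) {
        return line.split("\t").length == 6;
    }

    // Variant that keeps trailing empty fields by passing a negative limit.
    static boolean validateKeepEmpty(String line) {
        return line.split("\t", -1).length == 6;
    }

    public static void main(String[] args) {
        // Hypothetical records; the second one has an empty description.
        String full = "19950915\t19950915\tRedmond, WA\tdisk\t6 min.\tsome text";
        String emptyTail = "19950915\t19950915\tRedmond, WA\tdisk\t6 min.\t";

        System.out.println(validateDefault(full));        // true
        System.out.println(validateKeepEmpty(full));      // true
        System.out.println(validateDefault(emptyTail));   // false
        System.out.println(validateKeepEmpty(emptyTail)); // true
    }
}
```

Whether records with an empty description should count as valid is a design decision; the mapper in this chapter treats them as invalid.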

Write UFOLocation2.java

import java.io.*;
import java.util.*;
import java.net.*;
import java.util.regex.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

public class UFOLocation2 {

    public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {
        private final static LongWritable one = new LongWritable(1);
        // Two letters followed only by non-letters up to the end of the string.
        private static Pattern locationPattern = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
        // State abbreviation -> full state name, loaded from the cached file.
        private Map<String, String> stateNames;

        // Runs once per task: read the local copy of the cached states.txt.
        @Override
        public void configure(JobConf job) {
            try {
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
                setupStateMap(cacheFiles[0].toString());
            } catch (IOException e) {
                System.err.println("Error reading state file.");
                System.exit(1);
            }
        }

        private void setupStateMap(String fileName) throws IOException {
            Map<String, String> stateCache = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            try {
                String line = null;
                while ((line = reader.readLine()) != null) {
                    String[] splits = line.split("\t");
                    // Skip blank or malformed lines instead of failing on splits[1].
                    if (splits.length >= 2) {
                        stateCache.put(splits[0], splits[1]);
                    }
                }
            } finally {
                // Always release the file handle.
                reader.close();
            }
            stateNames = stateCache;
        }

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            String[] fields = line.split("\t");
            // Index 2 is the Location field; the validation mapper guarantees six fields.
            String location = fields[2].trim();
            if (location.length() >= 2) {
                Matcher matcher = locationPattern.matcher(location);
                if (matcher.find()) {
                    int start = matcher.start();
                    String state = location.substring(start, start + 2);
                    // Emit (full state name, 1) for the reducer to sum.
                    output.collect(new Text(lookupState(state.toUpperCase())), one);
                }
            }
        }

        // Fall back to the abbreviation when no full name is known.
        private String lookupState(String state) {
            String fullName = stateNames.get(state);
            if (fullName == null || "".equals(fullName)) {
                fullName = state;
            }
            return fullName;
        }
    }

    public static void main(String...args) throws Exception {
        Configuration config = new Configuration();
        JobConf conf = new JobConf(config, UFOLocation2.class);
        conf.setJobName("UFOLocation2");
        // Register the HDFS file so it is copied to every task node.
        DistributedCache.addCacheFile(new URI("/user/root/states.txt"), conf);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Chain: validation mapper first, then the location mapper.
        JobConf mapconf1 = new JobConf(false);
        ChainMapper.addMapper(conf, UFORecordValidationMapper.class, LongWritable.class, Text.class, LongWritable.class, Text.class, true, mapconf1);
        JobConf mapconf2 = new JobConf(false);
        ChainMapper.addMapper(conf, MapClass.class, LongWritable.class, Text.class, Text.class, LongWritable.class, true, mapconf2);
        conf.setMapperClass(ChainMapper.class);

        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        FileInputFormat.setInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
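To see what the location pattern and the lookup fallback in the mapper do without running a job, here is a small standalone sketch that needs no Hadoop classes. The class name `LocationDemo`, the helper names, and the sample locations are assumptions for illustration; the regular expression and the fallback rule are the ones used above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocationDemo {
    // Same pattern as the mapper: two letters followed only by non-letters.
    private static final Pattern LOCATION = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");

    // Return the upper-cased two-letter code, or null when nothing matches.
    static String extractState(String location) {
        String trimmed = location.trim();
        Matcher matcher = LOCATION.matcher(trimmed);
        if (!matcher.find()) {
            return null;
        }
        int start = matcher.start();
        return trimmed.substring(start, start + 2).toUpperCase();
    }

    // Return the full name, or the abbreviation itself when it is unknown.
    static String lookupState(Map<String, String> names, String abbr) {
        String fullName = names.get(abbr);
        return (fullName == null || fullName.isEmpty()) ? abbr : fullName;
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<String, String>();
        names.put("AL", "Alabama");
        names.put("CA", "California");

        // Known abbreviation: replaced by the full name.
        System.out.println(lookupState(names, extractState("Los Angeles, ca."))); // California
        // Abbreviation missing from the map: passed through unchanged.
        System.out.println(lookupState(names, extractState("Redmond, WA")));      // WA
    }
}
```

Note that the pattern simply takes the last two letters of the field, so a location that does not end in a state code still yields a two-letter key; the fallback then emits that key as-is.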

Compile the two files above

javac UFORecordValidationMapper.java UFOLocation2.java

Package the compiled classes into a jar

jar cvf ufo.jar UFO*class

Submit the jar to Hadoop and run the job

hadoop jar ufo.jar UFOLocation2 ufo.tsv output

Copy the result from Hadoop to the local file system

hadoop dfs -get output/part-00000 ufo_result.txt

View the result

more ufo_result.txt
