Preface

This chapter explains how to use Hadoop's distributed cache, which makes data that tasks need to share available on every node of the cluster.

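Preparation

Dataset: ufo, 60,000 records. The dataset consists of UFO sighting records with the fields listed below, separated by tabs; see http://www.cnblogs.com/cafebabe-yun/p/8679994.html

Sighting date: when the UFO sighting occurred
Recorded date: when the sighting was reported
Location: where the sighting took place
Shape: the shape of the UFO
Duration: how long the sighting lasted
Description: a brief description of the sighting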

Example:

19950915 19950915 Redmond, WA 6 min. Young man w/ 2 co-workers witness tiny, distinctly white round disc drifting slowly toward NE. Flew in dir. 90 deg. to winds.
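The record layout can be sketched in plain Java, without Hadoop. The class name UfoRecordSketch and the parse helper are hypothetical and not part of the original post. The sample line assumes the record above is tab-separated with an empty Shape field, and the split call mirrors the one the validation mapper later in this post uses.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original post: splits one UFO record
// into its six tab-separated fields.
class UfoRecordSketch {
    // Field order as listed in the dataset description.
    static final List<String> FIELD_NAMES = Arrays.asList(
        "Sighting date", "Recorded date", "Location", "Shape", "Duration", "Description");

    // Returns the six fields, or null if the line does not split into exactly six.
    static String[] parse(String line) {
        String[] parts = line.split("\t");
        return parts.length == FIELD_NAMES.size() ? parts : null;
    }

    public static void main(String[] args) {
        // Assumption: tabs placed between fields, Shape left empty as in the example.
        String line = "19950915\t19950915\tRedmond, WA\t\t6 min.\t"
            + "Young man w/ 2 co-workers witness tiny, distinctly white round disc drifting slowly toward NE.";
        String[] fields = parse(line);
        for (int i = 0; i < fields.length; i++) {
            System.out.println(FIELD_NAMES.get(i) + ": " + fields[i]);
        }
    }
}
```

An empty field in the middle of a record, such as Shape here, is kept by split, so the record still has six fields.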

Data to share: the mapping from state abbreviations to full state names

Data:

AL      Alabama
AK      Alaska
AZ      Arizona
AR      Arkansas
CA      California
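As a rough sketch, lines like these can be turned into a lookup table with a few lines of plain Java. StateTableSketch and its parse method are hypothetical names used only for illustration. The sketch assumes a single tab between abbreviation and full name, and it feeds the lines from memory, whereas a real job would read them from the cached file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original post.
class StateTableSketch {
    // Builds abbreviation -> full name from "abbreviation<TAB>full name" lines.
    // Lines that do not split into exactly two parts are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> states = new HashMap<String, String>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                states.put(parts[0].trim(), parts[1].trim());
            }
        }
        return states;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "AL\tAlabama", "AK\tAlaska", "AZ\tArizona", "AR\tArkansas", "CA\tCalifornia");
        Map<String, String> states = parse(lines);
        System.out.println(states.get("AK")); // prints "Alaska"
    }
}
```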

Introduction to Distributed Cache

Purpose: the distributed cache makes common read-only files that map and reduce tasks need available on every node of the cluster.

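Using Distributed Cache

Task: use the shared data to replace state abbreviations with full state names

Save the shared data above as states.txt, with a tab between the abbreviation and the full name
Upload states.txt to Hadoop (HDFS)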

hadoop dfs -put states.txt states.txt

Write UFORecordValidationMapper.java

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

// First mapper in the chain: passes through only well-formed records.
public class UFORecordValidationMapper extends MapReduceBase implements Mapper<LongWritable, Text, LongWritable, Text> {

    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // Emit the record unchanged if it is valid; drop it otherwise.
        if (validate(line)) {
            output.collect(key, value);
        }
    }

    // A record is valid when it splits into exactly six tab-separated fields.
    private boolean validate(String str) {
        String[] parts = str.split("\t");
        return parts.length == 6;
    }
}
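One detail of the validation check is worth seeing in isolation: String.split with a single argument drops trailing empty strings, so a record whose last field (Description) is empty splits into five parts and is rejected. The sketch below reproduces the mapper's check in plain Java and contrasts it with a variant that passes a limit of -1. That variant is an assumption offered for comparison, not something the original post does, and the sample values are made up.

```java
// Hypothetical class, not part of the original post.
class ValidationSketch {
    // Same check as the mapper: split("\t") drops trailing empty strings.
    static boolean validate(String record) {
        return record.split("\t").length == 6;
    }

    // Variant (an assumption, not the original behavior): a limit of -1
    // keeps trailing empty fields, so an empty Description still counts.
    static boolean validateKeepingEmpty(String record) {
        return record.split("\t", -1).length == 6;
    }

    public static void main(String[] args) {
        // Made-up sample records for illustration only.
        String full = "19950915\t19950915\tRedmond, WA\tdisc\t6 min.\tSmall white disc";
        String emptyDescription = "19950915\t19950915\tRedmond, WA\tdisc\t6 min.\t";
        System.out.println(validate(full));                         // true
        System.out.println(validate(emptyDescription));             // false
        System.out.println(validateKeepingEmpty(emptyDescription)); // true
    }
}
```

Whether records with an empty last field should be kept depends on what the job needs; the original check simply discards them.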

Write UFOLocation2.java

import java.io.*;
import java.util.*;
import java.net.*;
import java.util.regex.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

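public class UFOLocation2 {

    // Second mapper in the chain: emits (state name, 1) for each sighting.
    public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {

        private final static LongWritable one = new LongWritable(1);
        // Matches the last two letters of the location, optionally followed by non-letters.
        private static Pattern locationPattern = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
        // State abbreviation -> full state name, loaded from the cached file.
        private Map<String, String> stateNames;

        // Called once per task before any map() call: load the cached states file.
        @Override
        public void configure(JobConf job) {
            try {
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
                if (cacheFiles == null || cacheFiles.length == 0) {
                    throw new IOException("No file found in the distributed cache.");
                }
                setupStateMap(cacheFiles[0].toString());
            } catch (IOException e) {
                System.err.println("Error reading state file.");
                System.exit(1);
            }
        }

        // Reads "abbreviation<TAB>full name" lines from the local copy of the file.
        private void setupStateMap(String fileName) throws IOException {
            Map<String, String> stateCache = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            try {
                String line = null;
                while ((line = reader.readLine()) != null) {
                    String[] splits = line.split("\t");
                    // Skip blank or malformed lines instead of failing on splits[1].
                    if (splits.length >= 2) {
                        stateCache.put(splits[0], splits[1]);
                    }
                }
            } finally {
                // Always release the file handle.
                reader.close();
            }
            stateNames = stateCache;
        }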

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            String[] fields = line.split("\t");
            // Location is the third field; the validation mapper guarantees six fields.
            String location = fields[2].trim();
            if (location.length() >= 2) {
                Matcher matcher = locationPattern.matcher(location);
                if (matcher.find()) {
                    int start = matcher.start();
                    // The first two characters of the match are the state abbreviation.
                    String state = location.substring(start, start + 2);
                    output.collect(new Text(lookupState(state.toUpperCase())), one);
                }
            }
        }

        // Returns the full state name, or the abbreviation itself if it is not in the table.
        private String lookupState(String state) {
            String fullName = stateNames.get(state);
            if (fullName == null || "".equals(fullName)) {
                fullName = state;
            }
            return fullName;
        }
    }

    public static void main(String... args) throws Exception {
        Configuration config = new Configuration();
        JobConf conf = new JobConf(config, UFOLocation2.class);
        conf.setJobName("UFOLocation2");

        // Register the uploaded file with the distributed cache.
        // The path assumes the file was uploaded by user root, whose HDFS home is /user/root.
        DistributedCache.addCacheFile(new URI("/user/root/states.txt"), conf);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Chain the two mappers: validation first, then the state lookup.
        JobConf mapconf1 = new JobConf(false);
        ChainMapper.addMapper(conf, UFORecordValidationMapper.class, LongWritable.class, Text.class, LongWritable.class, Text.class, true, mapconf1);
        JobConf mapconf2 = new JobConf(false);
        ChainMapper.addMapper(conf, MapClass.class, LongWritable.class, Text.class, Text.class, LongWritable.class, true, mapconf2);
        conf.setMapperClass(ChainMapper.class);

        // Sum the per-state counts, both in the combiner and in the reducer.
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);

        // args[0] is the input path, args[1] the output directory.
        FileInputFormat.setInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
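The core of MapClass, extracting a state abbreviation from the Location field and looking it up, can be tried outside Hadoop. The following is a minimal sketch under that assumption: LocationLookupSketch and stateOf are hypothetical names, the regular expression is the one from the code above, and the lookup table is an in-memory map rather than a file from the distributed cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, not part of the original post.
class LocationLookupSketch {
    // Same pattern as MapClass: the last two letters, then optional non-letters.
    private static final Pattern LOCATION_PATTERN =
        Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");

    // Returns the full state name, the upper-cased abbreviation if it is not
    // in the table, or null if the location does not end in two letters.
    static String stateOf(String location, Map<String, String> stateNames) {
        String trimmed = location.trim();
        Matcher matcher = LOCATION_PATTERN.matcher(trimmed);
        if (!matcher.find()) {
            return null;
        }
        int start = matcher.start();
        String abbreviation = trimmed.substring(start, start + 2).toUpperCase();
        String fullName = stateNames.get(abbreviation);
        return (fullName == null || fullName.isEmpty()) ? abbreviation : fullName;
    }

    public static void main(String[] args) {
        Map<String, String> stateNames = new HashMap<String, String>();
        stateNames.put("AK", "Alaska");
        stateNames.put("AZ", "Arizona");
        System.out.println(stateOf("Anchorage, AK", stateNames)); // prints "Alaska"
        System.out.println(stateOf("Redmond, WA", stateNames));   // prints "WA"
    }
}
```

The pattern is a heuristic: it takes whatever two letters end the field, so a location that does not end in a state abbreviation yields a key that is not a state.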

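Compile the two files (the Hadoop jars must be on the classpath)

javac UFORecordValidationMapper.java UFOLocation2.java

Package the compiled classes into a jar

jar cvf ufo.jar UFO*class

Submit the jar to Hadoop and run the job

hadoop jar ufo.jar UFOLocation2 ufo.tsv output

Copy the result from Hadoop to the local file system

hadoop dfs -get output/part-00000 ufo_result.txt

View the result

more ufo_result.txt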
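With the default text output format, each line of the result file holds a key and a count separated by a tab. As a hedged sketch, the lines could be ranked by count in plain Java as shown below. ResultRankingSketch and rankByCount are hypothetical names, and the sample lines and counts in main are invented placeholders, not actual job output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical class, not part of the original post.
class ResultRankingSketch {
    // Takes "name<TAB>count" lines and returns the names ordered by
    // descending count. Lines without exactly two parts are skipped;
    // a non-numeric count would raise NumberFormatException.
    static List<String> rankByCount(List<String> lines) {
        List<String[]> rows = new ArrayList<String[]>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                rows.add(parts);
            }
        }
        Collections.sort(rows, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1]));
            }
        });
        List<String> names = new ArrayList<String>();
        for (String[] row : rows) {
            names.add(row[0]);
        }
        return names;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, not real results.
        List<String> lines = Arrays.asList("Alaska\t3", "California\t10", "WA\t7");
        System.out.println(rankByCount(lines)); // prints "[California, WA, Alaska]"
    }
}
```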

