1. Overview

  Recently, some readers have asked how to run a MapReduce program on a Hadoop platform configured with HA. For people who have just entered the Hadoop field, this question does come up. If you think about it, anyone with a solid programming background will think of automatic reconnection: client-side failover, which automatically reconnects to the active NameNode, is exactly what lets a MapReduce program keep working on an HA cluster. Today I will walk through how to do this with Hadoop's Java API.

2. Introduction

  The code is attached directly below, with comments inline.

2.1 Java API for Accessing HDFS HA

  The code is as follows:

/**
 *
 */
package cn.hdfs.mr.example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author dengjie
 * @date March 24, 2015
 * @description TODO
 */
public class DFS {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster1");// set the HDFS nameservice cluster1 as the NameNode URI
        conf.set("dfs.nameservices", "cluster1");// declare the HDFS nameservice cluster1
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");// cluster1 has two NameNodes: nna and nns
        conf.set("dfs.namenode.rpc-address.cluster1.nna", "10.211.55.26:9000");// RPC address of nna
        conf.set("dfs.namenode.rpc-address.cluster1.nns", "10.211.55.27:9000");// RPC address of nns
        conf.set("dfs.client.failover.proxy.provider.cluster1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");// client-side automatic failover implementation
        FileSystem fs = null;
        try {
            fs = FileSystem.get(conf);// get the file system object
            FileStatus[] list = fs.listStatus(new Path("/"));// list the status of the root directory
            for (FileStatus file : list) {
                System.out.println(file.getPath().getName());// print each entry name
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fs != null) {
                    fs.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
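
  For clarity, the HA settings are hard-coded here with conf.set(). On a client whose classpath already carries the cluster configuration, the same values would normally come from core-site.xml (for fs.defaultFS) and hdfs-site.xml instead. A sketch of the equivalent hdfs-site.xml entries, assuming the same nameservice and NameNode IDs as in the code above:

<property>
  <name>dfs.nameservices</name>
  <value>cluster1</value>
</property>
<property>
  <name>dfs.ha.namenodes.cluster1</name>
  <value>nna,nns</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster1.nna</name>
  <value>10.211.55.26:9000</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster1.nns</name>
  <value>10.211.55.27:9000</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.cluster1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>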

  Next comes the Java code for running a MapReduce program.

2.2 Java API for Running a MapReduce Program

  Taking WordCount as an example, the code is as follows:

package cn.jpush.hdfs.mr.example;

import java.io.IOException;
import java.util.Random;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import cn.jpush.hdfs.utils.ConfigUtils;

/**
 * @author dengjie
 * @date November 29, 2014
 * @description WordCount is the classic MapReduce example, often called the
 *              Hadoop version of "hello world". It splits the words out of the
 *              input file, shuffles and sorts them (map phase), aggregates the
 *              counts (reduce phase), and finally writes the result to HDFS.
 *              That is the basic flow.
 */
public class WordCount {

    private static Logger log = LoggerFactory.getLogger(WordCount.class);

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        /*
         * Input file: a b b
         *
         * After map:
         *
         * a 1
         *
         * b 1
         *
         * b 1
         */
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());// read the whole line
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());// split the line into words on whitespace
                context.write(word, one);// emit (word, 1) for each token
            }
        }
    }

    /*
     * Before reduce:
     *
     * a 1
     *
     * b 1
     *
     * b 1
     *
     * After reduce:
     *
     * a 1
     *
     * b 2
     */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster1");
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        conf.set("dfs.namenode.rpc-address.cluster1.nna", "10.211.55.26:9000");
        conf.set("dfs.namenode.rpc-address.cluster1.nns", "10.211.55.27:9000");
        conf.set("dfs.client.failover.proxy.provider.cluster1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        long random1 = new Random().nextLong();// randomize the output directory
        log.info("random1 -> " + random1);
        Job job1 = new Job(conf, "word count");
        job1.setJarByClass(WordCount.class);
        job1.setMapperClass(TokenizerMapper.class);// map class
        job1.setCombinerClass(IntSumReducer.class);// combiner class
        job1.setReducerClass(IntSumReducer.class);// reduce class
        job1.setOutputKeyClass(Text.class);// output key type
        job1.setOutputValueClass(IntWritable.class);// output value type
        FileInputFormat.addInputPath(job1, new Path("/home/hdfs/test/hello.txt"));// input path
        FileOutputFormat.setOutputPath(job1, new Path(String.format(ConfigUtils.HDFS.WORDCOUNT_OUT, random1)));// output path
        System.exit(job1.waitForCompletion(true) ? 0 : 1);// exit once the MR job has finished
    }
}
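
  The ConfigUtils class imported above is not shown in the original post. A minimal, hypothetical sketch of what it might look like is given below, assuming WORDCOUNT_OUT is simply a format string for the job output directory (the path pattern is inferred from the output location visible in the log, /output/result/<random number>):

package cn.jpush.hdfs.utils;

/**
 * Hypothetical sketch of the ConfigUtils helper referenced by WordCount.
 * Only the WORDCOUNT_OUT constant is assumed here: a format string into
 * which the random number generated in WordCount.main() is substituted.
 */
public class ConfigUtils {

    public static class HDFS {
        // %d is replaced with the random long generated per run
        public static final String WORDCOUNT_OUT = "/output/result/%d";
    }
}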

3. Results

  Part of the run log is attached below:

[Job.main] - Running job: job_local551164419_0001
2015-03-24 11:52:09 INFO [LocalJobRunner.Thread-12] - OutputCommitter set in config null
2015-03-24 11:52:09 INFO [LocalJobRunner.Thread-12] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - Waiting for map tasks
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] - Starting task: attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO [ProcfsBasedProcessTree.LocalJobRunner Map Task Executor #0] - ProcfsBasedProcessTree currently is supported only on Linux.
2015-03-24 11:52:10 INFO [Task.LocalJobRunner Map Task Executor #0] - Using ResourceCalculatorProcessTree : null
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Processing split: hdfs://cluster1/home/hdfs/test/hello.txt:0+24
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - (EQUATOR) 0 kvi 26214396(104857584)
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - mapreduce.task.io.sort.mb: 100
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - soft limit at 83886080
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - bufstart = 0; bufvoid = 104857600
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - kvstart = 26214396; length = 6553600
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] -
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Starting flush of map output
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Spilling map output
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - bufstart = 0; bufend = 72; bufvoid = 104857600
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - kvstart = 26214396(104857584); kvend = 26214352(104857408); length = 45/6553600
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Finished spill 0
2015-03-24 11:52:10 INFO [Task.LocalJobRunner Map Task Executor #0] - Task:attempt_local551164419_0001_m_000000_0 is done. And is in the process of committing
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] - map
2015-03-24 11:52:10 INFO [Task.LocalJobRunner Map Task Executor #0] - Task 'attempt_local551164419_0001_m_000000_0' done.
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] - Finishing task: attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - map task executor complete.
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - Waiting for reduce tasks
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - Starting task: attempt_local551164419_0001_r_000000_0
2015-03-24 11:52:10 INFO [ProcfsBasedProcessTree.pool-6-thread-1] - ProcfsBasedProcessTree currently is supported only on Linux.
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Using ResourceCalculatorProcessTree : null
2015-03-24 11:52:10 INFO [ReduceTask.pool-6-thread-1] - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1197414
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - MergerManager: memoryLimit=1503238528, maxSingleShuffleLimit=375809632, mergeThreshold=992137472, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2015-03-24 11:52:10 INFO [EventFetcher.EventFetcher for fetching Map Completion Events] - attempt_local551164419_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2015-03-24 11:52:10 INFO [LocalFetcher.localfetcher#1] - localfetcher#1 about to shuffle output of map attempt_local551164419_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
2015-03-24 11:52:10 INFO [InMemoryMapOutput.localfetcher#1] - Read 50 bytes from map-output for attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO [MergeManagerImpl.localfetcher#1] - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50
2015-03-24 11:52:10 INFO [EventFetcher.EventFetcher for fetching Map Completion Events] - EventFetcher is interrupted.. Returning
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Merging 1 sorted segments
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Down to the last merge-pass, with 1 segments left of total size: 46 bytes
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - Merging 1 files, 54 bytes from disk
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - Merging 0 segments, 0 bytes from memory into reduce
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Merging 1 sorted segments
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Down to the last merge-pass, with 1 segments left of total size: 46 bytes
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO [deprecation.pool-6-thread-1] - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Task:attempt_local551164419_0001_r_000000_0 is done. And is in the process of committing
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Task attempt_local551164419_0001_r_000000_0 is allowed to commit now
2015-03-24 11:52:10 INFO [FileOutputCommitter.pool-6-thread-1] - Saved output of task 'attempt_local551164419_0001_r_000000_0' to hdfs://cluster1/output/result/-3636988299559297154/_temporary/0/task_local551164419_0001_r_000000
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - reduce > reduce
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Task 'attempt_local551164419_0001_r_000000_0' done.
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - Finishing task: attempt_local551164419_0001_r_000000_0
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - reduce task executor complete.
2015-03-24 11:52:10 INFO [Job.main] - Job job_local551164419_0001 running in uber mode : false
2015-03-24 11:52:10 INFO [Job.main] - map 100% reduce 100%
2015-03-24 11:52:10 INFO [Job.main] - Job job_local551164419_0001 completed successfully
2015-03-24 11:52:10 INFO [Job.main] - Counters: 35
File System Counters
FILE: Number of bytes read=462
FILE: Number of bytes written=466172
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=48
HDFS: Number of bytes written=24
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=2
Map output records=12
Map output bytes=72
Map output materialized bytes=54
Input split bytes=105
Combine input records=12
Combine output records=6
Reduce input groups=6
Reduce shuffle bytes=54
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
Total committed heap usage (bytes)=514850816
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=24
File Output Format Counters
Bytes Written=24

  The input file is as follows:

a a c v d d
a d d s s x

  The reduce result was shown as a screenshot in the original post and is not reproduced here.
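
  As a sanity check (my own reconstruction, not taken from the original post), counting the words in the input file above should give the output below, which is consistent with the Reduce output records=6 and HDFS bytes written=24 counters in the log:

a	3
c	1
d	4
s	2
v	1
x	1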

4. Summary

  We can verify that the code works by following these steps:

  1. Make sure NNA is in the active state and NNS is in the standby state. Note that all DataNodes should be running normally.
  2. Run the WordCount program and check that it produces the word counts.
  3. If the counts come out as expected, continue to the next step; if not, troubleshoot the error first and then continue.
  4. Kill the NameNode process on the NNA node; the state of NNS should then change from standby to active (a command sketch for this step follows below).
  5. Run the WordCount program again and check whether it still produces the counts; if it does, the code works against the HA cluster.

  That is the whole verification process.
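
  For step 4, assuming the Hadoop scripts are on the PATH and the NameNode IDs are nna and nns as configured above, the failover can be triggered and observed roughly as follows (a sketch, not taken from the original post; automatic failover also requires the ZKFC daemons to be running):

hdfs haadmin -getServiceState nna     # expected: active
hdfs haadmin -getServiceState nns     # expected: standby

# on the NNA host: find and kill the active NameNode process
jps | grep NameNode
kill -9 <pid>

# after automatic failover, the former standby should take over
hdfs haadmin -getServiceState nns     # expected: active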

5. Closing Remarks

  That is all for this post. If you run into any problems while verifying this, you are welcome to join the discussion group or send me an email, and I will do my best to answer. Let's keep learning together!
