The code below defines an AffairClient class containing three inner classes, which correspond to the Mapper class, the Reducer class, and the driver class that a Hadoop MapReduce program requires.
The remaining methods of AffairClient configure and run the EMR job.
You can adapt the program by adjusting the relevant parameters: for example, modify the map and reduce functions, add a combiner class (see the sketch just below), or change the cluster size.
This sample is a deduplication MapReduce program; see the map and reduce functions for the details.
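For instance, since the reduce function here just writes each distinct key once, it is idempotent, and the same class could plausibly double as a combiner to shrink the map output before the shuffle. A minimal sketch, one extra line in the job setup that appears at the end of this post:

// Hypothetical tweak, not in the original program: reuse the dedup reducer
// as a combiner; safe here because reduce is idempotent.
job.setCombinerClass(AffairReducer.class);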
We created this as a Maven project. Because the Hadoop program runs on AWS EMR, the project needs both the AWS SDK and hadoop-client dependencies:

<!-- aws -->
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.10.26</version>
</dependency>
<!-- hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
</dependency>

Another situation that may come up: the environment running the Java program (possibly some remote server) does not have the aws-java-sdk or hadoop-client jars on its CLASSPATH, so the program can fail with a ClassNotFoundException. To avoid this, add the following under build->plugins in pom.xml so that the dependency jars get bundled into the jar built by mvn package:

<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
            <manifest>
                <mainClass></mainClass>
            </manifest>
        </archive>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>
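With this plugin in place, mvn package typically produces an extra artifact in target/ named along the lines of <artifactId>-<version>-jar-with-dependencies.jar (the exact name depends on your POM coordinates); that self-contained jar is the one to upload. The <mainClass> element is left empty here because the main class is given explicitly on the command line below; filling it in (e.g. with AffairClient) would also allow running the jar via java -jar.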

Before running the program, put the input files under the directory named by the global variable INPUT_DIR; the jar must be placed under JAR_DIR with the name JAR_NAME.
You can build the Maven project into a jar on the server with:
$ mvn clean install
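Suppose the final jar is renamed affair.jar; one way to push it to its S3 location, assuming the AWS CLI is installed and configured (bucketname is the placeholder used throughout this post), is:
$ aws s3 cp affair.jar s3://bucketname/affair/affair.jar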
With affair.jar renamed and uploaded to the corresponding location on S3, run the following command from a directory containing affair.jar to start the AWS EMR MapReduce job:
$ java -cp affair.jar AffairClient
The output in the terminal will look something like this:

log4j:WARN No appenders could be found for logger (com.amazonaws.AmazonWebServiceClient).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
    PENDING AffairClient$AffairJob Affair
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
    PENDING AffairClient$AffairJob Affair
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
    PENDING AffairClient$AffairJob Affair
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
    PENDING AffairClient$AffairJob Affair
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
    PENDING AffairClient$AffairJob Affair
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
    PENDING AffairClient$AffairJob Affair
j-61EJHJXPSR2L RUNNING this job flow runs a mapreduce affair.
    RUNNING AffairClient$AffairJob Affair
j-61EJHJXPSR2L RUNNING this job flow runs a mapreduce affair.
    RUNNING AffairClient$AffairJob Affair
j-61EJHJXPSR2L RUNNING this job flow runs a mapreduce affair.
    RUNNING AffairClient$AffairJob Affair
j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
    COMPLETED AffairClient$AffairJob Affair
j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
    COMPLETED AffairClient$AffairJob Affair
j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
    COMPLETED AffairClient$AffairJob Affair
j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
    COMPLETED AffairClient$AffairJob Affair
j-61EJHJXPSR2L TERMINATED this job flow runs a mapreduce affair.
    COMPLETED AffairClient$AffairJob Affair
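Each poll of getJobFlowStatus (see the code below) prints one line per cluster with its id, state, and name, followed by one indented line per step with the step's state and name, which is why the step state trails the cluster state above.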

If an error occurs (for example, the job can fail when the output directory already exists), you can check the corresponding syslog under EMR in the AWS web management console.
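Because that failure on an existing output directory is common on re-runs, one option is to clear the OUTPUT_DIR prefix from the client before submitting the job flow. Below is a minimal sketch using the same aws-java-sdk dependency; the helper class and its parameters are placeholders of ours, not part of the original program:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class OutputCleaner {
    // Delete every object under the given prefix, e.g. "affair/output/",
    // so that FileOutputFormat does not fail on an existing directory.
    public static void clearPrefix(String accessKey, String secretKey,
                                   String bucket, String prefix) {
        AmazonS3Client s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
        ObjectListing listing = s3.listObjects(bucket, prefix);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                s3.deleteObject(bucket, summary.getKey());
            }
            if (!listing.isTruncated()) {
                break;
            }
            // Results are paginated; fetch the next batch until done.
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}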
The code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.record.compiler.generated.ParseException;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
import com.amazonaws.services.elasticmapreduce.model.Cluster;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.ListStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.ListStepsResult;
import com.amazonaws.services.elasticmapreduce.model.PlacementType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepSummary;
import com.amazonaws.services.elasticmapreduce.model.TerminateJobFlowsRequest;

public class AffairClient {

    private static AmazonElasticMapReduceClient emr;

    // Poll the cluster status every 30 seconds.
    private static final long SLEEP_TIME = 1000 * 30;

    private static final String JAR_DIR = "s3://bucketname/affair/";
    private static final String JAR_NAME = "affair.jar";
    private static final String INPUT_DIR = "s3://bucketname/affair/input/";
    private static final String OUTPUT_DIR = "s3://bucketname/affair/output/";
    private static final String LOG_DIR = "s3://bucketname/affair/log/";
    private static final String JOB_FLOW_NAME = "this job flow runs a mapreduce affair.";
    private static final String AWS_ACCESS_KEY = "YOUR_AWS_ACCESS_KEY";
    private static final String AWS_SECRET_KEY = "YOUR_AWS_SECRET_KEY";

    // Mapper: emit each input line as the key, with an empty value.
    public static class AffairMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value), new Text(""));
        }
    }

    // Reducer: identical keys arrive grouped together, so writing each key
    // exactly once removes the duplicates.
    public static class AffairReducer
            extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    // The Hadoop job definition that actually runs on the EMR cluster.
    public static class AffairJob extends Configured implements Tool {

        public int run(String[] arg0) throws Exception {
            Configuration conf = getConf();
            // One reduce task so the result comes out as a single file.
            conf.set("mapred.reduce.tasks", "" + 1);

            Job job = new Job(conf, "Affair MR job");
            job.setJarByClass(AffairJob.class);
            job.setMapperClass(AffairMapper.class);
            job.setReducerClass(AffairReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            //job.setNumReduceTasks(1);

            FileInputFormat.addInputPath(job, new Path(INPUT_DIR));
            FileOutputFormat.setOutputPath(job, new Path(OUTPUT_DIR));

            job.waitForCompletion(true);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new AffairJob(), args);
            System.exit(exitCode);
        }
    }

    // Client-side entry point: submits the job flow to EMR.
    public static void main(String[] args) throws ParseException {
        try {
            String mainClass = AffairJob.class.getName();
            String stepName = mainClass + " Affair";
            runStep(mainClass, JAR_NAME, stepName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void runStep(String mainClass, String jarName, String stepName)
            throws InterruptedException {
        String jarPath = JAR_DIR + jarName;

        HadoopJarStepConfig hadoopJarStep = new HadoopJarStepConfig(jarPath);
        hadoopJarStep.setMainClass(mainClass);
        hadoopJarStep.setArgs(null);

        StepConfig step = new StepConfig().withName(stepName)
                .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
                .withHadoopJarStep(hadoopJarStep);

        String logUri = LOG_DIR;
        JobFlowInstancesConfig instances = createInstances();

        List<StepConfig> steps = new ArrayList<StepConfig>();
        steps.add(step);

        String jobFlowId = createJobFlow(JOB_FLOW_NAME, logUri, instances, steps);
        terminateJobFlow(jobFlowId);
    }

    private static void terminateJobFlow(String jobFlowId) {
        TerminateJobFlowsRequest request = new TerminateJobFlowsRequest().withJobFlowIds(jobFlowId);
        emr.terminateJobFlows(request);
    }

    // Launches the job flow, then polls its status until it reaches a final state.
    private static String createJobFlow(String jobFlowName, String logUri,
            JobFlowInstancesConfig instances, List<StepConfig> steps)
            throws InterruptedException {
        AWSCredentials credentials = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY);
        emr = new AmazonElasticMapReduceClient(credentials);

        // run job flow
        RunJobFlowRequest request = new RunJobFlowRequest().withName(jobFlowName)
                .withLogUri(logUri)
                .withSteps(steps)
                .withInstances(instances);
        RunJobFlowResult result = emr.runJobFlow(request);

        // get job flow details
        String jobFlowId = result.getJobFlowId();
        boolean running = true;
        while (running) {
            Thread.sleep(SLEEP_TIME);

            List<String> jobFlowIdList = new ArrayList<String>();
            jobFlowIdList.add(jobFlowId);
            System.out.println(getJobFlowStatus(jobFlowIdList));

            // jobFlowIdList only ever holds one id, so checking the first cluster is enough.
            for (String clusterId : jobFlowIdList) {
                DescribeClusterRequest describeClusterRequest = new DescribeClusterRequest().withClusterId(clusterId);
                DescribeClusterResult describeClusterResult = emr.describeCluster(describeClusterRequest);
                Cluster cluster = describeClusterResult.getCluster();
                String state = cluster.getStatus().getState();
                if (state.contains("FAILED") ||
                        state.contains("COMPLETED") ||
                        state.contains("TERMINATED") ||
                        state.contains("SHUTTING_DOWN") ||
                        state.contains("WAITING"))
                    running = false;
                break;
            }
        }
        return jobFlowId;
    }

    // Builds the status report printed each polling round: one line per cluster,
    // then one indented line per step.
    private static String getJobFlowStatus(List<String> jobFlowIdList) {
        String info = "";
        for (String clusterId : jobFlowIdList) {
            DescribeClusterRequest describeClusterRequest = new DescribeClusterRequest().withClusterId(clusterId);
            DescribeClusterResult describeClusterResult = emr.describeCluster(describeClusterRequest);
            Cluster cluster = describeClusterResult.getCluster();
            info += cluster.getId() + "\t" + cluster.getStatus().getState() + "\t" + cluster.getName() + "\n";

            ListStepsRequest listStepsRequest = new ListStepsRequest().withClusterId(clusterId);
            ListStepsResult listStepsResult = emr.listSteps(listStepsRequest);
            for (StepSummary step : listStepsResult.getSteps()) {
                info += "\t" + step.getStatus().getState() + "\t" + step.getName() + "\n";
            }
        }
        return info;
    }

    // A 5-node cluster of m1.large instances that shuts down once the steps finish.
    private static JobFlowInstancesConfig createInstances() {
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withHadoopVersion("1.0.3")
                .withInstanceCount(5)
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withMasterInstanceType(InstanceType.M1Large.toString())
                .withSlaveInstanceType(InstanceType.M1Large.toString())
                .withPlacement(new PlacementType("us-east-1a"));
        return instances;
    }
}
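After the job flow terminates, the deduplicated result sits under OUTPUT_DIR as the usual part-r-* files. Assuming the AWS CLI again, they can be pulled down with something like:
$ aws s3 cp s3://bucketname/affair/output/ ./output/ --recursive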
