In the code below, the AffairClient class contains three inner classes, corresponding to the Mapper class, the Reducer class, and the driver class that a Hadoop MapReduce program needs.
The remaining methods of AffairClient configure and run the EMR job flow.
You can adapt the program by changing the relevant parameters: for example, modifying the map and reduce functions, adding a combiner class, or changing the cluster size.
This sample is a deduplication MapReduce program; see the map and reduce functions for details.
The project is a Maven project, and because the Hadoop program runs on AWS EMR, it needs dependencies on both the AWS SDK and hadoop-client:
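The deduplication idea itself is independent of Hadoop: the mapper emits each line as a key with an empty value, the shuffle sorts and groups equal keys, and the reducer writes each distinct key once. A minimal plain-Java sketch of that pipeline (DedupSketch and dedup are illustrative names, not part of the EMR code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Plain-Java illustration of the dedup job: the "map" step keys each line,
// the shuffle sorts and groups equal keys, and the "reduce" step emits each
// distinct key exactly once.
public class DedupSketch {
    public static List<String> dedup(List<String> lines) {
        // A TreeSet mimics the shuffle: equal keys collapse into one group,
        // and keys come out in sorted order, as in a MapReduce job.
        TreeSet<String> shuffled = new TreeSet<>(lines);
        return new ArrayList<>(shuffled);
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("b", "a", "b", "c", "a")));
        // prints [a, b, c]
    }
}
```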

<!-- aws -->
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.10.26</version>
</dependency>
<!-- hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
</dependency>

Another situation that can come up is that the CLASSPATH of the environment where the Java program runs (possibly some remote server) does not contain the AWS Java SDK or hadoop-client jars, in which case running the program throws a ClassNotFoundException. To avoid this, add the following under build -> plugins in pom.xml so that the dependency jars are also packed into the jar during mvn package:

<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
            <manifest>
                <mainClass></mainClass>
            </manifest>
        </archive>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Before running the program, place the input files under the directory named by the global variable INPUT_DIR, and upload the jar, named JAR_NAME, to the JAR_DIR directory.
The Maven project can be packaged into a jar on the server with:
$ mvn clean install
Assuming the jar has been renamed to affair.jar and uploaded to the corresponding S3 location, the AWS EMR MapReduce job can then be started from the directory containing affair.jar with:
$ java -cp affair.jar AffairClient
The terminal output looks roughly like this:

log4j:WARN No appenders could be found for logger (com.amazonaws.AmazonWebServiceClient).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
PENDING AffairClient$AffairJob Affair j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
PENDING AffairClient$AffairJob Affair j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
PENDING AffairClient$AffairJob Affair j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
PENDING AffairClient$AffairJob Affair j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
PENDING AffairClient$AffairJob Affair j-61EJHJXPSR2L STARTING this job flow runs a mapreduce affair.
PENDING AffairClient$AffairJob Affair j-61EJHJXPSR2L RUNNING this job flow runs a mapreduce affair.
RUNNING AffairClient$AffairJob Affair j-61EJHJXPSR2L RUNNING this job flow runs a mapreduce affair.
RUNNING AffairClient$AffairJob Affair j-61EJHJXPSR2L RUNNING this job flow runs a mapreduce affair.
RUNNING AffairClient$AffairJob Affair j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
COMPLETED AffairClient$AffairJob Affair j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
COMPLETED AffairClient$AffairJob Affair j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
COMPLETED AffairClient$AffairJob Affair j-61EJHJXPSR2L TERMINATING this job flow runs a mapreduce affair.
COMPLETED AffairClient$AffairJob Affair j-61EJHJXPSR2L TERMINATED this job flow runs a mapreduce affair.
COMPLETED AffairClient$AffairJob Affair

If something goes wrong (for example, the job can fail when the output directory already exists), check the corresponding syslog under EMR in the AWS web management console.
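A pre-existing output directory is one of the most common causes of failure. Besides checking the syslog, a low-tech way to avoid the collision (a sketch; OutputPaths and timestamped are hypothetical names, and OUTPUT_DIR would need to use the generated path) is to give each run a fresh, timestamped output path:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derive a per-run output path so a job never fails
// because the S3 output directory already exists from a previous run.
public class OutputPaths {
    public static String timestamped(String baseDir, LocalDateTime now) {
        String stamp = now.format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"));
        // Drop a trailing slash before appending the timestamp.
        String trimmed = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return trimmed + "-" + stamp + "/";
    }

    public static void main(String[] args) {
        System.out.println(timestamped("s3://bucketname/affair/output/", LocalDateTime.now()));
    }
}
```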
The code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
import com.amazonaws.services.elasticmapreduce.model.Cluster;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.ListStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.ListStepsResult;
import com.amazonaws.services.elasticmapreduce.model.PlacementType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepSummary;
import com.amazonaws.services.elasticmapreduce.model.TerminateJobFlowsRequest;

public class AffairClient {

    private static AmazonElasticMapReduceClient emr;

    private static final long SLEEP_TIME = 1000 * 30;
    private static final String JAR_DIR = "s3://bucketname/affair/";
    private static final String JAR_NAME = "affair.jar";
    private static final String INPUT_DIR = "s3://bucketname/affair/input/";
    private static final String OUTPUT_DIR = "s3://bucketname/affair/output/";
    private static final String LOG_DIR = "s3://bucketname/affair/log/";
    private static final String JOB_FLOW_NAME = "this job flow runs a mapreduce affair.";
    private static final String AWS_ACCESS_KEY = "YOUR_AWS_ACCESS_KEY";
    private static final String AWS_SECRET_KEY = "YOUR_AWS_SECRET_KEY";

    // Mapper: emit each input line as a key with an empty value.
    public static class AffairMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value), new Text(""));
        }
    }

    // Reducer: identical lines arrive grouped under one key, so writing each
    // key once removes the duplicates.
    public static class AffairReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    // Driver class: configures and runs the MapReduce job on the cluster.
    public static class AffairJob extends Configured implements Tool {
        public int run(String[] arg0) throws Exception {
            Configuration conf = getConf();
            conf.set("mapred.reduce.tasks", "" + 1);

            Job job = new Job(conf, "Affair MR job");
            job.setJarByClass(AffairJob.class);
            job.setMapperClass(AffairMapper.class);
            job.setReducerClass(AffairReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            //job.setNumReduceTasks(1);

            FileInputFormat.addInputPath(job, new Path(INPUT_DIR));
            FileOutputFormat.setOutputPath(job, new Path(OUTPUT_DIR));

            job.waitForCompletion(true);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new AffairJob(), args);
            System.exit(exitCode);
        }
    }

    public static void main(String[] args) {
        // emr jobflow
        try {
            String mainClass = AffairJob.class.getName();
            String stepName = mainClass + " Affair";
            runStep(mainClass, JAR_NAME, stepName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Builds the step config for the jar on S3 and submits it as a job flow.
    private static void runStep(String mainClass, String jarName, String stepName)
            throws InterruptedException {
        String jarPath = JAR_DIR + JAR_NAME;
        HadoopJarStepConfig hadoopJarStep = new HadoopJarStepConfig(jarPath);
        hadoopJarStep.setMainClass(mainClass);
        hadoopJarStep.setArgs(null);

        StepConfig step = new StepConfig().withName(stepName)
                .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
                .withHadoopJarStep(hadoopJarStep);

        String logUri = LOG_DIR;
        JobFlowInstancesConfig instances = createInstances();
        List<StepConfig> steps = new ArrayList<StepConfig>();
        steps.add(step);
        String jobFlowId = CreateJobFlow(JOB_FLOW_NAME, logUri, instances, steps);
        terminateJobFlow(jobFlowId);
    }

    private static void terminateJobFlow(String jobFlowId) {
        TerminateJobFlowsRequest request =
                new TerminateJobFlowsRequest().withJobFlowIds(jobFlowId);
        emr.terminateJobFlows(request);
    }

    // Starts the job flow, then polls its state every SLEEP_TIME milliseconds
    // until it reaches a finished state.
    private static String CreateJobFlow(String jobFlowName, String logUri,
            JobFlowInstancesConfig instances, List<StepConfig> steps)
            throws InterruptedException {
        AWSCredentials credentials = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY);
        emr = new AmazonElasticMapReduceClient(credentials);

        // run job flow
        RunJobFlowRequest request = new RunJobFlowRequest().withName(jobFlowName)
                .withLogUri(logUri)
                .withSteps(steps)
                .withInstances(instances);
        RunJobFlowResult result = emr.runJobFlow(request);

        // get job flow details
        String jobFlowId = result.getJobFlowId();
        boolean running = true;
        while (running) {
            Thread.sleep(SLEEP_TIME);
            List<String> jobFlowIdList = new ArrayList<String>();
            jobFlowIdList.add(jobFlowId);
            System.out.println(getJobFlowStatus(jobFlowIdList));
            for (String clusterId : jobFlowIdList) {
                DescribeClusterRequest describeClusterRequest =
                        new DescribeClusterRequest().withClusterId(clusterId);
                DescribeClusterResult describeClusterResult =
                        emr.describeCluster(describeClusterRequest);
                Cluster cluster = describeClusterResult.getCluster();
                String state = cluster.getStatus().getState();
                if (state.contains("FAILED") || state.contains("COMPLETED")
                        || state.contains("TERMINATED") || state.contains("SHUTTING_DOWN")
                        || state.contains("WAITING")) {
                    running = false;
                }
                break;
            }
        }
        return jobFlowId;
    }

    // Formats the cluster state plus the state of each of its steps.
    private static String getJobFlowStatus(List<String> jobFlowIdList) {
        String info = "";
        for (String clusterId : jobFlowIdList) {
            DescribeClusterRequest describeClusterRequest =
                    new DescribeClusterRequest().withClusterId(clusterId);
            DescribeClusterResult describeClusterResult =
                    emr.describeCluster(describeClusterRequest);
            Cluster cluster = describeClusterResult.getCluster();
            info += cluster.getId() + "\t" + cluster.getStatus().getState()
                    + "\t" + cluster.getName() + "\n";
            ListStepsRequest listStepsRequest = new ListStepsRequest().withClusterId(clusterId);
            ListStepsResult listStepsResult = emr.listSteps(listStepsRequest);
            for (StepSummary step : listStepsResult.getSteps()) {
                info += "\t" + step.getStatus().getState() + "\t" + step.getName() + "\n";
            }
        }
        return info;
    }

    // Cluster size and instance types: adjust these to resize the cluster.
    private static JobFlowInstancesConfig createInstances() {
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withHadoopVersion("1.0.3")
                .withInstanceCount(5)
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withMasterInstanceType(InstanceType.M1Large.toString())
                .withSlaveInstanceType(InstanceType.M1Large.toString())
                .withPlacement(new PlacementType("us-east-1a"));
        return instances;
    }
}
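The stop condition inside CreateJobFlow's polling loop is substring matching on the cluster state, which is easy to get subtly wrong (note that "TERMINATED" does not match while the cluster is still TERMINATING, which is why the sample output keeps polling through the TERMINATING rows). Pulled out into a standalone method, the check can be exercised without touching EMR (ClusterStates and isFinished are illustrative names, not part of the code above):

```java
// Illustrative extraction of the state check in the polling loop above:
// polling stops once the cluster has failed, finished, been terminated,
// is shutting down, or is idling in WAITING with no steps left to run.
public class ClusterStates {
    public static boolean isFinished(String state) {
        return state.contains("FAILED")
                || state.contains("COMPLETED")
                || state.contains("TERMINATED")
                || state.contains("SHUTTING_DOWN")
                || state.contains("WAITING");
    }

    public static void main(String[] args) {
        System.out.println(isFinished("RUNNING"));     // false: keep polling
        System.out.println(isFinished("TERMINATED"));  // true: stop polling
    }
}
```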
