The previous post set up a Hadoop 2.6.0 pseudo-distributed environment in a Win7 virtual machine. To make development and debugging easier, this post describes how to set up a development environment in Eclipse and how to connect to the Hadoop cluster and submit jobs to it.

1. Environment

Eclipse version: Luna 4.4.1

Plugin: hadoop-eclipse-plugin-2.6.0.jar. After downloading it, simply copy it into the eclipse/plugins directory.
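For example, from a Windows command prompt (the download and Eclipse paths below are assumptions; adjust them to your own layout), then restart Eclipse so the plugin is loaded:

    copy C:\Downloads\hadoop-eclipse-plugin-2.6.0.jar C:\eclipse\plugins\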

2. Configuring the plugin

2.1 Configure the Hadoop home directory

Extract hadoop-2.6.0.tar.gz to C:\Downloads\hadoop-2.6.0, then set this directory as the Hadoop installation directory in Eclipse under Windows -> Preferences -> Hadoop Map/Reduce.

2.2 Configure the Hadoop location

Open Windows -> Open Perspective -> Map/Reduce; Hadoop development is done from this perspective.


Open Windows -> Show View -> Map/Reduce Locations. In that view, right-click and choose New Hadoop location… to create a new Hadoop connection.
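The values depend on your cluster; based on the addresses used later in this post, the location would be filled in roughly as follows (the location name is arbitrary):

    Location name:     hadoop-vm
    Map/Reduce Master: Host 192.168.62.129, Port 9001
    DFS Master:        Host 192.168.62.129, Port 9000
    User name:         vm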

After confirming the settings, Eclipse connects to the Hadoop cluster.

If the connection succeeds, the files in the HDFS cluster are listed under DFS Locations in the Project Explorer.
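To double-check, the same listing can be produced on the cluster side from the command line (the path is the user directory used later in this post):

    hdfs dfs -ls /user/vm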

3. Developing a Hadoop program

3.1 Writing the program

The example is a Sort program that sorts the input integers. The input format is one integer per line.
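For example, with hypothetical input files (the actual test files are not shown here), the job behaves like this:

    input (one integer per line):    output (sorted ascending):
    32                               1
    654                              5
    5                                32
    1                                654

The complete program is shown below.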

    package com.ccb;

    /**
     * Created by hp on 2015-7-20.
     */

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Sort {

        // Each input line is a single integer. The mapper converts the Text line
        // into an IntWritable and uses it as the map output key.
        public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
            private static IntWritable data = new IntWritable();

            // map function
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
                data.set(Integer.parseInt(line));
                context.write(data, new IntWritable(1));
            }
        }

        // The framework shuffles and sorts by key before reduce, so the reducer
        // simply emits each key once per occurrence.
        public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, Text> {

            // reduce function
            public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                for (IntWritable v : values) {
                    context.write(key, new Text(""));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // JobTracker address (deprecated key; mapped to mapreduce.jobtracker.address)
            conf.set("mapred.job.tracker", "192.168.62.129:9001");

            if (args.length != 2) {
                System.err.println("Usage: Data Sort <in> <out>");
                System.exit(2);
            }
            System.out.println(args[0]);
            System.out.println(args[1]);

            Job job = Job.getInstance(conf, "Data Sort");
            job.setJarByClass(Sort.class);

            // Mapper and Reducer classes
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            // The map output types differ from the final reduce output types,
            // so both pairs are declared explicitly.
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);

            // Input and output directories
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

3.2 Configuration files

Put log4j.properties and the core-site.xml from the Hadoop cluster on the classpath. My sample project is a Maven project, so they go into src/main/resources.
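A minimal log4j.properties is enough to get the console output shown below (a sketch; adjust the level and pattern as needed):

    log4j.rootLogger=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n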

At run time the program gets the HDFS address from core-site.xml.
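The entry it relies on is the default filesystem address, roughly as follows (on Hadoop 2.x the older key fs.default.name still works as a deprecated alias):

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.62.129:9000</value>
    </property>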

3.3 Running the program

Right-click the class and choose Run As -> Run Configurations…, fill in the input and output directories as program arguments, and click Run.
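With the paths used in this post, the two program arguments are:

    hdfs://192.168.62.129:9000/user/vm/sort_in hdfs://192.168.62.129:9000/user/vm/sort_out

Note that the output directory must not exist yet; FileOutputFormat refuses to start a job whose output directory is already present.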

Execution log:

 hdfs://192.168.62.129:9000/user/vm/sort_in
hdfs://192.168.62.129:9000/user/vm/sort_out
// :: INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
// :: INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
// :: WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
// :: WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
// :: INFO input.FileInputFormat: Total input paths to process :
// :: INFO mapreduce.JobSubmitter: number of splits:
// :: INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
// :: INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1592166400_0001
// :: INFO mapreduce.Job: The url to track the job: http://localhost:8080/
// :: INFO mapreduce.Job: Running job: job_local1592166400_0001
// :: INFO mapred.LocalJobRunner: OutputCommitter set in config null
// :: INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
// :: INFO mapred.LocalJobRunner: Waiting for map tasks
// :: INFO mapred.LocalJobRunner: Starting task: attempt_local1592166400_0001_m_000000_0
// :: INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
// :: INFO mapred.Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4c90dbc4
// :: INFO mapred.MapTask: Processing split: hdfs://192.168.62.129:9000/user/vm/sort_in/file1:0+25
// :: INFO mapred.MapTask: (EQUATOR) kvi ()
// :: INFO mapred.MapTask: mapreduce.task.io.sort.mb:
// :: INFO mapred.MapTask: soft limit at
// :: INFO mapred.MapTask: bufstart = ; bufvoid =
// :: INFO mapred.MapTask: kvstart = ; length =
// :: INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
// :: INFO mapred.LocalJobRunner:
// :: INFO mapred.MapTask: Starting flush of map output
// :: INFO mapred.MapTask: Spilling map output
// :: INFO mapred.MapTask: bufstart = ; bufend = ; bufvoid =
// :: INFO mapred.MapTask: kvstart = (); kvend = (); length = /
// :: INFO mapred.MapTask: Finished spill
// :: INFO mapred.Task: Task:attempt_local1592166400_0001_m_000000_0 is done. And is in the process of committing
// :: INFO mapred.LocalJobRunner: map
// :: INFO mapred.Task: Task 'attempt_local1592166400_0001_m_000000_0' done.
// :: INFO mapred.LocalJobRunner: Finishing task: attempt_local1592166400_0001_m_000000_0
// :: INFO mapred.LocalJobRunner: Starting task: attempt_local1592166400_0001_m_000001_0
// :: INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
// :: INFO mapred.Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@69e4d7d
// :: INFO mapred.MapTask: Processing split: hdfs://192.168.62.129:9000/user/vm/sort_in/file2:0+15
// :: INFO mapred.MapTask: (EQUATOR) kvi ()
// :: INFO mapred.MapTask: mapreduce.task.io.sort.mb:
// :: INFO mapred.MapTask: soft limit at
// :: INFO mapred.MapTask: bufstart = ; bufvoid =
// :: INFO mapred.MapTask: kvstart = ; length =
// :: INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
// :: INFO mapred.LocalJobRunner:
// :: INFO mapred.MapTask: Starting flush of map output
// :: INFO mapred.MapTask: Spilling map output
// :: INFO mapred.MapTask: bufstart = ; bufend = ; bufvoid =
// :: INFO mapred.MapTask: kvstart = (); kvend = (); length = /
// :: INFO mapred.MapTask: Finished spill
// :: INFO mapred.Task: Task:attempt_local1592166400_0001_m_000001_0 is done. And is in the process of committing
// :: INFO mapred.LocalJobRunner: map
// :: INFO mapred.Task: Task 'attempt_local1592166400_0001_m_000001_0' done.
// :: INFO mapred.LocalJobRunner: Finishing task: attempt_local1592166400_0001_m_000001_0
// :: INFO mapred.LocalJobRunner: Starting task: attempt_local1592166400_0001_m_000002_0
// :: INFO mapreduce.Job: Job job_local1592166400_0001 running in uber mode : false
// :: INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
// :: INFO mapreduce.Job: map % reduce %
// :: INFO mapred.Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4e931efa
// :: INFO mapred.MapTask: Processing split: hdfs://192.168.62.129:9000/user/vm/sort_in/file3:0+8
// :: INFO mapred.MapTask: (EQUATOR) kvi ()
// :: INFO mapred.MapTask: mapreduce.task.io.sort.mb:
// :: INFO mapred.MapTask: soft limit at
// :: INFO mapred.MapTask: bufstart = ; bufvoid =
// :: INFO mapred.MapTask: kvstart = ; length =
// :: INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
// :: INFO mapred.LocalJobRunner:
// :: INFO mapred.MapTask: Starting flush of map output
// :: INFO mapred.MapTask: Spilling map output
// :: INFO mapred.MapTask: bufstart = ; bufend = ; bufvoid =
// :: INFO mapred.MapTask: kvstart = (); kvend = (); length = /
// :: INFO mapred.MapTask: Finished spill
// :: INFO mapred.Task: Task:attempt_local1592166400_0001_m_000002_0 is done. And is in the process of committing
// :: INFO mapred.LocalJobRunner: map
// :: INFO mapred.Task: Task 'attempt_local1592166400_0001_m_000002_0' done.
// :: INFO mapred.LocalJobRunner: Finishing task: attempt_local1592166400_0001_m_000002_0
// :: INFO mapred.LocalJobRunner: map task executor complete.
// :: INFO mapred.LocalJobRunner: Waiting for reduce tasks
// :: INFO mapred.LocalJobRunner: Starting task: attempt_local1592166400_0001_r_000000_0
// :: INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
// :: INFO mapred.Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@
// :: INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2129404b
// :: INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=, maxSingleShuffleLimit=, mergeThreshold=, ioSortFactor=, memToMemMergeOutputsThreshold=
// :: INFO reduce.EventFetcher: attempt_local1592166400_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
// :: INFO reduce.LocalFetcher: localfetcher# about to shuffle output of map attempt_local1592166400_0001_m_000002_0 decomp: len: to MEMORY
// :: INFO reduce.InMemoryMapOutput: Read bytes from map-output for attempt_local1592166400_0001_m_000002_0
// :: INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: , inMemoryMapOutputs.size() -> , commitMemory -> , usedMemory ->
// :: INFO reduce.LocalFetcher: localfetcher# about to shuffle output of map attempt_local1592166400_0001_m_000000_0 decomp: len: to MEMORY
// :: INFO reduce.InMemoryMapOutput: Read bytes from map-output for attempt_local1592166400_0001_m_000000_0
// :: INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: , inMemoryMapOutputs.size() -> , commitMemory -> , usedMemory ->
// :: INFO reduce.LocalFetcher: localfetcher# about to shuffle output of map attempt_local1592166400_0001_m_000001_0 decomp: len: to MEMORY
// :: INFO reduce.InMemoryMapOutput: Read bytes from map-output for attempt_local1592166400_0001_m_000001_0
// :: INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: , inMemoryMapOutputs.size() -> , commitMemory -> , usedMemory ->
// :: INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
// :: INFO mapred.LocalJobRunner: / copied.
// :: INFO reduce.MergeManagerImpl: finalMerge called with in-memory map-outputs and on-disk map-outputs
// :: INFO mapred.Merger: Merging sorted segments
// :: INFO mapred.Merger: Down to the last merge-pass, with segments left of total size: bytes
// :: INFO reduce.MergeManagerImpl: Merged segments, bytes to disk to satisfy reduce memory limit
// :: INFO reduce.MergeManagerImpl: Merging files, bytes from disk
// :: INFO reduce.MergeManagerImpl: Merging segments, bytes from memory into reduce
// :: INFO mapred.Merger: Merging sorted segments
// :: INFO mapred.Merger: Down to the last merge-pass, with segments left of total size: bytes
// :: INFO mapred.LocalJobRunner: / copied.
// :: INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
// :: INFO mapred.Task: Task:attempt_local1592166400_0001_r_000000_0 is done. And is in the process of committing
// :: INFO mapred.LocalJobRunner: / copied.
// :: INFO mapred.Task: Task attempt_local1592166400_0001_r_000000_0 is allowed to commit now
// :: INFO output.FileOutputCommitter: Saved output of task 'attempt_local1592166400_0001_r_000000_0' to hdfs://192.168.62.129:9000/user/vm/sort_out/_temporary/0/task_local1592166400_0001_r_000000
// :: INFO mapred.LocalJobRunner: reduce > reduce
// :: INFO mapred.Task: Task 'attempt_local1592166400_0001_r_000000_0' done.
// :: INFO mapred.LocalJobRunner: Finishing task: attempt_local1592166400_0001_r_000000_0
// :: INFO mapred.LocalJobRunner: reduce task executor complete.
// :: INFO mapreduce.Job: map % reduce %
// :: INFO mapreduce.Job: Job job_local1592166400_0001 completed successfully
// :: INFO mapreduce.Job: Counters:
File System Counters
FILE: Number of bytes read=
FILE: Number of bytes written=
FILE: Number of read operations=
FILE: Number of large read operations=
FILE: Number of write operations=
HDFS: Number of bytes read=
HDFS: Number of bytes written=
HDFS: Number of read operations=
HDFS: Number of large read operations=
HDFS: Number of write operations=
Map-Reduce Framework
Map input records=
Map output records=
Map output bytes=
Map output materialized bytes=
Input split bytes=
Combine input records=
Combine output records=
Reduce input groups=
Reduce shuffle bytes=
Reduce input records=
Reduce output records=
Spilled Records=
Shuffled Maps =
Failed Shuffles=
Merged Map outputs=
GC time elapsed (ms)=
CPU time spent (ms)=
Physical memory (bytes) snapshot=
Virtual memory (bytes) snapshot=
Total committed heap usage (bytes)=
Shuffle Errors
BAD_ID=
CONNECTION=
IO_ERROR=
WRONG_LENGTH=
WRONG_MAP=
WRONG_REDUCE=
File Input Format Counters
Bytes Read=
File Output Format Counters
Bytes Written=

4. Possible problems

4.1 Permission problem: unable to access HDFS

Modify hdfs-site.xml on the cluster to turn off HDFS permission checking:

    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>

4.2 NullPointerException

In the environment variables, set %HADOOP_HOME% to C:\Downloads\hadoop-2.6.0.

Download winutils.exe and hadoop.dll into C:\Downloads\hadoop-2.6.0\bin.

Note: many articles online point to hadoop-common-2.2.0-bin-master.zip, but those binaries often do not work with Hadoop 2.6.0. Make sure to download builds that support Hadoop 2.6.0.
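If changing the system environment variable is inconvenient, the same effect can usually be achieved from code before the Job is created (a sketch; the path is the install directory from section 2.1):

    System.setProperty("hadoop.home.dir", "C:\\Downloads\\hadoop-2.6.0");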

4.3 The job fails to run

Run it with Run on Hadoop, not as a plain Java Application.
