This article documents the full process of building a Hadoop development and test environment with VMware Workstation 10, CentOS, and the HDP 2.0.6 distribution (Hadoop 2.2). I ran into quite a few problems along the way and lost a fair amount of time, so I am writing it up in the hope that it helps others.

Two virtual machines are used to build a real cluster; the operating system is CentOS 6.5. VMware Workstation's Easy Install mode can be used for the installation.

0. Install the CentOS 6.5 virtual machines



Follow the wizard to set the system user, CPU, memory, disk, and network. To give yum access to the Internet, choose bridged networking.


Then wait for the installation to finish (under 10 minutes on an SSD); VMware Tools is installed automatically during this process. Now we can start configuring the system and HDP.

1. Basic server setup

vim /etc/hosts
192.168.1.210   hdp01
192.168.1.220   hdp02
vim /etc/selinux/config
SELINUX=disabled
vim /etc/sysconfig/network
HOSTNAME=hdp01     # hdp01 on the first node, hdp02 on the second
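
These settings normally take effect after a reboot. As a small sketch (standard CentOS 6 commands, run on each node with its own hostname), they can also be applied to the running session and verified right away:

hostname hdp01        # apply the new hostname to the current session (hdp02 on the second node)
setenforce 0          # switch SELinux to permissive mode until the next reboot
getenforce            # verify the SELinux state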

Disable unnecessary services:

chkconfig NetworkManager off
chkconfig abrt-ccpp off
chkconfig abrtd off
chkconfig acpid off
chkconfig atd off
chkconfig bluetooth off
chkconfig cpuspeed off
chkconfig ip6tables off
chkconfig iptables off
chkconfig netconsole off
chkconfig netfs off
chkconfig postfix off
chkconfig restorecond off
chkconfig httpd off
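
Equivalently, the whole list can be disabled (and stopped for the current session) with a single loop; a sketch using the same service names as above:

for svc in NetworkManager abrt-ccpp abrtd acpid atd bluetooth cpuspeed \
           ip6tables iptables netconsole netfs postfix restorecond httpd; do
    chkconfig $svc off
    service $svc stop    # stop the service now as well as disabling it at boot
done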

Reboot once this is done.

2. Install Ambari on hdp01

(1) Download the HDP repo

Download the yum repo file provided by Hortonworks and copy it into /etc/yum.repos.d:

[root@hdp01 ~]# wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.1.61/ambari.repo
--2014-03-10 04:57:58--  http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.1.61/ambari.repo
Resolving public-repo-1.hortonworks.com... 54.230.127.224, 205.251.212.150, 54.230.124.207, ...
Connecting to public-repo-1.hortonworks.com|54.230.127.224|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 770 [binary/octet-stream]
Saving to: “ambari.repo”
100%[======================================>] 770         --.-K/s   in 0s      
2014-03-10 04:58:01 (58.8 MB/s) - “ambari.repo” saved [770/770]
[root@hdp01 ~]# cp ambari.repo /etc/yum.repos.d/

(2) Install ambari-server with yum

[root@hdp01 ~]# yum -y install ambari-server
...
Total download size: 49 M
Installed size: 113 M
....
Installed:
  ambari-server.noarch 0:1.4.1.61-1
Dependency Installed:
  postgresql.x86_64 0:8.4.20-1.el6_5    postgresql-libs.x86_64 0:8.4.20-1.el6_5    postgresql-server.x86_64 0:8.4.20-1.el6_5
Complete!
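
If yum cannot find the ambari-server package, first make sure the repo file was actually picked up; a quick sanity check:

[root@hdp01 ~]# yum repolist enabled | grep -i ambari      # an ambari repo should appear in the list
[root@hdp01 ~]# yum info ambari-server | head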


3. Set up passwordless SSH for root

Generate a key pair on both hdp01 and hdp02, then use ssh-copy-id to copy the public key to hdp01 and hdp02.

[root@hdp01 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
...
[root@hdp02 .ssh]# ssh-copy-id hdp01
The authenticity of host 'hdp01 (192.168.1.210)' can't be established.
RSA key fingerprint is 90:3b:db:2d:c4:34:49:03:e6:d7:cc:cb:b7:60:4d:d0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hdp01,192.168.1.210' (RSA) to the list of known hosts.
root@hdp01's password:
Now try logging into the machine, with "ssh 'hdp01'", and check in:
  .ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.

[root@hdp02 .ssh]# ssh-copy-id hdp02
The authenticity of host 'hdp02 (192.168.1.220)' can't be established.
RSA key fingerprint is 11:cb:c9:9e:b6:c0:a1:95:98:fa:42:aa:95:5f:cf:98.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hdp02,192.168.1.220' (RSA) to the list of known hosts.
root@hdp02's password:
Now try logging into the machine, with "ssh 'hdp02'", and check in:
  .ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
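
Before handing these keys to Ambari, it is worth confirming that root can reach every node without a password; a quick check to run from each host:

[root@hdp01 ~]# for h in hdp01 hdp02; do ssh root@$h hostname; done
# should print hdp01 and hdp02 without asking for a password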


4. Configure the Ambari server

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Here the Ambari server uses the bundled PostgreSQL database for its own metadata store.

[root@hdp01 ~]# ambari-server setup
Using python  /usr/bin/python2.6
Initializing...
Setup ambari-server
Checking SELinux...
SELinux status is 'disabled'
Customize user account for ambari-server daemon [y/n] (n)?
Adjusting ambari-server permissions and ownership...
Checking iptables...
Checking JDK...
To download the Oracle JDK you must accept the license terms found at http://www.oracle.com/technetwork/java/javase/terms/license/index.html and not accepting will cancel the Ambari Server setup.
Do you accept the Oracle Binary Code License Agreement [y/n] (y)?
Downloading JDK from http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-6u31-linux-x64.bin to /var/lib/ambari-server/resources/jdk-6u31-linux-x64.bin
JDK distribution size is 85581913 bytes
dk-6u31-linux-x64.bin... 100% (81.6 MB of 81.6 MB)
Successfully downloaded JDK distribution to /var/lib/ambari-server/resources/jdk-6u31-linux-x64.bin
Installing JDK to /usr/jdk64
Successfully installed JDK to /usr/jdk64/jdk1.6.0_31
Downloading JCE Policy archive from http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-6.zip to /var/lib/ambari-server/resources/jce_policy-6.zip
Successfully downloaded JCE Policy archive to /var/lib/ambari-server/resources/jce_policy-6.zip
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? y
==============================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
==============================================================================
Enter choice (1): 1
Database Name (ambari):
Username (ambari):
Enter Database Password (bigdata):
Default properties detected. Using built-in database.
Checking PostgreSQL...
Running initdb: This may take upto a minute.
About to start PostgreSQL
Configuring local database...
Connecting to the database. Attempt 1...
Configuring PostgreSQL...
Restarting PostgreSQL
Ambari Server 'setup' completed successfully.

Start the Ambari server as root:

[root@hdp01 ~]$ ambari-server start
Using python  /usr/bin/python2.6
Starting ambari-server
Unable to check iptables status when starting without root privileges.
Please do not forget to disable or adjust iptables if needed
Unable to check PostgreSQL server status when starting without root privileges.
Please do not forget to start PostgreSQL server.
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Ambari Server 'start' completed successfully.
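
To confirm the server really is up before moving on (ambari-server ships a status subcommand, and the web UI listens on port 8080 by default):

[root@hdp01 ~]# ambari-server status
[root@hdp01 ~]# curl -sI http://hdp01:8080 | head -n 1     # expect an HTTP status line once the UI is ready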


5. Install MySQL

MySQL (mysql-server) will be used to store the Hive metastore.

First install the remi repository (so that MySQL 5.5 can be installed through yum):

[root@hdp01 ~]# yum install -y epel-release
Installed:
  epel-release.noarch 0:6-8
Complete!
[root@hdp01 ~]# rpm -Uvh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
Retrieving http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
warning: /var/tmp/rpm-tmp.JSZuMv: Header V3 DSA/SHA1 Signature, key ID 00f97f56: NOKEY
Preparing...                ########################################### [100%]
   1:remi-release           ########################################### [100%]

[root@hdp01 ~]# yum install -y mysql-server
......
Total download size: 12 M
......
[root@hdp01 ~]# yum --enablerepo=remi,remi-test list mysql mysql-server
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
......
Available Packages
mysql.x86_64                  5.5.36-1.el6.remi
mysql-server.x86_64           5.5.36-1.el6.remi

[root@hdp01 ~]# yum --enablerepo=remi,remi-test install mysql mysql-server
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
......
Total download size: 20 M
......
[root@hdp01 ~]# chkconfig --level 235 mysqld on
[root@hdp01 ~]# service mysqld start
Starting mysqld:                                           [  OK  ]
[root@hdp01 ~]# /usr/bin/mysql_secure_installation
......
Enter current password for root (enter for none):
OK, successfully used password, moving on...
Change the root password? [Y/n] n
... skipping.
Remove anonymous users? [Y/n] Y
... Success!
Disallow root login remotely? [Y/n] Y
... Success!
Remove test database and access to it? [Y/n] Y
- Dropping test database...
... Success!
- Removing privileges on test database...
... Success!
Reload privilege tables now? [Y/n] Y
... Success!
All done!  If you've completed all of the above steps, your MySQL installation should now be secure.
Thanks for using MySQL!

Next, create the Hive database and user:

[root@hdp01 ~]# mysql -u root -p
mysql> create database hive;
Query OK, 1 row affected (0.00 sec)
mysql> create user "hive" identified by "hive123";
Query OK, 0 rows affected (0.00 sec)
mysql> grant all privileges on hive.* to hive;
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
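
If the Hive metastore service ends up running on a host other than the MySQL server, the hive user must also be allowed to connect remotely; a sketch (the '%' host pattern is the broadest choice and can be narrowed to the metastore host):

mysql> create user 'hive'@'%' identified by 'hive123';
mysql> grant all privileges on hive.* to 'hive'@'%';
mysql> flush privileges;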


6. Open the Ambari web UI in a browser and log in with admin/admin

http://hdp01:8080/#/login

Name your cluster: debugo_test

Stack: HDP 2.0.6

Target Hosts: hdp01,hdp02

Host Registration Information:


Since passwordless SSH for root was set up earlier, select the id_rsa private key file under /root/.ssh here, then click Register and Confirm to continue:

If registration fails with a "Local OS is not compatible with cluster primary OS" error caused by the os_type_check.sh script, this is a known bug; edit os_type_check.sh so that RES is forced to 0 just before the result is returned.
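
A minimal sketch of that workaround (the path to os_type_check.sh is the one shown in the Ambari error output, and the surrounding lines may differ between Ambari versions; the idea is just to force the check's result code to success):

# at the end of os_type_check.sh, just before the result is returned:
RES=0          # assumption: RES holds the check's exit status, as the error message suggests
exit $RES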


Once registration succeeds, ambari-agent is installed on each host and can be controlled with the ambari-agent command:

[root@hdp02 Desktop]# ambari-agent status
ambari-agent currently not running
Usage: /usr/sbin/ambari-agent {start|stop|restart|status}
# make ambari-agent start at boot on both hdp01 and hdp02
[root@hdp02 Desktop]# chkconfig --level 35 ambari-agent on
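
If the agent shows as not running, it can be started immediately on each node with the same command (the usage line above lists the available subcommands):

[root@hdp02 Desktop]# ambari-agent start
[root@hdp02 Desktop]# ambari-agent status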

The next step is to choose which components to install; here Nagios, Ganglia, and Oozie are skipped. For Hive, use the MySQL server installed earlier:


Also set YARN's yarn.acl.enable to false, then proceed to the Deploy step. This is a very long process; if a failure occurs along the way, just retry. The installation finished after about an hour:


After clicking Next you finally reach the long-awaited Dashboard, and by this point all the installed components are already running.

7. Configure the development environment

Download Eclipse 4.3 (Kepler) and maven-3.2.1 into /opt, then set the environment variables:

[root@hdp01 opt]# vim /etc/profile
export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
export MAVEN_HOME=/opt/apache-maven-3.2.1
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
[root@hdp01 opt]# chgrp -R hadoop apache-maven-3.2.1/ eclipse/ workspace/
[root@hdp01 opt]# useradd hadoop
[root@hdp01 opt]# echo "hadoop" | passwd --stdin hadoop
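
To pick up the new variables in the current shell and sanity-check the toolchain:

[root@hdp01 opt]# source /etc/profile
[root@hdp01 opt]# java -version      # expect 1.6.0_31 from /usr/jdk64
[root@hdp01 opt]# mvn -version       # expect Apache Maven 3.2.1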

Open Eclipse -> Help -> Install New Software and install the Maven plugin from http://download.eclipse.org/m2e-wtp/releases/kepler/ . After the installation, restart Eclipse and the Hadoop journey can officially begin.

8. Build WordCount

(1) Create a new Maven project


(2) Create a simple project (skip archetype selection)


(3) If a JRE-related warning appears:

Build path specifies execution environment J2SE-1.5. There are no JREs installed in the workspace that are strictly compatible with this environment.

Remove the J2SE-1.5 entry on the project's Properties page, then Add Library -> JRE System Library -> Workspace default JRE.


(4) WordCount.java

Create the WordCount class in the com.debugo.hadoop.mapred package:

package com.debugo.hadoop.mapred;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);

    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Edit pom.xml and add the dependencies. The coordinates can be looked up in the Maven repository (http://mvnrepository.com/artifact/org.apache.hadoop):

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.3.0</version>
        </dependency>
    </dependencies>
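
Once the dependencies are in place, the job can also be packaged and submitted from the command line instead of from Eclipse; a sketch (the jar name mr-0.0.1-SNAPSHOT.jar matches the one used in the next step and depends on the project's artifactId and version):

[root@hdp01 workspace]# mvn clean package
[root@hdp01 workspace]# hadoop jar target/mr-0.0.1-SNAPSHOT.jar com.debugo.hadoop.mapred.WordCount /input /output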

Note that running the job directly from Eclipse makes the map tasks fail because they cannot find WordCount's nested classes, so after mvn install the project's own jar has to be added back into the Maven project as a dependency:

mvn install:install-file -DgroupId=com.debugo.hadoop -DartifactId=mr -Dpackaging=jar -Dversion=0.1 -Dfile=mr-0.0.1-SNAPSHOT.jar -DgeneratePom=true

Then add:

<dependency>
    <groupId>com.debugo.hadoop</groupId>
    <artifactId>mr</artifactId>
    <version>0.1</version>
</dependency>

The approach described at http://www.cnblogs.com/spork/archive/2010/04/21/1717592.html is also a good solution.

Edit the Run Configuration and set the program arguments to "/input /output".

Then create the /input directory: hdfs dfs -mkdir /input

Use hdfs dfs -put a.txt /input to upload some text into that directory.

Finally, run the project; on success the result is written to the /output directory in HDFS.
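
Putting the HDFS steps together and checking the result afterwards (a sketch; a.txt is any local text file, and part-r-00000 is the usual output file name for a single-reducer job):

hdfs dfs -mkdir /input
hdfs dfs -put a.txt /input
# ... run the job ...
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000 | head

The counters printed at the end of a successful run are shown below: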

[2014-03-13 09:52:20,282] INFO 19952[main] - org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1380) - Counters: 49
    File System Counters
        FILE: Number of bytes read=5263
        FILE: Number of bytes written=183603
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=6739
        HDFS: Number of bytes written=3827
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3075
        Total time spent by all reduces in occupied slots (ms)=6294
        Total time spent by all map tasks (ms)=3075
        Total time spent by all reduce tasks (ms)=3147
        Total vcore-seconds taken by all map tasks=3075
        Total vcore-seconds taken by all reduce tasks=3147
        Total megabyte-seconds taken by all map tasks=4723200
        Total megabyte-seconds taken by all reduce tasks=9667584
    Map-Reduce Framework
        Map input records=144
        Map output records=960
        Map output bytes=10358
        Map output materialized bytes=5263
        Input split bytes=104
        Combine input records=960
        Combine output records=361
        Reduce input groups=361
        Reduce shuffle bytes=5263
        Reduce input records=361
        Reduce output records=361
        Spilled Records=722
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=26
        CPU time spent (ms)=2290
        Physical memory (bytes) snapshot=1309593600
        Virtual memory (bytes) snapshot=8647901184
        Total committed heap usage (bytes)=2021654528
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=6635
    File Output Format Counters
        Bytes Written=3827

^^

References:

Installing MySQL 5.5 with yum: http://www.linuxidc.com/Linux/2012-07/65098.htm

Official HDP documentation

Canon's guide to building Hadoop 1.x projects with Maven: http://blog.fens.me/hadoop-maven-eclipse/
