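Installing Hama and Running an Example

Introduction to Hama

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework modeled on Google's Pregel. It is intended for large-scale scientific computing, especially matrix and graph computations.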

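The BSP model was proposed in 1990 by Valiant (winner of the 2010 Turing Award); see Wikipedia for details. In 2009 Google announced Pregel; the paper "Pregel: A System for Large-Scale Graph Processing" (full version at SIGMOD 2010) describes an implementation of the BSP model in a distributed setting.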
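To make the BSP idea concrete, here is a minimal single-process sketch of the superstep cycle (local computation, message exchange, barrier). It is only a toy simulation written for this note: the function name run_bsp and the "propagate the maximum value" task are illustrative assumptions, and it does not use Hama's API.

```python
# Toy simulation of BSP supersteps in one process (not Hama's API).
# Each "peer" holds one value; after enough supersteps every peer
# knows the global maximum.

def run_bsp(values, num_supersteps=2):
    peers = list(range(len(values)))
    state = dict(enumerate(values))
    inbox = {p: [] for p in peers}

    for _ in range(num_supersteps):
        outbox = {p: [] for p in peers}
        for p in peers:
            # 1. Local computation on messages received in the previous superstep.
            if inbox[p]:
                state[p] = max(state[p], max(inbox[p]))
            # 2. Communication: send the current value to every other peer.
            for q in peers:
                if q != p:
                    outbox[q].append(state[p])
        # 3. Barrier synchronization: messages only become visible
        #    at the start of the next superstep.
        inbox = outbox
    return state


if __name__ == "__main__":
    print(run_bsp([3, 9, 4]))
```

The point of the sketch is the barrier: a peer never sees a message in the same superstep in which it was sent.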

Installing Hama

Installation environment:

OS: Ubuntu 12.04 64-bit

JAVA: jdk1.6.0_30

Hadoop: hadoop-1.0.4

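Before installing Hama, make sure Hadoop is already installed on the system. I use hadoop-1.0.4, the latest version at the time of writing.

Step 1: Download and extract the files

Hama can be downloaded from http://mirror.bit.edu.cn/apache/hama/0.6.0/ (I use the Apache mirror hosted by Beijing Institute of Technology).

Extract the archive into the installation directory. I like to install both Hadoop and Hama under my home directory, which keeps the rest of the system clean.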

tar -xvzf hama-0.6.0.tar.gz

Step 2: Edit the configuration files

Go to the $HAMA_HOME/conf directory.

Edit hama-env.sh and set the JAVA_HOME variable.

Edit hama-site.xml. My hama-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>bsp.master.address</name>
    <value>LenovoE46a:40000</value>
    <description>The address of the bsp master server. Either the
    literal string "local" or a host:port for distributed mode
    </description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://LenovoE46a:9000/</value>
    <description>
      The name of the default file system. Either the literal string
      "local" or a host:port for HDFS.
    </description>
  </property>
  <property>
    <name>hama.zookeeper.quorum</name>
    <value>LenovoE46a</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
    By default this is set to localhost for local and pseudo-distributed modes
    of operation. For a fully-distributed setup, this should be set to a full
    list of ZooKeeper quorum servers. If HAMA_MANAGES_ZK is set in hama-env.sh
    this is the list of servers which we will start/stop zookeeper on.
    </description>
  </property>
  <property>
    <name>hama.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>

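A few words of explanation: bsp.master.address is the address of the BSP master. fs.default.name is the address of the Hadoop NameNode. hama.zookeeper.quorum and hama.zookeeper.property.clientPort relate to ZooKeeper: set them to the ZooKeeper quorum servers and their client port. In a single-machine pseudo-distributed setup, the quorum is simply the local host.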
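A quick way to double-check such a file is to parse it and look at the name/value pairs. The sketch below uses only Python's standard library; the helper names load_properties and split_host_port are made up for this illustration, and the embedded sample is a trimmed copy of the configuration (descriptions omitted), not something Hama itself provides.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of a hama-site.xml (descriptions left out for brevity).
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>bsp.master.address</name>
    <value>LenovoE46a:40000</value>
  </property>
  <property>
    <name>hama.zookeeper.quorum</name>
    <value>LenovoE46a</value>
  </property>
  <property>
    <name>hama.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>"""


def load_properties(xml_text):
    """Return {name: value} for every <property> element."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


def split_host_port(address):
    """Split 'host:port' into (host, port); raises ValueError if malformed."""
    host, sep, port = address.rpartition(":")
    if not sep or not host:
        raise ValueError("expected host:port, got %r" % address)
    return host, int(port)


if __name__ == "__main__":
    props = load_properties(SAMPLE)
    print(split_host_port(props["bsp.master.address"]))
    print(props["hama.zookeeper.quorum"].split(","))
```

To check a real file, read $HAMA_HOME/conf/hama-site.xml into a string and pass it to load_properties instead of SAMPLE.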

Step 3: Start Hama

First start Hadoop:

% $HADOOP_HOME/bin/start-all.sh

Then start Hama:

% $HAMA_HOME/bin/start-bspd.sh

List the running Java processes to check that everything started successfully.

jps

Step 4: Run an example program

Here we use the PageRank example.

First upload the data to HDFS. The data format is:

Site1\tSite2\tSite3

Site2\tSite3

Site3
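As I read this format, each line is one vertex followed by the vertices it links to, separated by tabs. The following sketch builds and parses such text; the function names and the in-memory dictionary are assumptions for illustration and are not part of the Hama examples.

```python
# Sketch: build and parse the tab-separated adjacency text shown above.
# Assumed reading of the format: first column is a site, the remaining
# columns are the sites it links to.

def to_adjacency_lines(links):
    """links: {site: [outgoing links]} -> text with one site per line."""
    lines = ["\t".join([site] + list(outlinks)) for site, outlinks in links.items()]
    return "\n".join(lines) + "\n"


def parse_adjacency_lines(text):
    """Inverse of to_adjacency_lines; blank lines are skipped."""
    graph = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        site, *outlinks = line.split("\t")
        graph[site] = outlinks
    return graph


if __name__ == "__main__":
    links = {"Site1": ["Site2", "Site3"], "Site2": ["Site3"], "Site3": []}
    with open("input.txt", "w") as f:  # local file, to be uploaded to HDFS afterwards
        f.write(to_adjacency_lines(links))
```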

Run the Hama job, where /tmp/input/input.txt is the input file and /tmp/pagerank-output is the output directory.

bin/hama jar ../hama-0.6.0-examples.jar pagerank /tmp/input/input.txt /tmp/pagerank-output

Success!

Week 4 Summary

What I did:

1. Implemented a single-source shortest path algorithm on a five-node graph in Eclipse

Results:

Input file:

1 0|2|2,10,4,5,

2 10|1|3,1,4,2,

3 MAX|0|5,4,

4 5|1|5,2,3,9,2,3,

5 MAX|0|3,6,1,7,

Final result after the iterations:

1 0|2|2,10,4,5,

2 8|2|3,1,4,2,

3 9|2|5,4,

4 5|2|5,2,3,9,2,3,

5 7|2|3,6,1,7,
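The records above appear to follow the layout "node distance|status|neighbor,weight,neighbor,weight,...", with MAX meaning "not reached yet". Under that assumption (the status codes 0/1/2 are not explained here, so the sketch simply carries them along), a small plain-Python check that relaxes edges until nothing changes gives the same final distances. This is not the MapReduce implementation, only a way to verify the numbers.

```python
INF = float("inf")

# The input file from above, one record per node.
INPUT = """\
1 0|2|2,10,4,5,
2 10|1|3,1,4,2,
3 MAX|0|5,4,
4 5|1|5,2,3,9,2,3,
5 MAX|0|3,6,1,7,
"""


def parse_record(line):
    """Parse 'node dist|status|n1,w1,n2,w2,' into (node, dist, status, edges)."""
    node, rest = line.split(None, 1)
    dist, status, adjacency = rest.strip().split("|")
    numbers = [int(x) for x in adjacency.split(",") if x]
    edges = list(zip(numbers[0::2], numbers[1::2]))  # (neighbor, weight) pairs
    return int(node), (INF if dist == "MAX" else int(dist)), int(status), edges


def shortest_paths(text):
    """Relax every edge repeatedly until no distance improves."""
    graph, dist = {}, {}
    for line in text.splitlines():
        if line.strip():
            node, d, _status, edges = parse_record(line)
            graph[node], dist[node] = edges, d
    changed = True
    while changed:
        changed = False
        for node, edges in graph.items():
            for neighbor, weight in edges:
                if dist[node] + weight < dist[neighbor]:
                    dist[neighbor] = dist[node] + weight
                    changed = True
    return dist


if __name__ == "__main__":
    print(shortest_paths(INPUT))
```

Each pass of the outer loop plays roughly the role of one MapReduce iteration: distances are pushed along edges and the minimum per node is kept.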

Intermediate output file written after the first map:

1 0|2|2,10,4,5,

2 10|1|

2 MAX|0|3,1,4,2,

2 10|1|

2 MAX|0|3,1,4,2,

3 MAX|0|5,4,

4 5|1|

4 MAX|0|5,2,3,9,2,3,

4 5|1|

4 MAX|0|5,2,3,9,2,3,

5 MAX|0|3,6,1,7,

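Output directory: since mapred.local.dir is not configured, it takes its default value, ${hadoop.tmp.dir}/mapred/local,

that is, a location under /usr/local/hadoop/tmp on the datanode. These files are deleted as soon as the reduce has consumed them or the job stops.

2. Based on my understanding of how MapReduce works, here are some optimizations I think are worth trying: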

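(1) Define a custom combiner function that merges the output on the map task's node first, reducing the amount of data transferred to the reducers. In this example, identical <k,v> pairs in the intermediate map output above can be merged. Compressing the data is another option.

(2) Have the InputFormat preprocess the data; the number of splits determines the number of map tasks.

(3) Define a custom Partitioner function to control which reduce task each key is sent to. The default is hash(key) mod R, which gives fairly balanced partitions.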
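Points (1) and (3) can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not Hadoop's Combiner or Partitioner API: the function names are made up, the combiner here keeps the smallest tentative distance per key (one plausible choice for shortest paths; merging exact duplicates is a special case of it), and integer keys are assumed so that hash(key) is stable.

```python
# Sketch of a combiner and of hash partitioning (not Hadoop's API).

def combine_min(pairs):
    """Merge map output on the map side: keep one (key, distance) per key,
    choosing the smallest distance. Duplicate pairs collapse into one."""
    best = {}
    for key, distance in pairs:
        if key not in best or distance < best[key]:
            best[key] = distance
    return sorted(best.items())


def partition(key, num_reducers):
    """Default-style partitioning: hash(key) mod R.
    For small Python ints, hash(key) is the integer itself."""
    return hash(key) % num_reducers


if __name__ == "__main__":
    # Tentative distances emitted for nodes 2 and 4, with duplicates.
    map_output = [(2, 10), (4, 5), (2, 10), (4, 5)]
    combined = combine_min(map_output)
    print(combined)  # fewer pairs to send over the network
    for key, _ in combined:
        print(key, "-> reducer", partition(key, 3))
```

A custom partitioner would replace the body of partition with whatever rule assigns keys to reducers, for example grouping neighboring node ids together.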