https://www.java.net/forum/topic/jxta/jxta-community-forum/hadoop-port-jxta-p2p-framework

——————————————————————————————————————————————————————————————————————

besn0847

Hi,

A few months ago I started a port of Hadoop DFS to JXTA, so that I can share files across all my PCs with automated replication.

The current version is 0.6.0 and is quite stable. I still have to finalize the Windows port, but it works fine on other platforms.

Source : https://github.com/besn0847/Jxtadoop
Info : http://fbe-big-data.blogspot.fr/
Binaries : http://sourceforge.net/projects/jxtadoop/

Feel free to comment / criticize ...

Franck

——————————————————————————————————————————

Description

Hadoop is designed to work in large datacenters with thousands of servers connected to each other in the Hadoop cloud. This project focuses on the Distributed File System part of Hadoop (HDFS).

The goal of this project is to provide an alternative to the direct IP
connectivity that Hadoop requires. Instead, the DFS layer has been
modified to use a peer-to-peer framework which allows direct connectivity
inside datacenters as well as indirect connectivity to get around firewall
constraints.

The typical use case is a set of servers in various DMZs, each with hundreds
of gigabytes of unused storage, which can be leveraged to provide a
massive storage cloud for Hadoop.

The first release of Jxtadoop focused on providing the data storage layer with P2P capabilities based on the JXTA framework.

——————————————————————————————————————————————————————

History

Genesis

This project started about a year and a half ago, when I thought about creating a private Hadoop cloud to leverage existing servers spread across different sites. The initial goal was to create a private storage cloud using HDFS.

This cloud was set up using OpenVPN between all the sites, but this was not ideal because:

  • Using a VPN means all the traffic flows through one central place, even when the nodes are on the same LAN;
  • This solution requires installing a VPN on remote servers which could not easily be controlled.

Hence the decision to look into an alternative to the direct IP connectivity required by Hadoop nodes.

Concept

The concept is quite simple: the IP connectivity layer is replaced by a P2P one which can handle either direct connections (through the use of multicast) or indirect connections (through the use of relays and rendezvous peers).

——————————————————————————————————————————————————

Peer-to-Peer

This wiki section describes the peer-to-peer layer chosen for Jxtadoop.

Multiple peer-to-peer frameworks exist today and some of them are Java-based. JXTA has been developed since 2001; its current version is 2.7, released in the first half of 2011. It is one of the most comprehensive Java P2P frameworks, even though it is quite complex and not currently under active development.

More information can be found at :

I also recommend Jérôme Verstrynge's book Practical JXTA II (http://amzn.to/zIQ8NH), which is a very good introduction.

Several reasons drove the choice of JXTA:

  1. JXTA is a longstanding P2P framework (10 years) with recent updates in 2011
  2. JXTA is developed in Java, which makes integration with Hadoop seamless
  3. JXTA can cope with LANs and enables direct communications through the use of multicasting
  4. For indirect communications (firewalls, NAT, internet...), JXTA provides the rendezvous and relay peer infrastructure to enable those communications
  5. JXTA provides socket capabilities which can replace the Hadoop sockets without requiring in-depth rework of the Hadoop code
  6. Communications in JXTA can be fully authorized and encrypted to secure traffic leaving the corporate LAN
  7. JXTA provides the PeerGroup concept, which can be used to isolate datanodes ...

However, if you want to support this project and start diving into JXTA, you need to know that support is pretty limited, the documentation is quite poor and the community is small. So the investment required can be significant.

————————————————————————————————————————————————————————

Architecture

The JXTA P2P layer has been implemented alongside the Namenode and Datanodes. This layer uses only the basic JXTA features. A single PeerGroup is dedicated to both the RPC and DATA communications between the NN and the DNs.

Security features will be added in the future along with multiple peer groups to isolate and secure RPC comms from DATA comms.

At this P2P level, a monitor is implemented to detect datanodes when they connect and disconnect. A notification is then sent to the Namenode, which updates the datanode hosts map accordingly.
It has been designed that way because many storage nodes may connect and disconnect quite often.

On top of this P2P layer, JXTA sockets have been used to minimize the rework at the Namenode and DataNode level.
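
As a rough illustration of why working at the socket level keeps the rework small: Hadoop obtains its client sockets through the standard javax.net.SocketFactory, so code that asks a factory for a java.net.Socket does not need to know what kind of socket it gets back. The sketch below is only an assumption about how such a substitution point can look, not the way Jxtadoop actually does it. PluggableSocketFactory is a hypothetical name, and it returns plain sockets so that it runs without JXTA on the classpath.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;
import javax.net.SocketFactory;

/**
 * Hypothetical factory showing the substitution point. This version returns
 * plain java.net.Socket objects; a P2P variant would return a Socket subclass
 * backed by the P2P layer instead, and callers would stay unchanged.
 */
class PluggableSocketFactory extends SocketFactory {
    @Override
    public Socket createSocket() throws IOException {
        return new Socket(); // unconnected; the caller connects it later
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return new Socket(host, port);
    }

    @Override
    public Socket createSocket(String host, int port,
                               InetAddress localHost, int localPort) throws IOException {
        return new Socket(host, port, localHost, localPort);
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return new Socket(host, port);
    }

    @Override
    public Socket createSocket(InetAddress address, int port,
                               InetAddress localAddress, int localPort) throws IOException {
        return new Socket(address, port, localAddress, localPort);
    }
}
```

Calling code only ever sees the SocketFactory and Socket types, which is what makes a swap of the underlying transport cheap.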

For the first version the following components have been removed : Balancer, Secondary Namenode and Jetty server.

——————————————————————————————————————————————————————————

Code Changes

The following code changes have been made to Hadoop DFS 0.20.2. The next version will be based on Hadoop 1.0.0.

P2P Infrastructure

JXTA layer deployed with a unique peergroup for all comms :
. NN-to-DN : RPC
. DN-to-DN : RPC + data comms

Hadoop RPC Server

The RPC server classes have been modified to support JXTA sockets.

Hadoop Data Block Server

The socket server used to exchange data blocks has been modified to support JXTA sockets.

Components Removal

The following components have been removed from the first version :
. Balancer
. Secondary Namenode
. Http Web Server

Local Buffering

When the FsShell in use has a co-located Datanode, the file is first written to
the local DN only. Replication then takes place in the background.
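
A minimal sketch of this local-first behaviour, assuming datanodes are identified by simple names. LocalFirstPlacement and its methods are hypothetical and are not taken from the Jxtadoop sources. The fallback branch (no co-located datanode) is also an assumption, since the text above only describes the co-located case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of local-first block placement; names are illustrative only. */
class LocalFirstPlacement {
    /**
     * Returns the datanodes that receive the initial write.
     * With a co-located datanode, only that node is chosen and the remaining
     * replicas are left to background replication. Otherwise (assumed fallback)
     * the first `replication` known datanodes are chosen.
     */
    static List<String> initialTargets(String localDatanode,
                                       List<String> knownDatanodes,
                                       int replication) {
        List<String> targets = new ArrayList<>();
        if (localDatanode != null && knownDatanodes.contains(localDatanode)) {
            targets.add(localDatanode);
            return targets;
        }
        for (String dn : knownDatanodes) {
            if (targets.size() >= replication) {
                break;
            }
            targets.add(dn);
        }
        return targets;
    }

    /** Number of replicas still to be created in the background after the initial write. */
    static int pendingReplicas(List<String> initialTargets, int replication) {
        return Math.max(0, replication - initialTargets.size());
    }
}
```

With three known datanodes, a replication factor of 3 and a local datanode, the initial write goes to one node and two replicas remain pending for the background.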

Datanodes Notifications

If a Datanode disconnects from the P2P cloud, a notification is
raised by the peer monitor and sent to the Namenode which will remove it
from the hosts map.
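
The notification flow can be pictured with the following sketch. DatanodeHostsMap is a hypothetical stand-in for the Namenode-side map, keyed by an assumed peer ID string. The real classes and the way the peer monitor delivers its events are not shown here.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the Namenode's datanode hosts map; not Jxtadoop's actual class. */
class DatanodeHostsMap {
    // Peer ID -> datanode name. Concurrent because notifications may arrive
    // from a monitor thread while other threads read the map.
    private final Map<String, String> hosts = new ConcurrentHashMap<>();

    /** Called when the peer monitor reports that a datanode joined the P2P cloud. */
    void onPeerConnected(String peerId, String datanodeName) {
        hosts.put(peerId, datanodeName);
    }

    /** Called when the peer monitor reports a disconnect; returns true if the peer was known. */
    boolean onPeerDisconnected(String peerId) {
        return hosts.remove(peerId) != null;
    }

    /** Sorted snapshot of the peer IDs currently present in the map. */
    Set<String> livePeers() {
        return new TreeSet<>(hosts.keySet());
    }
}
```

Removal is idempotent in this sketch: a second disconnect notification for the same peer simply returns false, which matters when nodes come and go often.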

Full list of modified classes

————————————————————————————————————————————————————

Roadmap

Jxtadoop Admin

2012-01-26

Future work (thoughts)

  1. Re-include the removed components (balancer, secondary namenode, jetty server)
  2. Re-work the code to use the Hadoop 1.0.0 branch

Last edit: Jxtadoop Admin 2012-01-28

——————————————————————————————————————————————————————

Instructions

Instructions to start the Namenode & Datanodes

Namenode

1/ Set the JAVA_HOME environment variable
2/ Unzip the jxtadoop-datanode-x.y.z.zip to the target directory
+ chmod the executables in the bin/ directory
3/ Edit the etc/hdfs-p2p.xml and set the following 2 properties :
hadoop.p2p.rpc.rdv
hadoop.p2p.rpc.relay
Note that these 2 properties are mandatory even when all peers are on the same
multicast network, to avoid issues with multiple namenodes running in the same network.
4/ Initialize the namenode :
> bin/hadoop namenode -format
5/ Start up the namenode
> bin/start-namenode.sh
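
Since both properties are mandatory, a small pre-flight check can save a failed start. The sketch below assumes that etc/hdfs-p2p.xml follows the usual Hadoop configuration layout (configuration / property / name / value). The P2pConfigCheck class is hypothetical, and the tcp:// values are placeholders only: the value format expected by Jxtadoop is not documented here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads Hadoop-style property entries and checks the two mandatory ones. */
class P2pConfigCheck {
    // Example content for etc/hdfs-p2p.xml. The layout is assumed to be the usual
    // Hadoop configuration format; the values are placeholders, not real seeds.
    static final String EXAMPLE =
        "<configuration>"
        + "<property><name>hadoop.p2p.rpc.rdv</name>"
        + "<value>tcp://rdv.example.org:9701</value></property>"
        + "<property><name>hadoop.p2p.rpc.relay</name>"
        + "<value>tcp://relay.example.org:9701</value></property>"
        + "</configuration>";

    /** Parses every property element (each must have a name and a value child) into a map. */
    static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new HashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    /** True only if both mandatory properties are present and non-empty. */
    static boolean hasMandatorySeeds(Map<String, String> props) {
        return !props.getOrDefault("hadoop.p2p.rpc.rdv", "").isEmpty()
            && !props.getOrDefault("hadoop.p2p.rpc.relay", "").isEmpty();
    }
}
```

The same check applies to the datanode configuration, which sets the same two properties.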

Datanode

1/ Set the JAVA_HOME environment variable
2/ Unzip the jxtadoop-datanode-x.y.z.zip to the target directory
+ chmod the executables in the bin/ directory
3/ Edit the etc/hdfs-p2p.xml and set the following 2 properties :
hadoop.p2p.rpc.rdv
hadoop.p2p.rpc.relay
4/ Start up the datanode
> bin/start-datanode.sh

DFSClient

You can use the DFSClient in the same way as with standard Hadoop. For example :

bin/hadoop fs -mkdir /test
bin/hadoop fs -chmod 777 /test
bin/hadoop fs -put ~/tmp/myfile /test
bin/hadoop fs -get /test/myfile /tmp

Contact

Mail to : jxtadoop@besnard.mobi

Known issues

i1/ The JXTA layer may generate P2P exceptions when sockets are closed.
