Nutch+HBase

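While we were worrying about Nutch's architecture, the Nutch developers delivered nutchbase. A few simple tests of mine show that, with minor modifications, it runs on hadoop 0.20.1 and hbase 0.20.2.
Its advantage is obvious: a sound architecture.

This is how the developer describes it, quoted from JIRA: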
http://issues.apache.org/jira/browse/NUTCH-650

A) Why integrate with hbase?

All your data in a central location.
No more segment/crawldb/linkdb merges.
No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
A much simpler data model. If you want to update a small part of a single record, you currently have to write a MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language, may be easier for people to understand and use.

B) Design 
Design is actually rather straightforward.

We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns.
So now most jobs just take the name of the table as input.
There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps RowResult but also keeps a list of updates done to that row. So when getSomething is called, it first checks if Something is already updated (if so then returns the updated version) or returns from RowResult. RowPart can also create a BatchUpdate from its list of updates.
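
The read-through behaviour of RowPart described above can be sketched roughly as follows. This is a minimal illustration in Python, not the nutchbase code: a plain dict stands in for RowResult and for BatchUpdate, and the method names `get`, `put` and `make_batch_update` are assumptions made for the example.

```python
class ImmutableRowPart:
    """Read-only view over a fetched row (a dict stands in for RowResult)."""

    def __init__(self, row_result):
        self._row = dict(row_result)

    def get(self, column):
        return self._row.get(column)


class RowPart(ImmutableRowPart):
    """Adds setters and remembers every pending update to the row."""

    def __init__(self, row_result):
        super().__init__(row_result)
        self._updates = {}  # column -> new value, not yet written back

    def get(self, column):
        # A pending update wins over the value read from the table.
        if column in self._updates:
            return self._updates[column]
        return super().get(column)

    def put(self, column, value):
        self._updates[column] = value

    def make_batch_update(self):
        # Hypothetical stand-in for building a BatchUpdate:
        # only the changed columns are returned.
        return dict(self._updates)


row = RowPart({"status:": "unfetched", "content:": b""})
row.put("status:", "fetched")
```

After the `put`, `row.get("status:")` returns the updated value while `row.get("content:")` still comes from the original row, and only the changed column would be sent back to the table.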
URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
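
To make the key format concrete, here is a small Python sketch that reproduces the example above. It is not the TableUtil implementation; the function names `reverse_url` and `unreverse_url` are made up for illustration, and the sketch assumes plain host names (no IPv6 literals, no user info) and drops URL fragments.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Turn a URL into a row key with the host labels reversed."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"
    key += parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def unreverse_url(key):
    """Rebuild the URL from a key produced by reverse_url."""
    slash = key.find("/")
    head, rest = (key, "") if slash == -1 else (key[:slash], key[slash:])
    fields = head.split(":")  # reversed host, scheme, optional port
    host = ".".join(reversed(fields[0].split(".")))
    port = f":{fields[2]}" if len(fields) > 2 else ""
    return f"{fields[1]}://{host}{port}{rest}"


url = "http://bar.foo.com:8983/to/index.html?a=b"
key = reverse_url(url)
print(key)  # com.foo.bar:http:8983/to/index.html?a=b
```

Because row keys are kept in sorted order, keys that share a reversed prefix such as `com.foo.` end up next to each other, which is the locality the quote refers to.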
CrawlDatum statuses are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.
Jobs:

Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated it marks the URL with a TMP_FETCH_MARK (Marking a url is simply creating a special metadata field.) When FetcherHbase runs, it skips over anything without this special mark.
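
The marking handshake between two jobs might look like the sketch below. It is only an in-memory illustration: the table is a Python dict, the value of `TMP_FETCH_MARK`, the `metadata` field layout and the function names `generate` and `fetch` are all assumptions, and the real jobs run as MapReduce over the hbase table.

```python
TMP_FETCH_MARK = "__tmp_fetch_mrk__"  # hypothetical metadata field name


def generate(table, should_fetch):
    """Generator step: mark every row that should be fetched next."""
    for key, row in table.items():
        if should_fetch(key, row):
            row["metadata"][TMP_FETCH_MARK] = "y"


def fetch(table, fetch_fn):
    """Fetcher step: skip every row that the generator did not mark."""
    fetched = []
    for key, row in table.items():
        if TMP_FETCH_MARK not in row["metadata"]:
            continue
        row["content"] = fetch_fn(key)
        fetched.append(key)
    return fetched


table = {
    "com.foo.bar:http/": {"metadata": {}, "content": None},
    "org.example:http/": {"metadata": {}, "content": None},
}
generate(table, lambda key, row: key.startswith("com."))
fetched = fetch(table, lambda key: b"<html></html>")
```

Only the row marked by `generate` is touched by `fetch`; the other row is skipped without any separate fetch list being written out.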
InjectorHbase: First, a job runs where injected urls are marked. Then in the next job, if a row has the mark but nothing else (here, I assumed that if a row has "status:" column, that it already exists), InjectorHbase initializes the row.
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker. 
FetcherHbase: Very similar to original Fetcher. Marks urls successfully fetched. Skips over URLs not marked by GeneratorHbase.
ParseTable: Similar to original Parser. Outlinks are stored as "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers. 
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.
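
As an illustration of the two-pass InjectorHbase logic listed above, here is a rough in-memory sketch. The mark name `INJECT_MARK`, the initial status value and both function names are hypothetical; only the rule "a row with a `status:` column already exists" comes from the quoted description.

```python
INJECT_MARK = "__inject_mrk__"  # hypothetical marker column


def mark_injected(table, urls):
    """First pass: mark every injected URL, creating empty rows as needed."""
    for url in urls:
        table.setdefault(url, {})[INJECT_MARK] = "y"


def initialize_injected(table, initial_status="unfetched"):
    """Second pass: initialize marked rows that have no "status:" column yet."""
    created = []
    for key, row in table.items():
        if INJECT_MARK in row and "status:" not in row:
            row["status:"] = initial_status
            created.append(key)
    return created


table = {"com.foo.bar:http/": {"status:": "fetched"}}
mark_injected(table, ["com.foo.bar:http/", "org.example:http/"])
created = initialize_injected(table)
```

The already-known row keeps its status, and only the genuinely new URL is initialized.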
