Nutch+HBase
Nutch+HBase
当我们为nutch的架构发愁的时候,nutch的开发人员送来了nutchbase。我一些简单的测试表明,在hadoop0.20.1和hbase0.20.2上,稍加修改可以运行起来。
它的优点很明显:架构合理.
开发者是这样说的,引用自jira
http://issues.apache.org/jira/browse/NUTCH-650
A) Why integrate with hbase?
All your data in a central location
No more segment/crawldb/linkdb merges.
No
more "missing" data in a job. There are a lot of places where we copy
data from one structure to another just so that it is available in a
later job. For example, during parsing we don't have access to a URL's
fetch status. So we copy fetch status into content metadata. This will
no longer be necessary with hbase integration.
A much simpler data
model. If you want to update a small part in a single record, now you
have to write a MR job that reads the relevant directory, change the
single record, remove old directory and rename new directory. With
hbase, you can just update that record. Also, hbase gives us access to
Yahoo! Pig, which I think, with its SQL-ish language may be easier for
people to understand and use.
B) Design
Design is actually rather straightforward.
We
store everything (fetch time, status, content, parsed text, outlinks,
inlinks, etc.) in hbase. I have written a small utility class that
creates "webtable" with necessary columns.
So now most jobs just take the name of the table as input.
There
are two main classes for interfacing with hbase. ImmutableRowPart wraps
around a RowResult and has helper getters (getStatus(), getContent(),
etc.). RowPart is similar to ImmutableRowPart but also has setters. The
idea is that RowPart also wraps RowResult but also keeps a list of
updates done to that row. So when getSomething is called, it first
checks if Something is already updated (if so then returns the updated
version) or returns from RowResult. RowPart can also create a
BatchUpdate from its list of updates.
URLs are stores in reversed
host order. For example, http://bar.foo.com:8983/to/index.html?a=b
becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the
same tld/host/domain are stored closer to each other. TableUtil has
methods for reversing and unreversing URLs.
CrawlDatum Status-es are simplifed. Since everything is in central location now, no point in having a DB and FETCH status.
Jobs:
Each
job marks rows so that the next job knows which rows to read. For
example, if GeneratorHbase decides that a URL should be generated it
marks the URL with a TMP_FETCH_MARK (Marking a url is simply creating a
special metadata field.) When FetcherHbase runs, it skips over anything
without this special mark.
InjectorHbase: First, a job runs where
injected urls are marked. Then in the next job, if a row has the mark
but nothing else (here, I assumed that if a row has "status:" column,
that it already exists), InjectorHbase initializes the row.
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker.
FetcherHbase: Very similar to original Fetcher. Marks urls successfully fetched. Skips over URLs not marked by GeneratorHbase
ParseTable: Similar to original Parser. Outlinks are stored "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers.
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.
Nutch+HBase的更多相关文章
- 【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二:内容分析
请先参见"集成Nutch/Hbase/Solr构建搜索引擎之一:安装及运行",搭建测试环境 http://blog.csdn.net/jediael_lu/article/deta ...
- 【Nutch2.2.1基础教程之2.1】集成Nutch/Hbase/Solr构建搜索引擎之一:安装及运行【单机环境】
1.下载相关软件,并解压 版本号如下: (1)apache-nutch-2.2.1 (2) hbase-0.90.4 (3)solr-4.9.0 并解压至/usr/search 2.Nutch的配置 ...
- nutch,hbase,zookeeper兼容性问题
nutch-2.1使用gora-0.2.1, gora-0.2.1使用hbase-0.90.4,hbase-0.90.4和hadoop-1.1.1不兼容,hbase-0.94.4和gora-0.2.1 ...
- 【Nutch2.3基础教程】集成Nutch/Hadoop/Hbase/Solr构建搜索引擎:安装及运行【集群环境】
1.下载相关软件,并解压 版本号如下: (1)apache-nutch-2.3 (2) hadoop-1.2.1 (3)hbase-0.92.1 (4)solr-4.9.0 并解压至/opt/jedi ...
- 搜索引擎系列 ---lucene简介 创建索引和搜索初步
一.什么是Lucene? Lucene最初是由Doug Cutting开发的,2000年3月,发布第一个版本,是一个全文检索引擎的架构,提供了完整的查询引擎和索引引擎 :Lucene得名于Doug妻子 ...
- lucene简介 创建索引和搜索初步
lucene简介 创建索引和搜索初步 一.什么是Lucene? Lucene最初是由Doug Cutting开发的,2000年3月,发布第一个版本,是一个全文检索引擎的架构,提供了完整的查询引擎和索引 ...
- apache-hadoop-1.2.1、hbase、hive、mahout、nutch、solr安装教程
1 软件环境: VMware8.0 Ubuntu-12.10-desktop-i386 jdk-7u40-linux-i586.tar.gz hadoop-1.2.1.tar.gz eclipse-d ...
- 一个大数据方案:基于Nutch+Hadoop+Hbase+ElasticSearch的网络爬虫及搜索引擎
网络爬虫架构在Nutch+Hadoop之上,是一个典型的分布式离线批量处理架构,有非常优异的吞吐量和抓取性能并提供了大量的配置定制选项.由于网络爬虫只负责网络资源的抓取,所以,需要一个分布式搜索引擎, ...
- 【架构】基于Nutch+Hadoop+Hbase+ElasticSearch的网络爬虫及搜索引擎
网络爬虫架构在Nutch+Hadoop之上,是一个典型的分布式离线批量处理架构,有非常优异的吞吐量和抓取性能并提供了大量的配置定制选项.由于网络爬虫只负责网络资源的抓取,所以,需要一个分布式搜索引擎, ...
随机推荐
- 几种快速傅里叶变换(FFT)的C++实现
链接:http://blog.csdn.net/zwlforever/archive/2008/03/14/2183049.aspx一篇不错的FFT 文章,收藏一下. DFT的的正变换和反变换分别为( ...
- 【 .NET 面向对象程序设计进阶》】【 《.NET 面向对象编程基础》】【《正则表达式助手》】
<.NET 面向对象程序设计进阶> <.NET 面向对象程序设计进阶> <正则表达式助手>
- 如果一个Object对象可能是数组那么如何对其进行迭代
需求:一个方法传入的参数是Object类型(假设对象为“items”,使用Object类型也是为了使用多态而增加方法复用性),但已知这个Object对象可能是基本类型数组,也可能是对象数组,如何将这个 ...
- Android自学绝佳资料
本文转自stormzhang老师的博客:http://stormzhang.com/android/2014/07/07/learn-android-from-rookie 首先感谢stromzhan ...
- HDU 4870 Rating (2014 多校联合第一场 J)(概率)
题意: 一个人有两个TC的账号,一开始两个账号rating都是0,然后每次它会选择里面rating较小的一个账号去打比赛,每次比赛有p的概率+1分,有1-p的概率-2分,当然如果本身是<=2分的 ...
- 如何在SAS中重新构建限价指令簿(Limit Order Book):使用HashTable
在之前的一篇日志里(http://blog.csdn.net/u010501526/article/details/8875446),我将重新构建LOB(Limit Order Book)分为了三步 ...
- 使用visual c++ 2005编译64位可执行文件
最近需要将一个32位的程序移植到64位上,由于原来是使用vs2003写的,vs2003本身并不支持编译64位系统上,只能升级到vs2005以上版本.个人还是比较喜欢vs2005,对c++来说,vs20 ...
- libevent简单分析
一看名字就知道是围绕eventloop转的. 那首先肯定是eventloop是个什么?一般都是IO事件,timer事件的管理器. 那首先看如何new出来一个eventloop: 1.因为libeven ...
- hadoop 磁盘限额配置
配置方法: 在 hdfs-site.xml 里配置如下参数,注意,那个 value 的值是配置该磁盘保留的DFS不能使用的空间大小,单位是字节. (如果多块硬盘,则表示为每块硬盘保留这么多空间) &l ...
- 代码静态分析工具PC-LINT安装配置
代码静态分析工具PC-LINT安装配置--step by step 作者:ehui928 ...