[Nutch 2.2.1 Basic Tutorial, Part 6] The Nutch 2.2.1 Crawl Workflow

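I. Crawl Workflow Overview

1. The Nutch crawl workflow

When a crawl is run with the crawl command, it goes through the following steps:

(1) InjectorJob

First iteration begins

(2) GeneratorJob

(3) FetcherJob

(4) ParserJob

(5) DbUpdaterJob

(6) SolrIndexerJob

Second iteration begins

(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob

Third iteration begins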

……
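
The loop above can be sketched as an ordered list of command lines: one inject, then five jobs per iteration. This is a minimal sketch, not the actual crawl script. The function name `crawl_commands`, its parameters, and the Solr URL are hypothetical; the sub-commands and flags are the ones used in the step-by-step commands later in this post (with `-all` rather than a specific batch id).

```python
from typing import List


def crawl_commands(seed_dir: str, crawl_id: str, solr_url: str,
                   iterations: int) -> List[List[str]]:
    """Return the bin/nutch command lines for a full crawl, in order."""
    nutch = "bin/nutch"
    # Step (1): seed URLs are injected once, before the first iteration.
    commands = [[nutch, "inject", seed_dir, "-crawlId", crawl_id]]
    for _ in range(iterations):
        # Steps (2)-(6) repeat in every iteration.
        commands.append([nutch, "generate", "-crawlId", crawl_id])
        commands.append([nutch, "fetch", "-all", "-crawlId", crawl_id])
        commands.append([nutch, "parse", "-all", "-crawlId", crawl_id])
        commands.append([nutch, "updatedb", "-crawlId", crawl_id])
        commands.append([nutch, "solrindex", solr_url, "-all", "-crawlId", crawl_id])
    return commands


if __name__ == "__main__":
    # The Solr URL is a placeholder, not a value from this post.
    for cmd in crawl_commands("urls", "334", "http://localhost:8983/solr/", 2):
        print(" ".join(cmd))
```

Each inner list can be passed to `subprocess.run` from the Nutch runtime directory; printing them first is a cheap dry run.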

2. Crawl log

When crawling with the crawl command, the console output looks like this:

InjectorJob: starting at 2014-07-08 10:41:27

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 2

Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05

Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5

Generating batchId

Generating a new fetchlist

GeneratorJob: starting at 2014-07-08 10:41:34

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1404787293-26339

Fetching :

FetcherJob: starting

FetcherJob: batchId: 1404787293-26339

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1404798101129

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 2 records. Hit by time limit :0

fetching http://www.csdn.net/ (queue crawl delay=5000ms)

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

fetching http://www.itpub.net/ (queue crawl delay=5000ms)

-finishing thread FetcherThread47, activeThreads=48

-finishing thread FetcherThread46, activeThreads=47

-finishing thread FetcherThread45, activeThreads=46

-finishing thread FetcherThread44, activeThreads=45

-finishing thread FetcherThread43, activeThreads=44

-finishing thread FetcherThread42, activeThreads=43

-finishing thread FetcherThread41, activeThreads=42

-finishing thread FetcherThread40, activeThreads=41

-finishing thread FetcherThread39, activeThreads=40

-finishing thread FetcherThread38, activeThreads=39

-finishing thread FetcherThread37, activeThreads=38

-finishing thread FetcherThread36, activeThreads=37

-finishing thread FetcherThread35, activeThreads=36

-finishing thread FetcherThread34, activeThreads=35

-finishing thread FetcherThread33, activeThreads=34

-finishing thread FetcherThread32, activeThreads=33

-finishing thread FetcherThread31, activeThreads=32

-finishing thread FetcherThread30, activeThreads=31

-finishing thread FetcherThread29, activeThreads=30

-finishing thread FetcherThread48, activeThreads=29

-finishing thread FetcherThread27, activeThreads=29

-finishing thread FetcherThread26, activeThreads=28

-finishing thread FetcherThread25, activeThreads=27

-finishing thread FetcherThread24, activeThreads=26

-finishing thread FetcherThread23, activeThreads=25

-finishing thread FetcherThread22, activeThreads=24

-finishing thread FetcherThread21, activeThreads=23

-finishing thread FetcherThread20, activeThreads=22

-finishing thread FetcherThread19, activeThreads=21

-finishing thread FetcherThread18, activeThreads=20

-finishing thread FetcherThread17, activeThreads=19

-finishing thread FetcherThread16, activeThreads=18

-finishing thread FetcherThread15, activeThreads=17

-finishing thread FetcherThread14, activeThreads=16

-finishing thread FetcherThread13, activeThreads=15

-finishing thread FetcherThread12, activeThreads=14

-finishing thread FetcherThread11, activeThreads=13

-finishing thread FetcherThread10, activeThreads=12

-finishing thread FetcherThread9, activeThreads=11

-finishing thread FetcherThread8, activeThreads=10

-finishing thread FetcherThread7, activeThreads=9

-finishing thread FetcherThread5, activeThreads=8

-finishing thread FetcherThread4, activeThreads=7

-finishing thread FetcherThread3, activeThreads=6

-finishing thread FetcherThread2, activeThreads=5

-finishing thread FetcherThread49, activeThreads=4

-finishing thread FetcherThread6, activeThreads=3

-finishing thread FetcherThread28, activeThreads=2

-finishing thread FetcherThread0, activeThreads=1

fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null

-finishing thread FetcherThread1, activeThreads=0

0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

Parsing :

ParserJob: starting

ParserJob: resuming:    false

ParserJob: forced reparse:      false

ParserJob: batchId:     1404787293-26339

Parsing http://www.csdn.net/

http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561

Parsing http://www.itpub.net/

ParserJob: success

CrawlDB update for csdnitpub

DbUpdaterJob: starting

DbUpdaterJob: done

Indexing csdnitpub on SOLR index -> http://ip:8983/solr/

SolrIndexerJob: starting

SolrIndexerJob: done.

SOLR dedup -> http://ip:8983/solr/

Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5

Generating batchId

Generating a new fetchlist

GeneratorJob: starting at 2014-07-08 10:42:19

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1404787338-30453

Fetching :

FetcherJob: starting

FetcherJob: batchId: 1404787338-30453

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1404798146676

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0
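
Two numbers in this log look like Unix timestamps. As a hedged reading of the output (the log itself does not say so), the numeric prefix of the batch id appears to be the generate time in epoch seconds, and the FetcherJob timelimit appears to be in epoch milliseconds. The sketch below converts both to CST (UTC+8), the zone the log uses. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time, as printed in the log


def batch_id_time(batch_id: str) -> datetime:
    """Interpret the part before '-' as epoch seconds (an assumption)."""
    seconds = int(batch_id.split("-", 1)[0])
    return datetime.fromtimestamp(seconds, tz=CST)


def timelimit_time(millis: int) -> datetime:
    """Interpret the FetcherJob timelimit as epoch milliseconds (an assumption)."""
    return datetime.fromtimestamp(millis / 1000, tz=CST)


if __name__ == "__main__":
    print(batch_id_time("1404787293-26339"))  # batch id of the first iteration
    print(timelimit_time(1404798101129))      # timelimit of the first iteration
```

For the first iteration the batch id decodes to 2014-07-08 10:41:33, the same second as the "Iteration 1 of 5" line, and the time limit falls at about 13:41, roughly three hours after the fetch started.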

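II. Crawling Step by Step with Individual Commands

1. InjectorJob

This step initializes the crawl by injecting the URLs listed in seed.txt into the crawl queue.

(1) Basic command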

$ bin/nutch inject

Usage: InjectorJob <url_dir> [-crawlId <id>]

$ bin/nutch inject urls

InjectorJob: starting at 2014-12-20 22:32:01

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 1

Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14

Here urls/seed.txt contains:

http://stackoverflow.com/

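(2) Inspecting the injected URL

The step above creates a table in HBase named <crawlId>_webpage and writes the URL's data to it. The scan below comes from a run with crawlId 334, so the table is 334_webpage.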

hbase(main):002:0> scan '334_webpage'

ROW                              COLUMN+CELL                                                                               

 com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00                                  

 com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D                     

 com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y                                       

 com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0                                           

 com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00                            

 com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00                                   

1 row(s) in 0.3020 seconds
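
The cell values in the scan are binary and the HBase shell prints them with `\xNN` escapes, so they are hard to read directly. The sketch below turns such a printed value back into bytes and decodes it. It assumes the default Gora HBase mapping, where f:fi is the fetch interval (4-byte int), f:ts the fetch time (8-byte long), and s:s the score (4-byte float), all big-endian. The function names are hypothetical.

```python
import struct


def hbase_shell_to_bytes(text: str) -> bytes:
    """Convert a value as printed by the HBase shell back to raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\x", i) and i + 4 <= len(text):
            out.append(int(text[i + 2:i + 4], 16))  # escaped byte such as \x8D
            i += 4
        else:
            out.append(ord(text[i]))  # printable byte shown as itself
            i += 1
    return bytes(out)


def as_int(text: str) -> int:
    return struct.unpack(">i", hbase_shell_to_bytes(text))[0]  # 4-byte big-endian int


def as_long(text: str) -> int:
    return struct.unpack(">q", hbase_shell_to_bytes(text))[0]  # 8-byte big-endian long


def as_float(text: str) -> float:
    return struct.unpack(">f", hbase_shell_to_bytes(text))[0]  # 4-byte big-endian float


if __name__ == "__main__":
    print("f:fi", as_int(r"\x00'\x8D\x00"))                # fetch interval, seconds
    print("f:ts", as_long(r"\x00\x00\x01H\x0C&\x11\x8D"))  # fetch time, epoch ms
    print("s:s ", as_float(r"?\x80\x00\x00"))              # score
```

With the values from the scan, f:fi decodes to 2592000 seconds (30 days), f:ts to 1408953094541 ms, a few seconds before the cell timestamp 1408953100271, and s:s to 1.0.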

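(3) About the *_webpage table

Each crawl job gets its own table named <crawlId>_webpage, which stores information about every URL, whether fetched or not.

A URL that has not been fetched yet has a sparse row. Once it has been fetched, the fetched data, such as the page content, is stored in the same row.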
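
The row keys in this table are not the raw URLs: the host is reversed and the scheme follows it, so http://stackoverflow.com/ becomes com.stackoverflow:http/. Because HBase sorts rows by key, this keeps pages of the same domain next to each other. The sketch below reproduces the pattern seen in the scan output. It is a simplified illustration with a hypothetical function name, it assumes an absolute URL, and it ignores explicit port numbers.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a reversed-host row key like the ones shown in the scan output."""
    parts = urlsplit(url)
    # "stackoverflow.com" -> "com.stackoverflow"
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}{parts.path}"
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    print(row_key("http://stackoverflow.com/"))
```

The reversed key can be used directly in the HBase shell, for example in a `get` on the 334_webpage table, to look at a single page instead of scanning the whole table.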

2. GeneratorJob

(1) Basic command

[jediael@jediael local]$  bin/nutch generate -crawlId 334

GeneratorJob: starting at 2014-08-25 15:57:12

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: true

GeneratorJob: normalizing: true

GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06

GeneratorJob: generated batch id: 1408953432-1171377744

(2) Command options

[root@jediael local]# bin/nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]

 -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE 

   -crawlId <id>  - the id to prefix the schemas to operate on, default: storage.crawl.id)"); 

   -noFilter      - do not activate the filter plugin to filter the url, default is true 

    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true 

    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.

    -batchId       - the batch id 

----------------------

Please set the params.

(3) Inspecting the database

hbase(main):003:0> scan '334_webpage' 

ROW                              COLUMN+CELL                                                                                

 com.stackoverflow:http/         column=f:bid, timestamp=1408953437910, value=1408953432-1171377744                         

 com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00                                  

 com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D                     

 com.stackoverflow:http/         column=mk:_gnmrk_, timestamp=1408953437910, value=1408953432-1171377744                    

 com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y                                       

 com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0                                           

 com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00                            

 com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00                                   

1 row(s) in 0.0490 seconds

This step adds two columns, f:bid and mk:_gnmrk_, both holding the generated batch id.
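
The batch id printed by GeneratorJob is the same value written to f:bid and mk:_gnmrk_, and the crawl log earlier shows FetcherJob and ParserJob being run with that batch id instead of -all. When scripting the steps, the id can be pulled out of the generate output. A minimal sketch with a hypothetical helper name, relying only on the "generated batch id" line shown above:

```python
import re
from typing import Optional

# Matches the final line of the GeneratorJob output shown above.
BATCH_ID_LINE = re.compile(r"^GeneratorJob: generated batch id: (\S+)\s*$", re.MULTILINE)


def extract_batch_id(generator_output: str) -> Optional[str]:
    """Return the batch id from GeneratorJob console output, or None if absent."""
    match = BATCH_ID_LINE.search(generator_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06\n"
        "GeneratorJob: generated batch id: 1408953432-1171377744\n"
    )
    print(extract_batch_id(sample))
```

Returning None rather than raising lets the caller decide what to do when generate produced no batch.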

3. FetcherJob

(1) Basic command

[jediael@jediael local]$  bin/nutch fetch -all -crawlId 334

FetcherJob: starting

FetcherJob: fetching all

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 10

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : -1

Using queue mode : byHost

Fetcher: threads: 10

QueueFeeder finished: total 1 records. Hit by time limit :0

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

fetching http://stackoverflow.com/ (queue crawl delay=5000ms)

-finishing thread FetcherThread1, activeThreads=8

-finishing thread FetcherThread7, activeThreads=7

-finishing thread FetcherThread6, activeThreads=6

-finishing thread FetcherThread5, activeThreads=5

-finishing thread FetcherThread4, activeThreads=4

-finishing thread FetcherThread3, activeThreads=3

-finishing thread FetcherThread2, activeThreads=2

-finishing thread FetcherThread8, activeThreads=1

-finishing thread FetcherThread9, activeThreads=1

-finishing thread FetcherThread0, activeThreads=0

0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 102 102 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

(2) Inspecting the database

See db1.txt.

This step adds or updates the following columns: f:bas, f:cnt, f:prot, f:pts, f:st, f:ts, f:typ, h:Cache-Control, h:Connection, h:Content-Encoding, h:Content-Length, h:Content-Type, h:Date, h:Expires, h:Last-Modified, h:Set-Cookie, h:Vary, h:X-Frame-Options, and mk:_ftcmrk_.

4. ParserJob

(1) Basic command

[jediael@jediael local]$ bin/nutch parse  -all -crawlId 334

ParserJob: starting

ParserJob: resuming:    false

ParserJob: forced reparse:      false

ParserJob: parsing all

Parsing http://stackoverflow.com/

ParserJob: success

(2) Command arguments

[root@jediael local]# bin/nutch parse 

Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]

    <batchId>     - symbolic batch ID created by Generator

    -crawlId <id> - the id to prefix the schemas to operate on,

                    (default: storage.crawl.id)

    -all          - consider pages from all crawl jobs

    -resume       - resume a previous incomplete job

    -force        - force re-parsing even if a page is already parsed

(3) Inspecting the database

See db_parse.txt.

This step adds many outlink columns of the form column=ol:http://stackoverflow.com/help, 115 of them in this example.
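
A count like this can be checked against a saved scan dump. The sketch below simply counts cells whose column starts with ol:. It assumes the dump is the plain text output of an HBase shell scan, such as the db_parse.txt file mentioned above; the function name is hypothetical.

```python
import sys


def count_outlink_cells(scan_output: str) -> int:
    """Count cells in the ol: (outlink) column family of an HBase shell scan dump."""
    # Every cell is printed with a "column=<family>:<qualifier>" prefix,
    # so counting the "column=ol:" prefix counts outlink cells.
    return scan_output.count("column=ol:")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_ol.py db_parse.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        print(count_outlink_cells(dump.read()))
```

Counting the prefix rather than parsing whole lines keeps the count correct even when the shell wraps a long outlink URL onto the next line.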

5. DbUpdaterJob

(1) Basic command

[jediael@jediael local]$ bin/nutch updatedb -crawlId 334

DbUpdaterJob: starting

DbUpdaterJob: done

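(2) Inspecting the database

See db_updatedb.txt.

This step processes the 115 ol: outlink columns above and creates 115 new rows, one per outlink. Two of them are shown below (the HBase shell wraps the long row keys onto a second line):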

com.stackoverflow:http/users/39 column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00                                  

 44974/silviu-oncioiu                                                                                                       

 com.stackoverflow:http/users/39 column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01                               

 44974/silviu-oncioiu                                                                                                       

 com.stackoverflow:http/users/39 column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09                     

 44974/silviu-oncioiu                                                                                                       

 com.stackoverflow:http/users/39 column=mk:dist, timestamp=1408954979355, value=1                                           

 44974/silviu-oncioiu                                                                                                       

 com.stackoverflow:http/users/39 column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5                                  

 44974/silviu-oncioiu                                                                                                       

 com.stackoverflow:http/users/39 column=s:s, timestamp=1408954979355, value=<\x0Ex5                                         

 44974/silviu-oncioiu                                                                                                       

 com.stackoverflow:http/users/39 column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00                                  

 74525/laosi                                                                                                                

 com.stackoverflow:http/users/39 column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01                               

 74525/laosi                                                                                                                

 com.stackoverflow:http/users/39 column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09                     

 74525/laosi                                                                                                                

 com.stackoverflow:http/users/39 column=mk:dist, timestamp=1408954979355, value=1                                           

 74525/laosi                                                                                                                

 com.stackoverflow:http/users/39 column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5                                  

 74525/laosi                                                                                                                

 com.stackoverflow:http/users/39 column=s:s, timestamp=1408954979355, value=<\x0Ex5                                         

 74525/laosi 

The data is now ready for the next crawl round.
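
The binary values of the new rows can be decoded for a quick sanity check. This is a sketch under assumptions: s:s is taken to be a 4-byte big-endian float and f:st a 4-byte big-endian int. The byte strings are the values printed by the HBase shell above.

```python
import struct

# s:s (and mtdt:_csh_) of the new rows, printed by the HBase shell as <\x0Ex5
raw_score = b"<\x0ex5"
# f:st of the new rows, printed as \x00\x00\x00\x01
raw_status = b"\x00\x00\x00\x01"

score = struct.unpack(">f", raw_score)[0]    # 4-byte big-endian float
status = struct.unpack(">i", raw_status)[0]  # 4-byte big-endian int

print("score:", score, "1/115:", 1 / 115)
print("status:", status)
```

The score comes out at about 0.0087, very close to 1/115, which is consistent with the seed page's score of 1.0 being shared among its 115 outlinks. The status decodes to 1, presumably the code for a page that has not been fetched yet. Both readings are inferences from the numbers, not something the output states.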

6. SolrIndexerJob

(1) Basic command

[jediael@jediael local]$  bin/nutch solrindex http://****/solr/  -all -crawlId 334

SolrIndexerJob: starting

Adding 1 documents

SolrIndexerJob: done.

(2) Command arguments

[root@jediael local]# bin/nutch solrindex 

Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]

(3) Inspecting the database

No change.
