Nutch 2.x: A Demo of Crawling Your First Website
http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/
The walkthrough below is based on Nutch 2.2.1, the current release, which I compiled and configured myself.
After the build, the bin directory contains two scripts, nutch and crawl. Run either one on the command line without arguments to see its usage:
$ nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
elasticindex run the elasticsearch indexer
solrindex run the solr indexer on parsed batches
solrdedup remove duplicates from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
$ crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
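In Nutch 2.x the commands that make up the crawl cycle have been consolidated into the crawl script: a single crawl invocation runs the whole cycle, so you no longer have to run inject, generate, fetch, parse and so on one after another as in older versions. Beginners can still run the commands individually and watch closely how each step changes the stored data. The rest of this post walks through the process in detail, using my own blog as the site to crawl: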
[Preparation]: create the URL to crawl
- First start HBase (this post uses standalone mode)
- mkdir -p urls
- cd urls
- touch seed.txt
- echo 'http://micmiu.com' > seed.txt
After each step below you can inspect how the data in HBase has changed.
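The preparation steps can also be wrapped in a small helper. This is only a sketch: make_seed_dir is a name made up for this example. It writes seed.txt without changing into the seed directory, so a later nutch inject urls can be run from the same working directory.

```shell
# Sketch: create a seed directory holding one URL per line in seed.txt.
# make_seed_dir is a hypothetical helper, not part of Nutch.
make_seed_dir() {
  seed_dir="$1"
  shift
  # At least one URL is required.
  [ "$#" -ge 1 ] || return 1
  mkdir -p "$seed_dir" || return 1
  # Overwrites any existing seed.txt.
  printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

Calling make_seed_dir urls http://micmiu.com leaves the same urls/seed.txt as the manual steps above.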
[Step 1]: inject
$ nutch inject urls -crawlId micmiublog
InjectorJob: starting at 2015-01-12 09:42:46
InjectorJob: Injecting urlDir: urls
2015-01-12 09:42:47.096 java[14509:4735452] Unable to load realm info from SCDynamicStore
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Inspect the data in HBase:
hbase(main):016:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.1010 seconds
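The row key com.micmiu:http/ in the scan is the seed URL with its host name reversed. Below is a rough sketch of that mapping, inferred from the keys shown in this post rather than taken from Nutch's source. reverse_url is a hypothetical helper; it assumes the URL has already been normalized (hence the trailing /) and does not handle explicit ports.

```shell
# Sketch: derive the row key pattern seen in the scans, e.g.
# http://micmiu.com -> com.micmiu:http/
# The body runs in a subshell ( ... ) so IFS and variables stay local.
reverse_url() (
  url="$1"
  scheme="${url%%://*}"          # part before "://", e.g. http
  rest="${url#*://}"             # host plus optional path
  case "$rest" in
    */*) host="${rest%%/*}"; path="/${rest#*/}" ;;
    *)   host="$rest";       path="/" ;;
  esac
  # Reverse the dot-separated host labels: www.micmiu.com -> com.micmiu.www
  reversed=""
  set -f                         # no globbing while splitting the host
  IFS="."
  for label in $host; do
    reversed="$label${reversed:+.$reversed}"
  done
  printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
)
```

With this sketch, reverse_url http://micmiu.com prints com.micmiu:http/. One effect of this key layout is that pages from the same domain sort next to each other in the table.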
[Step 2]: generate
$ nutch generate -topN 5 -crawlId micmiublog
GeneratorJob: starting at 2015-01-12 09:47:09
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 5
2015-01-12 09:47:09.822 java[14533:4744993] Unable to load realm info from SCDynamicStore
GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1421027229-1374349927
Inspect the data in HBase:
hbase(main):018:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.0580 seconds
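Several values in the scans are raw bytes, which the HBase shell prints as a mix of \xNN escapes and literal ASCII characters. The sketch below turns such a string into a big-endian integer so the numbers become readable. hbase_bytes_to_int is a name invented for this example. Reading f:fi as seconds and f:ts as milliseconds since the epoch is my interpretation of the data, not something the tools print.

```shell
# Sketch: decode an HBase-shell byte string (e.g. \x00'\x8D\x00) into
# a big-endian unsigned integer. Runs in a subshell to keep variables local.
hbase_bytes_to_int() (
  s="$1"
  n=0
  while [ -n "$s" ]; do
    case "$s" in
      \\x[0-9A-Fa-f][0-9A-Fa-f]*)
        hex="${s#??}"              # drop the leading \x
        hex="${hex%"${hex#??}"}"   # keep only the two hex digits
        byte=$((0x$hex))
        s="${s#????}"
        ;;
      *)
        char="${s%"${s#?}"}"       # first character, taken literally
        byte=$(printf '%d' "'$char")
        s="${s#?}"
        ;;
    esac
    n=$(( n * 256 + byte ))
  done
  printf '%s\n' "$n"
)
```

For the values above, hbase_bytes_to_int "\x00'\x8D\x00" gives 2592000, which read as seconds is 30 days. hbase_bytes_to_int '\x00\x00\x01J\xDB\xCE\xBC\xF2' gives 1421026966770; read as milliseconds that is 2015-01-12 01:42:46 UTC, matching the InjectorJob start time of 09:42:46 at UTC+8. The s:s value ?\x80\x00\x00 decodes to 1065353216 (0x3F800000), the bit pattern of the 32-bit float 1.0.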
[Step 3]: fetch
Note: the batch id printed in the GeneratorJob log of the previous step is passed as the batchId argument of the command below.
It can also be looked up in HBase:
hbase(main):025:0> get 'micmiublog_webpage','com.micmiu:http/',{COLUMNS => 'f:bid'}
COLUMN CELL
f:bid timestamp=1421027232815, value=1421027229-1374349927
1 row(s) in 0.0060 seconds
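Instead of copying the batch id by hand, it can be pulled out of the generate output. This is a minimal sketch: extract_batch_id is a hypothetical helper, and it assumes the log line has exactly the form shown in the generate step above.

```shell
# Sketch: read generate output on stdin and print the batch id from the line
# "GeneratorJob: generated batch id: <id>" (last match wins).
extract_batch_id() {
  sed -n 's/^GeneratorJob: generated batch id: *//p' | tail -n 1
}
```

A possible use is batch_id=$(nutch generate -topN 5 -crawlId micmiublog 2>&1 | extract_batch_id). The 2>&1 is there only in case your build writes the line to stderr.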
Now run the fetch command:
$ nutch fetch 1421027229-1374349927 -crawlId micmiublog -threads 10
FetcherJob: starting
FetcherJob: batchId: 1421027229-1374349927
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
2015-01-12 09:49:37.095 java[14546:4753667] Unable to load realm info from SCDynamicStore
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://micmiu.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Inspect the data in HBase:
hbase(main):019:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%
com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html
com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close
com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip
com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20
com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache
com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed
com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie
com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
com.micmiu:http/ column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.0980 seconds
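In the scan above, f:st holds \x00\x00\x00\x05 (the integer 5) and the response carries a Location header pointing at http://www.micmiu.com/, which suggests the fetch ended in a redirect rather than a page body. The lookup below is a sketch based on the status constants of Nutch 2.x's CrawlStatus class as far as I recall them. Treat the numbers as assumptions and check them against your Nutch version. crawl_status_name is a hypothetical helper.

```shell
# Sketch: map the integer stored in f:st to a readable name.
# The values are assumptions recalled from Nutch 2.x's CrawlStatus class;
# only 5 and 1 actually appear in this post's scans.
crawl_status_name() {
  case "$1" in
    1)  echo unfetched ;;
    2)  echo fetched ;;
    3)  echo gone ;;
    4)  echo redir_temp ;;
    5)  echo redir_perm ;;
    34) echo retry ;;
    38) echo notmodified ;;
    *)  echo "unknown($1)" ;;
  esac
}
```

Under these assumptions crawl_status_name 5 prints redir_perm. That would fit the parse step skipping this page and the com.micmiu.www:http/ row (with f:st = 1) that shows up after updatedb.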
[Step 4]: parse
$ nutch parse 1421027229-1374349927 -crawlId micmiublog
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1421027229-1374349927
2015-01-12 09:50:03.525 java[14559:4756783] Unable to load realm info from SCDynamicStore
Parsing http://micmiu.com/
http://micmiu.com/ skipped. Content of size 20 was truncated to 0
ParserJob: success
Inspect the data in HBase:
hbase(main):020:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%
com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html
com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close
com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip
com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20
com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache
com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed
com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie
com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
com.micmiu:http/ column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927
com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
1 row(s) in 0.0690 seconds
[Step 5]: updatedb
$ nutch updatedb -crawlId micmiublog
DbUpdaterJob: starting
2015-01-12 09:50:47.662 java[14572:4762452] Unable to load realm info from SCDynamicStore
DbUpdaterJob: done
Inspect the data in HBase:
hbase(main):021:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL
com.micmiu.www:http/ column=f:fi, timestamp=1421027452042, value=\x00'\x8D\x00
com.micmiu.www:http/ column=f:st, timestamp=1421027452042, value=\x00\x00\x00\x01
com.micmiu.www:http/ column=f:ts, timestamp=1421027452042, value=\x00\x00\x01J\xDB\xD6$f
com.micmiu.www:http/ column=mk:dist, timestamp=1421027452042, value=1
com.micmiu.www:http/ column=mtdt:_csh_, timestamp=1421027452042, value=?\x80\x00\x00
com.micmiu.www:http/ column=s:s, timestamp=1421027452042, value=?\x80\x00\x00
com.micmiu:http/ column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/ column=f:cnt, timestamp=1421027385487, value=
com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
com.micmiu:http/ column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
com.micmiu:http/ column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
com.micmiu:http/ column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/ column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
com.micmiu:http/ column=f:ts, timestamp=1421027452042, value=\x00\x00\x01KvS\xDF%
com.micmiu:http/ column=f:typ, timestamp=1421027385487, value=text/html
com.micmiu:http/ column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
com.micmiu:http/ column=h:Connection, timestamp=1421027385487, value=close
com.micmiu:http/ column=h:Content-Encoding, timestamp=1421027385487, value=gzip
com.micmiu:http/ column=h:Content-Length, timestamp=1421027385487, value=20
com.micmiu:http/ column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
com.micmiu:http/ column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
com.micmiu:http/ column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
com.micmiu:http/ column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
com.micmiu:http/ column=h:Pragma, timestamp=1421027385487, value=no-cache
com.micmiu:http/ column=h:Server, timestamp=1421027385487, value=LiteSpeed
com.micmiu:http/ column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
com.micmiu:http/ column=h:Vary, timestamp=1421027385487, value=Cookie
com.micmiu:http/ column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
com.micmiu:http/ column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0
com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
com.micmiu:http/ column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
2 row(s) in 0.1140 seconds
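Putting the five steps together: the sketch below runs one round with exactly the commands and options used in this post. run_round and the NUTCH variable are names invented here. The crawl script shipped with Nutch takes a Solr URL and a number of rounds as well, so this is an illustration of the sequence, not a replacement for it.

```shell
# Sketch: one inject -> generate -> fetch -> parse -> updatedb round.
# NUTCH and run_round are hypothetical names; the options are the ones
# used in the steps above.
NUTCH="${NUTCH:-nutch}"   # override if the nutch script is not on PATH

run_round() {
  seed_dir="$1"
  crawl_id="$2"

  "$NUTCH" inject "$seed_dir" -crawlId "$crawl_id" || return 1

  # Keep the generate log so the batch id can be read back from it.
  gen_log=$("$NUTCH" generate -topN 5 -crawlId "$crawl_id" 2>&1) || return 1
  printf '%s\n' "$gen_log"
  batch_id=$(printf '%s\n' "$gen_log" |
    sed -n 's/^GeneratorJob: generated batch id: *//p' | tail -n 1)
  if [ -z "$batch_id" ]; then
    echo "no batch id found in generate output" >&2
    return 1
  fi

  "$NUTCH" fetch "$batch_id" -crawlId "$crawl_id" -threads 10 || return 1
  "$NUTCH" parse "$batch_id" -crawlId "$crawl_id" || return 1
  "$NUTCH" updatedb -crawlId "$crawl_id"
}
```

For the crawl in this post the call would be run_round urls micmiublog, issued from the directory that contains urls.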
----- EOF @Michael Sun -----
Original article. When reposting, please credit: reposted from micmiu - 软件开发+生活点滴 [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/