Running ant runtime on the Nutch source creates a runtime directory containing two subdirectories, deploy and local.

[jediael@jediael runtime]$ ls

deploy  local

These two directories correspond to Nutch's two run modes: deploy mode and local mode.

1. How the bin/nutch script launches each mode

if $local; then
  # fix for the external Xerces lib issue with SAXParserFactory
  NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl $NUTCH_OPTS"
  EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
  # check that hadoop can be found on the path
  if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
  fi
  # distributed mode
  EXEC_CALL="hadoop jar $NUTCH_JOB"
fi

# run it
exec $EXEC_CALL $CLASS "$@"

2. Commands run from the deploy directory execute in deploy mode; commands run from the local directory execute in local mode.
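As a rough illustration of point 2, the helper below guesses the mode from the contents of a directory. It assumes, which the excerpt above does not show, that the launcher treats a directory holding a *nutch*.job file (such as apache-nutch-2.2.1.job under deploy) as deploy mode and anything else as local mode. detect_mode is a made-up name, not part of Nutch.

```shell
# Sketch only: guess the run mode of a Nutch runtime directory.
# Assumption: a *nutch*.job file in the directory means deploy mode.
detect_mode() {
  if ls "$1"/*nutch*.job >/dev/null 2>&1; then
    echo deploy
  else
    echo local
  fi
}

# Report the mode of the current directory.
detect_mode .
```

Run from runtime/deploy this should report deploy, and from runtime/local it should report local, provided the assumption about the job file holds.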

The examples below use inject to demonstrate both modes.

I. Local mode

1. Basic usage:

$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

Usage 1: no crawl ID specified

liaoliuqingdeMacBook-Air:local liaoliuqing$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14

Usage 2: crawl ID specified

$ bin/nutch inject urls -crawlId 2
InjectorJob: starting at 2014-12-20 22:34:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:34:15, elapsed: 00:00:14

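2. Changes in the database

The commands above create a table in HBase named ${id}_webpage, or webpage if no ID is given.

The contents of the files in the urls directory are then written to the table as the crawl seeds.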

hbase(main):003:0> scan 'webpage'
ROW                  COLUMN+CELL
 com.163.www:http/   column=f:fi, timestamp=1419085934952, value=\x00'\x8D\x00
 com.163.www:http/   column=f:ts, timestamp=1419085934952, value=\x00\x00\x01Jh\x1C\xBC7
 com.163.www:http/   column=mk:_injmrk_, timestamp=1419085934952, value=y
 com.163.www:http/   column=mk:dist, timestamp=1419085934952, value=0
 com.163.www:http/   column=mtdt:_csh_, timestamp=1419085934952, value=?\x80\x00\x00
 com.163.www:http/   column=s:s, timestamp=1419085934952, value=?\x80\x00\x00
1 row(s) in 0.6140 seconds
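The naming visible above can be sketched in a few lines of shell. webpage_table follows the table-name rule stated earlier, and reverse_url reproduces the row-key layout seen in the scan: host labels reversed, then the protocol, an optional port, and the path. Both function names are made up for illustration, and the port handling is an assumption, since the scan only shows a URL without a port.

```shell
# Sketch only: table name for a given crawl ID (empty ID -> webpage).
webpage_table() {
  if [ -n "${1:-}" ]; then
    echo "${1}_webpage"
  else
    echo "webpage"
  fi
}

# Sketch only: build a row key of the form seen in the scan output,
# e.g. http://www.163.com/ -> com.163.www:http/
# The body runs in a subshell so the IFS change stays local.
reverse_url() (
  url=$1
  proto=${url%%://*}          # text before ://
  rest=${url#*://}            # host[:port][/path]
  hostport=${rest%%/*}
  case $rest in
    */*) path=/${rest#*/} ;;
    *)   path=/ ;;
  esac
  host=${hostport%%:*}
  case $hostport in
    *:*) port=:${hostport##*:} ;;   # assumed layout when a port is present
    *)   port= ;;
  esac
  reversed=
  IFS=.
  for label in $host; do      # prepend each label to reverse the order
    reversed=$label${reversed:+.$reversed}
  done
  printf '%s:%s%s%s\n' "$reversed" "$proto" "$port" "$path"
)

webpage_table TestCrawl            # prints TestCrawl_webpage
reverse_url http://www.163.com/    # prints com.163.www:http/
```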

Running inject again adds any new URLs to the table.

3. Other commands

where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

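Each step of a complete crawl can be run individually, and together the steps form the overall workflow.

When a crawl is run with the crawl command, the basic sequence of steps is: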

(1)InjectorJob

First iteration begins

(2)GeneratorJob

(3)FetcherJob

(4)ParserJob

(5)DbUpdaterJob

(6)SolrIndexerJob

Second iteration begins

(2)GeneratorJob

(3)FetcherJob

(4)ParserJob

(5)DbUpdaterJob

(6)SolrIndexerJob

Third iteration begins

For details on each step, see http://blog.csdn.net/jediael_lu/article/details/38591067
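The sequence above can be written down as a dry run. The sketch below only prints the order of steps for a given number of rounds, using command names from the listing in this section (inject, generate, fetch, parse, updatedb, solrindex). It does not call Nutch, and crawl_plan is a hypothetical helper name. Pairing each command with the job in the same position of the list is an assumption based on the names.

```shell
# Sketch only: print the step order of a crawl with the given number of rounds.
# Nothing is executed; the output is a plan, not a crawl.
crawl_plan() (
  rounds=$1
  echo "inject"                       # (1) InjectorJob runs once, before the rounds
  i=1
  while [ "$i" -le "$rounds" ]; do
    # (2)-(6): one pass through the per-round steps
    for step in generate fetch parse updatedb solrindex; do
      echo "round $i: $step"
    done
    i=$((i + 1))
  done
)

crawl_plan 2
```

Replacing each echo with the matching bin/nutch call would turn the plan into an actual step-by-step crawl, though the arguments each command needs are not covered here.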

4. Nutch provides a crawl script that wraps the key steps, so the crawl workflow does not have to be run step by step.

[jediael@jediael local]$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

For example:

[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2
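Since the script expects exactly four positional arguments, a small wrapper can catch mistakes before anything is launched. This is only a sketch: crawl_cmd is a hypothetical helper that validates the arguments and prints the command line rather than running it, and the integer check on numberOfRounds is an assumption about what the script accepts.

```shell
# Sketch only: validate the four crawl arguments and print the command line.
crawl_cmd() {
  if [ "$#" -ne 4 ]; then
    echo "usage: crawl_cmd <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
    return 1
  fi
  case $4 in
    ''|*[!0-9]*)
      # Assumption: the number of rounds must be a plain non-negative integer.
      echo "numberOfRounds must be an integer: $4" >&2
      return 1
      ;;
  esac
  echo "bin/crawl $1 $2 $3 $4"
}

crawl_cmd seed.txt TestCrawl http://localhost:8983/solr 2
```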

II. Deploy mode

1. Running with the hadoop command

Note: Hadoop and HBase must be started first.
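One way to check this precondition is to look at the JVM process list. The sketch below reads jps-style output on standard input and prints the daemons it did not find. The names NameNode, JobTracker and HMaster are assumptions for a Hadoop 1.x plus HBase setup like the one in this article; adjust the list for other layouts. missing_daemons is a made-up helper name.

```shell
# Sketch only: report which expected daemons are absent from jps-style output.
# Assumption: NameNode, JobTracker and HMaster are the processes that matter.
missing_daemons() (
  jps_output=$(cat)
  missing=
  for name in NameNode JobTracker HMaster; do
    # -w matches whole words, so SecondaryNameNode does not count as NameNode
    if ! printf '%s\n' "$jps_output" | grep -qw "$name"; then
      missing="$missing $name"
    fi
  done
  echo "${missing# }"
)
```

A typical call would be jps | missing_daemons; an empty line means all three names were found.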

[jediael@jediael deploy]$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: starting at 2014-12-20 23:26:50
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: file:/opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:host.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_51
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_51/jre
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/jediael/hadoop-1.2.1/libexec/../conf:/usr/java/jdk1.7.0_51/lib/tools.jar:/opt/jediael/hadoop-1.2.1/libexec/..:/opt/jediael/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/j
asper-compiler-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/jediael/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-431.17.1.el6.x86_64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/jediael/apache-nutch-2.2.1/runtime/deploy
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a5c24c9cf0657, negotiated timeout = 40000
14/12/20 23:26:52 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
14/12/20 23:26:55 INFO input.FileInputFormat: Total input paths to process : 1
14/12/20 23:26:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/20 23:26:55 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/20 23:26:56 INFO mapred.JobClient: Running job: job_201412202325_0002
14/12/20 23:26:57 INFO mapred.JobClient: map 0% reduce 0%
14/12/20 23:27:15 INFO mapred.JobClient: map 100% reduce 0%
14/12/20 23:27:17 INFO mapred.JobClient: Job complete: job_201412202325_0002
14/12/20 23:27:18 INFO mapred.JobClient: Counters: 20
14/12/20 23:27:18 INFO mapred.JobClient: Job Counters
14/12/20 23:27:18 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14058
14/12/20 23:27:18 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient: Rack-local map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient: Launched map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/12/20 23:27:18 INFO mapred.JobClient: File Output Format Counters
14/12/20 23:27:18 INFO mapred.JobClient: Bytes Written=0
14/12/20 23:27:18 INFO mapred.JobClient: injector
14/12/20 23:27:18 INFO mapred.JobClient: urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient: FileSystemCounters
14/12/20 23:27:18 INFO mapred.JobClient: FILE_BYTES_READ=149
14/12/20 23:27:18 INFO mapred.JobClient: HDFS_BYTES_READ=130
14/12/20 23:27:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78488
14/12/20 23:27:18 INFO mapred.JobClient: File Input Format Counters
14/12/20 23:27:18 INFO mapred.JobClient: Bytes Read=149
14/12/20 23:27:18 INFO mapred.JobClient: Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient: Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=106311680
14/12/20 23:27:18 INFO mapred.JobClient: Spilled Records=0
14/12/20 23:27:18 INFO mapred.JobClient: CPU time spent (ms)=2420
14/12/20 23:27:18 INFO mapred.JobClient: Total committed heap usage (bytes)=29753344
14/12/20 23:27:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=736796672
14/12/20 23:27:18 INFO mapred.JobClient: Map output records=3
14/12/20 23:27:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=130
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27

III. Addendum: running Nutch from Eclipse

This approach is essentially the same as deploy mode.

Run InjectorJob from Eclipse.

Eclipse output:

InjectorJob: starting at 2014-12-20 23:13:24
InjectorJob: Injecting urlDir: /Users/liaoliuqing/99_Project/2.x/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 23:13:27, elapsed: 00:00:02
