【Nutch2.2.1基础教程之3】Nutch2.2.1配置文件
nutch-site.xml
在nutch2.2.1中,有两份配置文件:nutch-default.xml与nutch-site.xml。
其中前者是nutch自带的默认属性,一般情况下不要修改。
如果需要修改默认属性,可以在nutch-site.xml中增加一个同名的属性,并修改其值。nutch-site.xml中的属性值会覆盖nutch-default.xml中的值。
1、db.ignore.external.links
若为true,则只抓取本域名内的网页,忽略外部链接。
可以在 regex-urlfilter.txt中增加过滤器达到同样效果,但如果过滤器过多,如几千个,则会大大影响nutch的性能。
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
2、fetcher.parse
能否在抓取的同时进行解释:可以,但不 建议这样做。
<property>
<name>fetcher.parse</name>
<value>false</value>
<description>If true, fetcher will parse content. NOTE: previous releases would
default to true. Since 2.0 this is set to false as a safer default.</description>
</property>
官方解释
N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space,
usually after a very long reduce job. Behaviour typical to this is usually
observed in this situation.
In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.
3、db.max.outlinks.per.page
默认情况下,Nutch只抓取某个网页的100个外部链接,导致部分链接无法抓取。若要改变此情况,可以修改此配置项。
<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
官方说明如下:http://wiki.apache.org/nutch/FAQ/
Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change thedb.max.outlinks.per.page property to a higher
value or simply -1 (unlimited).
file: conf/nutch-default.xml
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
4、file.content.limit http.content.limit ftp.content.limit
默认情况下,nutch只抓取网页的前65536个字节,之后的内容将被丢弃。
但对于某些大型网站,首页的内容远远不止65536个字节,甚至前面65536个字节里面均是一些布局信息,并没有任何的超链接。
因此修改默认值如下:
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property> <property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property> <property>
<name>ftp.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
Caution: classical ftp RFCs never defines partial transfer and, in fact,
some ftp servers out there do not handle client side forced close-down very
well. Our implementation tries its best to handle such situations smoothly.
</description>
</property>
【Nutch2.2.1基础教程之3】Nutch2.2.1配置文件的更多相关文章
- 【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二:内容分析
请先参见"集成Nutch/Hbase/Solr构建搜索引擎之一:安装及运行",搭建测试环境 http://blog.csdn.net/jediael_lu/article/deta ...
- 【Nutch2.2.1基础教程之3】Nutch2.2.1配置文件 分类: H3_NUTCH 2014-08-18 16:33 1376人阅读 评论(0) 收藏
nutch-site.xml 在nutch2.2.1中,有两份配置文件:nutch-default.xml与nutch-site.xml. 其中前者是nutch自带的默认属性,一般情况下不要修改. 如 ...
- 【Nutch2.2.1基础教程之6】Nutch2.2.1抓取流程
一.抓取流程概述 1.nutch抓取流程 当使用crawl命令进行抓取任务时,其基本流程步骤如下: (1)InjectorJob 开始第一个迭代 (2)GeneratorJob (3)FetcherJ ...
- 【Nutch2.2.1基础教程之1】nutch相关异常
1.在任务一开始运行,注入Url时即出现以下错误. InjectorJob: Injecting urlDir: urls InjectorJob: Using class org.apache.go ...
- 【Nutch2.2.1基础教程之2.1】集成Nutch/Hbase/Solr构建搜索引擎之一:安装及运行【单机环境】
1.下载相关软件,并解压 版本号如下: (1)apache-nutch-2.2.1 (2) hbase-0.90.4 (3)solr-4.9.0 并解压至/usr/search 2.Nutch的配置 ...
- 【Nutch2.2.1基础教程之6】Nutch2.2.1抓取流程 分类: H3_NUTCH 2014-08-15 21:39 2530人阅读 评论(1) 收藏
一.抓取流程概述 1.nutch抓取流程 当使用crawl命令进行抓取任务时,其基本流程步骤如下: (1)InjectorJob 开始第一个迭代 (2)GeneratorJob (3)FetcherJ ...
- 【Nutch2.2.1基础教程之1】nutch相关异常 分类: H3_NUTCH 2014-08-08 21:46 1549人阅读 评论(2) 收藏
1.在任务一开始运行,注入Url时即出现以下错误. InjectorJob: Injecting urlDir: urls InjectorJob: Using class org.apache.go ...
- OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务
OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务 1. OpenVAS基础知识 OpenVAS(Open Vulnerability Assessment Sys ...
- Python基础教程之List对象 转
Python基础教程之List对象 时间:2014-01-19 来源:服务器之家 投稿:root 1.PyListObject对象typedef struct { PyObjec ...
随机推荐
- Hibernate学习笔记--Hibernate框架错误集合及解决
错误1:MappingException: Unknown entity解决方案 http://jingyan.baidu.com/article/e75aca8552761b142edac6cf.h ...
- Swift中数组集合-b
数组(Array)是一串有序的由相同类型元素构成的集合.数组中的集合元素是有序的,可以重复出现. 声明一个Array类型的时候可以使用下面的语句之一. var studentList1:Array&l ...
- 关于开源中文搜索引擎架构coreseek中算法详解
Coreseek 是一款中文全文检索/搜索软件,以GPLv2许可协议开源发布,基于Sphinx研发并独立发布,专攻中文搜索和信息处理领域,适用于行业/垂直搜索.论坛/站内搜索.数据库搜索.文档/文献 ...
- cf B. Vasya and Public Transport
http://codeforces.com/contest/355/problem/B #include <cstdio> #include <cstring> #includ ...
- 这样就算会了PHP么?-4
数组初步 <?php $ereg = 'tm'; $str = 'hello,tm,Tm,tM,TM.'; $rep_str = eregi_replace($ereg, 'TM', $str) ...
- Codeforces 22B Bargaining Table
http://www.codeforces.com/problemset/problem/22/B 题意:求出n*m的方格图中全是0的矩阵的最大周长 思路:枚举 #include<cstdio& ...
- 单线程Singleton模式的几个要点
1.Singleton模式中的实例构造器可以设置为protected以允许子类派生.2.Singleton模式一般不要支持ICIoneable接口,因为这可能会导致多个对象实例,与Singleton模 ...
- (转)PHP zval内存回收机制和refcount_gc和is_ref_gc
出处 : http://blog.sina.com.cn/s/blog_75a2f94f0101gygh.html 对于PHP这种需要同时处理多个请求的程序来说,申请和释放内存的时候应该慎之又慎,一不 ...
- java I/O之装饰者模式
装饰者: Decorator模式(别名Wrapper):动态将职责附加到对象上,若要扩展功能,装饰者提供了比继承更具弹性的代替方案. 装饰者模式意图: 动态的给一个对象添加额外的职责.Decorato ...
- Linux企业级项目实践之网络爬虫(18)——队列处理
所有的URL都接受管理,并在此进行流动.URL从管理模块的存储空间开始,一直到最后输出给磁盘上的URL索引,都由此部分调度.首先,给出URL调度的一般过程,如图所示.其流程的各个具体操作,后面详述.要 ...