[Nutch 2.2.1 Basic Tutorial, Part 3] Nutch 2.2.1 Configuration Files

nutch-site.xml

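Nutch 2.2.1 ships with two configuration files: nutch-default.xml and nutch-site.xml.

The former holds Nutch's built-in default properties and should normally be left unmodified.

To change a default, add a property with the same name to nutch-site.xml and set the new value there. Values in nutch-site.xml override those in nutch-default.xml.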
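The override rule can be illustrated with a small Python sketch. This is not how Nutch itself loads its configuration; the two XML strings are trimmed-down stand-ins for the real files, and the helper names `read_properties` and `effective_config` are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml (the real file holds many more properties).
NUTCH_DEFAULT = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
  </property>
</configuration>
"""

# Stand-in for nutch-site.xml: only the properties being overridden.
NUTCH_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let same-named site properties win."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(NUTCH_DEFAULT, NUTCH_SITE)
print(config)
```

Properties that appear only in the defaults keep their default value; a property repeated in the site file takes the site value.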

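1. db.ignore.external.links

If true, outlinks that lead to external hosts are ignored, so the crawl stays on the hosts that were initially injected.

The same effect can be achieved by adding filters to regex-urlfilter.txt, but a large number of filters (several thousand, for example) will seriously degrade Nutch's performance.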

<property>

  <name>db.ignore.external.links</name>

  <value>true</value>

  <description>If true, outlinks leading from a page to external hosts

  will be ignored. This is an effective way to limit the crawl to include

  only initially injected hosts, without creating complex URLFilters.

  </description>

</property>
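As a rough illustration of what this setting means, the following Python sketch keeps only the outlinks whose host matches the host of the page they were found on. It assumes that "external" means "different host name", as the property description suggests; the function name `drop_external_links` is hypothetical and the code is not taken from Nutch.

```python
from urllib.parse import urlsplit


def drop_external_links(page_url, outlinks):
    """Keep only outlinks that point to the same host as page_url."""
    # urlsplit(...).hostname is already lower-cased; it is None for URLs without a host.
    page_host = urlsplit(page_url).hostname
    return [link for link in outlinks if urlsplit(link).hostname == page_host]


links = [
    "http://example.com/about",
    "http://other.org/",
    "http://www.example.com/news",
]
print(drop_external_links("http://example.com/index.html", links))
```

Under this assumption `www.example.com` counts as a different host from `example.com`, so a single host comparison replaces what would otherwise be one URL filter rule per site.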

2. fetcher.parse

Can pages be parsed while they are being fetched? Yes, but this is not recommended.

<property>

  <name>fetcher.parse</name>

  <value>false</value>

  <description>If true, fetcher will parse content. NOTE: previous releases would

  default to true. Since 2.0 this is set to false as a safer default.</description>

</property>

Official explanation:

N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.

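3. db.max.outlinks.per.page

By default, Nutch processes at most 100 outlinks per page, so on link-heavy pages some links are never crawled. To change this behaviour, adjust this property.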

<property>

  <name>db.max.outlinks.per.page</name>

  <value>100</value>

  <description>The maximum number of outlinks that we'll process for a page.  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks  will be processed for a page; otherwise, all outlinks will be processed.

  </description>

</property>

The official FAQ explains this as follows: http://wiki.apache.org/nutch/FAQ/

Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

file: conf/nutch-default.xml

 <property>

   <name>db.max.outlinks.per.page</name>

   <value>-1</value>

   <description>The maximum number of outlinks that we'll process for a page.

   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks

   will be processed for a page; otherwise, all outlinks will be processed.

   </description>

 </property>
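The rule stated in the property description (a nonnegative value caps the number of outlinks, a negative value disables the cap) can be sketched in a few lines of Python. The function name `limit_outlinks` is hypothetical and the code only mirrors the description; it is not Nutch's implementation.

```python
def limit_outlinks(outlinks, max_outlinks=100):
    """Apply the db.max.outlinks.per.page rule as the description states it."""
    if max_outlinks >= 0:
        # Nonnegative: keep at most max_outlinks entries.
        return list(outlinks[:max_outlinks])
    # Negative (for example -1): keep every outlink.
    return list(outlinks)


found = ["http://example.com/page%d" % i for i in range(250)]
print(len(limit_outlinks(found)))       # limit of 100 applied
print(len(limit_outlinks(found, -1)))   # no limit
```

With the default of 100, the remaining 150 links of this hypothetical page would never be processed, which is the symptom the FAQ entry describes.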

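4. file.content.limit, http.content.limit, ftp.content.limit

By default, Nutch keeps only the first 65536 bytes of a fetched document and discards the rest.

On some large sites, however, the home page is far larger than 65536 bytes, and the first 65536 bytes may contain nothing but layout markup without a single hyperlink.

The default values are therefore changed as follows: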

<property>

  <name>file.content.limit</name>

  <value>-1</value>

  <description>The length limit for downloaded content using the file

   protocol, in bytes. If this value is nonnegative (>=0), content longer

   than it will be truncated; otherwise, no truncation at all. Do not

   confuse this setting with the http.content.limit setting.

  </description>

</property>

<property>

  <name>http.content.limit</name>

  <value>-1</value>

  <description>The length limit for downloaded content using the http

  protocol, in bytes. If this value is nonnegative (>=0), content longer

  than it will be truncated; otherwise, no truncation at all. Do not

  confuse this setting with the file.content.limit setting.

  </description>

</property>

<property>

  <name>ftp.content.limit</name>

  <value>-1</value>

  <description>The length limit for downloaded content, in bytes.

  If this value is nonnegative (>=0), content longer than it will be truncated;

  otherwise, no truncation at all.

  Caution: classical ftp RFCs never defines partial transfer and, in fact,

  some ftp servers out there do not handle client side forced close-down very

  well. Our implementation tries its best to handle such situations smoothly.

  </description>

</property>
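To see why the 65536-byte default can hide links, here is a minimal Python sketch of the truncation rule given in the descriptions above. The function name `apply_content_limit` and the sample page are hypothetical; the code is an illustration, not Nutch's implementation.

```python
def apply_content_limit(content, limit=65536):
    """Truncate content to `limit` bytes when limit is nonnegative."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    # Negative limit (for example -1) or short content: no truncation.
    return content


# Hypothetical page: 70000 bytes of layout markup followed by the first link.
page = b"<!-- layout -->" + b" " * 70000 + b'<a href="/next">next</a>'

truncated = apply_content_limit(page)
full = apply_content_limit(page, -1)
print(b"href" in truncated, b"href" in full)
```

With the default limit the only link on this sample page falls past the cut-off, so a parser never sees it; with -1 the whole page is kept, at the cost of storing and parsing arbitrarily large documents.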
