Notes on nutch-site.xml settings in Nutch 2.3

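nutch-site.xml is an optional configuration file: Nutch falls back to the defaults in nutch-default.xml and will start without it. In practice, though, a crawl needs at least http.agent.name set there, because the fetcher refuses to run with an empty agent name.

nutch-site.xml is where you customize the defaults defined in nutch-default.xml.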

nutch-default.xml lists every property that Nutch lets you configure, but it is not the place to customize them. To change a setting, override the property in nutch-site.xml instead.
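As a minimal sketch of such an override, a nutch-site.xml might look like the following. The agent name MyNutchCrawler is a hypothetical placeholder rather than a value from this post; the layout is the same Hadoop-style configuration format that nutch-default.xml itself uses.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical value: replace it with a name that identifies your own crawler. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

Any property not listed in this file keeps its value from nutch-default.xml.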

nutch-default.xml is divided into 25 sections:

<!-- general properties -->
<!-- file properties -->
<!-- HTTP properties -->
<!-- FTP properties -->
<!-- web db properties -->
<!-- generate properties -->
<!-- urlpartitioner properties -->
<!-- fetcher properties -->
<!-- indexingfilter plugin properties -->
<!-- BasicIndexingfilter plugin properties -->
<!-- moreindexingfilter plugin properties -->
<!-- AnchorIndexing filter plugin properties -->
<!-- URL normalizer properties -->
<!-- mime properties -->
<!-- plugin properties -->
<!-- parser properties -->
<!-- urlfilter plugin properties -->
<!-- scoring filters properties -->
<!-- language-identifier plugin properties -->
<!-- index-metadata plugin properties -->
<!-- parse-metatags plugin properties -->
<!-- Temporary Hadoop 0.17.x workaround. -->
<!-- solr index properties -->
<!-- elasticsearch index properties -->
<!-- storage properties -->

http.max.delays

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

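The maximum number of times a fetcher thread waits for a busy host before giving up on the page for now. It is a count, not a duration, and the default is 100. Each wait lasts fetcher.server.delay seconds (5.0 by default), so on a poor network it also helps to set fetcher.server.delay a little higher. http.timeout is another setting that depends on network conditions.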
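For a slow or unreliable network, the related overrides could be sketched as follows. The values are illustrative assumptions, not recommendations from the Nutch project, and the fragment belongs inside the <configuration> element of nutch-site.xml.

```xml
<!-- Illustrative values only; tune them against your own network. -->
<property>
  <name>http.max.delays</name>
  <value>200</value> <!-- give up on a busy host after 200 waits instead of 100 -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>8.0</value> <!-- seconds between successive requests to the same server -->
</property>
<property>
  <name>http.timeout</name>
  <value>20000</value> <!-- network timeout, in milliseconds -->
</property>
```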

http.content.limit

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

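Limits the length of the content downloaded for each document. The default is 65536 bytes, so a fetched document is truncated at 64 KB and anything beyond that is discarded. Search engines that crawl specific kinds of content, such as XML documents, need to raise this value; a negative value disables truncation altogether.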
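A sketch of raising the limit, to be placed inside the <configuration> element of nutch-site.xml. The 1 MiB figure is an assumption chosen for illustration, not a value from this post.

```xml
<!-- Assumed value: 1048576 bytes = 1 MiB. A negative value such as -1 disables truncation. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```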

db.fetch.interval.default and db.fetch.interval.max

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>

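These settings are useful when you need scheduled, automatic re-crawls: they control how many days pass before a page is fetched again. Both values are given in seconds.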
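Because the values are in seconds, a weekly re-crawl could be sketched like this. The weekly schedule is an assumed example, and the fragment goes inside the <configuration> element of nutch-site.xml.

```xml
<!-- Assumed schedule: re-fetch pages weekly. 7 days * 24 h * 3600 s = 604800 s. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
</property>
```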

fetcher.server.delay

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if
   fetcher.threads.per.queue is set to 1.
   </description>
</property>

fetcher.threads.fetch

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>

The maximum number of fetcher threads, which is also the maximum number of requests made at once.

fetcher.threads.per.queue

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a queue at one time. Setting it to
    a value > 1 will cause the Crawl-Delay value from robots.txt to
    be ignored and the value of fetcher.server.min.delay to be used
    as a delay between successive requests to the same server instead
    of fetcher.server.delay.
   </description>
</property>

The maximum number of threads allowed to fetch from the same queue, that is, from the same site, at one time.
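The two thread settings interact with the delay settings, which the following sketch tries to make concrete. All three values are illustrative assumptions; the fragment goes inside the <configuration> element of nutch-site.xml.

```xml
<!-- Illustrative values only. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value> <!-- total fetcher threads, and so the maximum concurrent requests -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value> <!-- above 1, the Crawl-Delay from robots.txt is ignored -->
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>1.0</value> <!-- seconds; replaces fetcher.server.delay when threads.per.queue is above 1 -->
</property>
```

Since a value above 1 bypasses Crawl-Delay, it is probably best kept for sites you operate yourself or have permission to crawl aggressively.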

fetcher.verbose

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

If true, the fetcher logs more detailed information.

plugin.folders

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

A plugin setting: plugin.folders specifies the directories from which plugins are loaded.

plugin.includes

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

A plugin setting: plugin.includes is a regular expression that names the plugins to load.
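Following the note about HTTPS in the description, one possible override swaps protocol-http for protocol-httpclient and leaves the rest of the default expression unchanged. This is a sketch under that assumption, to be placed inside the <configuration> element of nutch-site.xml; check the result against the plugins actually present in your plugin folder.

```xml
<!-- Sketch: the default expression with protocol-http replaced by protocol-httpclient. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```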

parser.character.encoding.default

<property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

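The fallback encoding used when parsing a document that carries no other encoding information. The default, windows-1252, is a single-byte Western European encoding that I am not very familiar with and that is rarely a good fit for Chinese pages.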
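If most of the pages you crawl are not Western European, the fallback can be changed. The sketch below assumes utf-8 is the better default for your sites, which is an assumption about your content rather than something this post establishes; it goes inside the <configuration> element of nutch-site.xml.

```xml
<!-- Assumed value: utf-8. Pick the encoding your target sites most often use. -->
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>
```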

parser.html.impl

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

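Specifies the parser used for HTML documents. NekoHTML is fairly powerful; a later post will cover Neko in detail, including HTML-to-text conversion and parsing of HTML fragments.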
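Switching to the other parser named in the description takes one override inside the <configuration> element of nutch-site.xml. This sketch only shows the syntax and does not claim that TagSoup handles any particular page better.

```xml
<!-- "tagsoup" is the second keyword listed in the property description. -->
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>
```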

lang.analyze.max.length

<property>
  <name>lang.analyze.max.length</name>
  <value>2048</value>
  <description> The maximum bytes of data to use to identify
  the language (0 means full content analysis).
  The larger this value is, the better the analysis, but the
  slower it is.
  </description>
</property>

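A language-related setting: it limits how many bytes of a page are examined when identifying its language. I have not used this option myself. Several other important options are also configured in nutch-site.xml.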
