Heritrix可分为四大模块:

1、控制器CrawlController

2、待处理的uri列表  Frontier

3、线程池 ToeThread

4、各个步骤的处理器

(1)Pre-fetch processing chain:主要处理DNS-lookup, robots.txt,认证,抓取范围检查等。

(2)Fetch Processing chain:抓取处理器。对于每个协议,均有一个类作支持,如FetchHTTP。

(3)Extractor processing chain:内容提取器。用于提取页面中的链接。

(4)Write/index processing chain:将抓取的文件定入归档文件中,有时还会建立索引。

(5)Post-processing chain:更新抓取状态,检查第4步抓取的链接是否在抓取范围中。

附官方文档:

4. Overview of the crawler

The Heritrix Web Crawler is designed to be modular. Which modules to use can be set at runtime from the user interface. Our hope is that if you want the crawler to behave different from the default, it should only be a matter of writing a new module as a
replacement or in addition to the modules shipped with the crawler.

The rest of this document assumes you have a basic understanding of how to run a crawl (see: [Heritrix
User Guide
]). Since the crawler is written in the Java programming language, you also need a fairly good understanding of Java.

The crawler consists of core classes and pluggable modules. The core classes can be configured, but not replaced. The pluggable classes can be substituted by altering the configuration
of the crawler. A set of basic pluggable classes are shipped with the crawler, but if you have needs not met by these classes you could write your own.

Figure 1. Crawler overview


4.1. The CrawlController

The CrawlController collects all the classes which cooperate to perform a crawl, provides a high-level interface to the running crawl, and executes the "master thread" which doles out URIs from the Frontier to the ToeThreads. As the "global context" for
a crawl, subcomponents will usually reach each other through the CrawlController.

4.2. The Frontier

The Frontier is responsible for handing out the next URI to be crawled. It is responsible for maintaining politeness, that is making sure that no web server is crawled too heavily. After a URI is crawled, it is handed back to the Frontier along with any
newly discovered URIs that the Frontier should schedule for crawling.

It is the Frontier which keeps the state of the crawl. This includes, but is not limited to:

  • What URIs have been discovered

  • What URIs are being processed (fetched)

  • What URIs have been processed

The Frontier implements the Frontier interface and can be replaced by any Frontier that implements this interface. It should be noted though that writing a Frontier is not a trivial task.

The Frontier relies on the behavior of at least the following external processors: PreconditionEnforcer, LinksScoper and the FrontierScheduler (See below for more each of these Processors). The PreconditionEnforcer makes sure dns and robots are checked ahead
of any fetching. LinksScoper tests if we are interested in a particular URL -- whether the URL is 'within the crawl scope' and if so, what our level of interest in the URL is, the priority with which it should be fetched. The FrontierScheduler adds ('schedules')
URLs to the Frontier for crawling.

4.3. ToeThreads

The Heritrix web crawler is multi threaded. Every URI is handled by its own thread called a ToeThread. A ToeThread asks the Frontier for a new URI, sends it through all the processors and then asks for a new URI.

4.4. Processors

Processors are grouped into processor chains (Figure 2,
“Processor chains”
). Each chain does some processing on a URI. When a Processor is finished with a URI the ToeThread sends the URI to the next Processor until the URI has been processed by all the Processors. A processor has the option of telling the URI
to skip to a particular chain. Also if a processor throws a fatal error, the processing skips to the Post-processing chain.

Figure 2. Processor chains


The task performed by the different processing chains are as follows:

4.4.1. Pre-fetch processing chain

The first chain is responsible for investigating if the URI could be crawled at this point. That includes checking if all preconditions are met (DNS-lookup, fetching robots.txt, authentication). It is also possible to completely block the crawling of URIs
that have not passed through the scope check.

In the Pre-fetch processing chain the following processors should be included (or replacement modules that perform similar operations):

  • Preselector

    Last check if the URI should indeed be crawled. Can for example recheck scope. Useful if scope rules have been changed after the crawl starts. The scope is usually checked by the LinksScoper, before new URIs are added to the Frontier to be crawled. If the
    user changes the scope limits, it will not affect already queued URIs. By rechecking the scope at this point, you make sure that only URIs that are within current scope are being crawled.

  • PreconditionEnforcer

    Ensures that all preconditions for crawling a URI have been met. These currently include verifying that DNS and robots.txt information has been fetched for the URI.

4.4.2. Fetch processing chain

The processors in this chain are responsible for getting the data from the remote server. There should be one processor for each protocol that Heritrix supports: e.g. FetchHTTP.

4.4.3. Extractor processing chain

At this point the content of the document referenced by the URI is available and several processors will in turn try to get new links from it.

4.4.4. Write/index processing chain

This chain is responsible for writing the data to archive files. Heritrix comes with an ARCWriterProcessor which writes to the ARC format. New processors could be written to support other formats and even create indexes.

4.4.5. Post-processing chain

A URI should always pass through this chain even if a decision not to crawl the URI was done in a processor earlier in the chain. The post-processing chain must contain the following processors (or replacement modules that perform similar operations):

  • CrawlStateUpdater

    Updates the per-host information that may have been affected by the fetch. This is currently robots and IP address info.

  • LinksScoper

    Checks all links extracted from the current download against the crawl scope. Those that are out of scope are discarded. Logging of discarded URLs can be enabled.

  • FrontierScheduler

    'Schedules' any URLs stored as CandidateURIs found in the current CrawlURI with the frontier for crawling. Also schedules prerequisites if any.

版权声明:本文为博主原创文章,未经博主允许不得转载。

【Heritrix基础教程之3】Heritrix的基本架构 分类: H3_NUTCH 2014-06-01 16:56 1267人阅读 评论(0) 收藏的更多相关文章

  1. 【Nutch2.2.1基础教程之3】Nutch2.2.1配置文件 分类: H3_NUTCH 2014-08-18 16:33 1376人阅读 评论(0) 收藏

    nutch-site.xml 在nutch2.2.1中,有两份配置文件:nutch-default.xml与nutch-site.xml. 其中前者是nutch自带的默认属性,一般情况下不要修改. 如 ...

  2. makefile基础实例讲解 分类: C/C++ 2015-03-16 10:11 66人阅读 评论(0) 收藏

    一.makefile简介 定义:makefile定义了软件开发过程中,项目工程编译链.接接的方法和规则. 产生:由IDE自动生成或者开发者手动书写. 作用:Unix(MAC OS.Solars)和Li ...

  3. OC基础:内存(内存管理) 分类: ios学习 OC 2015-06-25 16:50 73人阅读 评论(0) 收藏

    自动释放池: @autoreleasepool { } 内存管理机制       谁污染,谁治理 垃圾回收机制:gc(Garbage collection),由系统管理内存,开发人员不需要管理. OC ...

  4. Mahout快速入门教程 分类: B10_计算机基础 2015-03-07 16:20 508人阅读 评论(0) 收藏

    Mahout 是一个很强大的数据挖掘工具,是一个分布式机器学习算法的集合,包括:被称为Taste的分布式协同过滤的实现.分类.聚类等.Mahout最大的优点就是基于hadoop实现,把很多以前运行于单 ...

  5. Avro基础 分类: C_OHTERS 2015-02-14 19:56 310人阅读 评论(0) 收藏

    一.Avro的基本功能 1.定义了数据模式文件的语法,一般使用json文件.以及一些数据基本类型与复杂类型. 2.定义了数据序列化到文件后的数据格式,此格式可供各种语言进行读取. 3.为部分语言定义了 ...

  6. 【solr基础教程之九】客户端 分类: H4_SOLR/LUCENCE 2014-07-30 15:28 904人阅读 评论(0) 收藏

    一.Java Script 1.由于Solr本身可以返回Json格式的结果,而JavaScript对于处理Json数据具有天然的优势,因此使用JavaScript实现Solr客户端是一个很好的选择. ...

  7. 【Heritrix基础教程之2】Heritrix基本内容介绍 分类: B1_JAVA H3_NUTCH 2014-06-01 13:02 878人阅读 评论(0) 收藏

    1.版本说明 (1)最新版本:3.3.0 (2)最新release版本:3.2.0 (3)重要历史版本:1.14.4 3.1.0及之前的版本:http://sourceforge.net/projec ...

  8. 【Heritrix基础教程之1】在Eclipse中配置Heritrix 分类: H3_NUTCH 2014-06-01 00:00 1262人阅读 评论(0) 收藏

    一.新建项目并将Heritrix源码导入 1.下载heritrix-1.14.4-src.zip和heritrix-1.14.4.zip两个压缩包,并解压,以后分别简称SRC包和ZIP包: 2.在Ec ...

  9. Berkeley DB基础教程 分类: H3_NUTCH 2014-05-29 15:21 2212人阅读 评论(0) 收藏

    一.Berkeley DB的介绍 (1)Berkeley DB是一个嵌入式数据库,它适合于管理海量的.简单的数据.如Google使用其来保存账户信息,Heritrix用其来保存froniter. (2 ...

随机推荐

  1. Python基础教程之第3章 使用字符串

    Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32 Type "copyri ...

  2. java初始化过程中成员变量

    package day01; class Base{ int j; //1.j=0 Base(){ add(1); //2.调用子类add()方法 System.out.println(j); //4 ...

  3. BZOJ3674可持久化并查集(模板)

    没什么可说的,就是一个可持久化线段树维护一个数组fa以及deep按秩合并好了 注意一下强制在线 蒟蒻的我搞了好长时间QAQ 贴代码: #include<cstdio> #include&l ...

  4. Codeforces Round #277.5 解题报告

    又熬夜刷了cf,今天比正常多一题.比赛还没完但我知道F过不了了,一个半小时贡献给F还是没过--应该也没人Hack.写写解题报告吧= =. 解题报告例如以下: A题:选择排序直接搞,由于不要求最优交换次 ...

  5. hdu4605Magic Ball Game 树状数组

    //给一棵树.树的每个节点的子节点个数是0或2 //对于每个节点都有一个权值w[i] //一个权值为x的球在每个节点的情况有 //x=w[i] 这个球在该点不向下掉 //x<w[i] 这个球往左 ...

  6. Makefile的问题

    注意包含路径的变量,可能会带有空格而使得运行失败 直接定义:DEBUG  = false后面引用需要使用 $(DEBUG) 或者set  DEBUG  $(DEBUG)之后就可以直接引用了

  7. 【CS Round #44 (Div. 2 only) D】Count Squares

    [链接]点击打开链接 [题意] 给你一个0..n和0..m的区域. 你可以选定其中的4个点,然后组成一个正方形. 问你可以圈出多少个正方形. (正方形的边不一定和坐标轴平行) [题解] 首先,考虑只和 ...

  8. Codeforces_GYM_100741 A

    http://codeforces.com/gym/100741/problem/A A. Queries time limit per test 0.25 seconds memory limit ...

  9. 洛谷——P1179 数字统计

    https://www.luogu.org/problem/show?pid=1179 题目描述 请统计某个给定范围[L, R]的所有整数中,数字 2 出现的次数. 比如给定范围[2, 22],数字 ...

  10. 为easyUI的dataGrid加入自己的查询框

    dataGrid作为easyUI的一个核心组件,其功能上是非常强大的. 可是外观上似乎就有点差强人意了,首先说一下我对dataGrid外观的2点感受 1.图标不好看,且尺寸非常小(16x16)-- 关 ...