Heritrix可分为四大模块:

1、控制器CrawlController

2、待处理的uri列表  Frontier

3、线程池 ToeThread

4、各个步骤的处理器

(1)Pre-fetch processing chain:主要处理DNS-lookup, robots.txt,认证,抓取范围检查等。

(2)Fetch Processing chain:抓取处理器。

对于每一个协议,均有一个类作支持,如FetchHTTP。

(3)Extractor processing chain:内容提取器。

用于提取页面中的链接。

(4)Write/index processing chain:将抓取的文件定入归档文件里,有时还会建立索引。

(5)Post-processing chain:更新抓取状态,检查第4步抓取的链接是否在抓取范围中。

附官方文档:

4. Overview of the crawler

The Heritrix Web Crawler is designed to be modular. Which modules to use can be set at runtime from the user interface. Our hope is that if you want the crawler to behave different from the default, it should only be a matter of writing a new module as a
replacement or in addition to the modules shipped with the crawler.

The rest of this document assumes you have a basic understanding of how to run a crawl (see: [Heritrix
User Guide
]). Since the crawler is written in the Java programming language, you also need a fairly good understanding of Java.

The crawler consists of core classes and pluggable modules. The core classes can be configured, but not replaced. The pluggable classes can be substituted by altering the configuration
of the crawler. A set of basic pluggable classes are shipped with the crawler, but if you have needs not met by these classes you could write your own.

Figure 1. Crawler overview

watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvamVkaWFlbF9sdQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt="">

4.1. The CrawlController

The CrawlController collects all the classes which cooperate to perform a crawl, provides a high-level interface to the running crawl, and executes the "master thread" which doles out URIs from the Frontier to the ToeThreads. As the "global context" for
a crawl, subcomponents will usually reach each other through the CrawlController.

4.2. The Frontier

The Frontier is responsible for handing out the next URI to be crawled. It is responsible for maintaining politeness, that is making sure that no web server is crawled too heavily. After a URI is crawled, it is handed back to the Frontier along with any
newly discovered URIs that the Frontier should schedule for crawling.

It is the Frontier which keeps the state of the crawl. This includes, but is not limited to:

  • What URIs have been discovered

  • What URIs are being processed (fetched)

  • What URIs have been processed

The Frontier implements the Frontier interface and can be replaced by any Frontier that implements this interface. It should be noted though that writing a Frontier is not a trivial task.

The Frontier relies on the behavior of at least the following external processors: PreconditionEnforcer, LinksScoper and the FrontierScheduler (See below for more each of these Processors). The PreconditionEnforcer makes sure dns and robots are checked ahead
of any fetching. LinksScoper tests if we are interested in a particular URL -- whether the URL is 'within the crawl scope' and if so, what our level of interest in the URL is, the priority with which it should be fetched. The FrontierScheduler adds ('schedules')
URLs to the Frontier for crawling.

4.3. ToeThreads

The Heritrix web crawler is multi threaded. Every URI is handled by its own thread called a ToeThread. A ToeThread asks the Frontier for a new URI, sends it through all the processors and then asks for a new URI.

4.4. Processors

Processors are grouped into processor chains (Figure 2,
“Processor chains”
). Each chain does some processing on a URI. When a Processor is finished with a URI the ToeThread sends the URI to the next Processor until the URI has been processed by all the Processors. A processor has the option of telling the URI
to skip to a particular chain. Also if a processor throws a fatal error, the processing skips to the Post-processing chain.

Figure 2. Processor chains

watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvamVkaWFlbF9sdQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt="">

The task performed by the different processing chains are as follows:

4.4.1. Pre-fetch processing chain

The first chain is responsible for investigating if the URI could be crawled at this point. That includes checking if all preconditions are met (DNS-lookup, fetching robots.txt, authentication). It is also possible to completely block the crawling of URIs
that have not passed through the scope check.

In the Pre-fetch processing chain the following processors should be included (or replacement modules that perform similar operations):

  • Preselector

    Last check if the URI should indeed be crawled. Can for example recheck scope. Useful if scope rules have been changed after the crawl starts. The scope is usually checked by the LinksScoper, before new URIs are added to the Frontier to be crawled. If the
    user changes the scope limits, it will not affect already queued URIs. By rechecking the scope at this point, you make sure that only URIs that are within current scope are being crawled.

  • PreconditionEnforcer

    Ensures that all preconditions for crawling a URI have been met. These currently include verifying that DNS and robots.txt information has been fetched for the URI.

4.4.2. Fetch processing chain

The processors in this chain are responsible for getting the data from the remote server. There should be one processor for each protocol that Heritrix supports: e.g. FetchHTTP.

4.4.3. Extractor processing chain

At this point the content of the document referenced by the URI is available and several processors will in turn try to get new links from it.

4.4.4. Write/index processing chain

This chain is responsible for writing the data to archive files. Heritrix comes with an ARCWriterProcessor which writes to the ARC format. New processors could be written to support other formats and even create indexes.

4.4.5. Post-processing chain

A URI should always pass through this chain even if a decision not to crawl the URI was done in a processor earlier in the chain. The post-processing chain must contain the following processors (or replacement modules that perform similar operations):

  • CrawlStateUpdater

    Updates the per-host information that may have been affected by the fetch. This is currently robots and IP address info.

  • LinksScoper

    Checks all links extracted from the current download against the crawl scope. Those that are out of scope are discarded. Logging of discarded URLs can be enabled.

  • FrontierScheduler

    'Schedules' any URLs stored as CandidateURIs found in the current CrawlURI with the frontier for crawling. Also schedules prerequisites if any.

【Heritrix基础教程之3】Heritrix的基本架构的更多相关文章

  1. 【Heritrix基础教程之1】在Eclipse中配置Heritrix

    一.新建项目并将Heritrix源代码导入 1.下载heritrix-1.14.4-src.zip和heritrix-1.14.4.zip两个压缩包,并解压,以后分别简称SRC包和ZIP包: 2.在E ...

  2. 【Heritrix基础教程之2】Heritrix基本内容介绍

    1.版本说明 (1)最新版本:3.3.0 (2)最新release版本:3.2.0 (3)重要历史版本:1.14.4 3.1.0及之前的版本:http://sourceforge.net/projec ...

  3. 【Heritrix基础教程之3】Heritrix的基本架构 分类: H3_NUTCH 2014-06-01 16:56 1267人阅读 评论(0) 收藏

    Heritrix可分为四大模块: 1.控制器CrawlController 2.待处理的uri列表  Frontier 3.线程池 ToeThread 4.各个步骤的处理器 (1)Pre-fetch ...

  4. 【Heritrix基础教程之2】Heritrix基本内容介绍 分类: B1_JAVA H3_NUTCH 2014-06-01 13:02 878人阅读 评论(0) 收藏

    1.版本说明 (1)最新版本:3.3.0 (2)最新release版本:3.2.0 (3)重要历史版本:1.14.4 3.1.0及之前的版本:http://sourceforge.net/projec ...

  5. 【Heritrix基础教程之1】在Eclipse中配置Heritrix 分类: H3_NUTCH 2014-06-01 00:00 1262人阅读 评论(0) 收藏

    一.新建项目并将Heritrix源码导入 1.下载heritrix-1.14.4-src.zip和heritrix-1.14.4.zip两个压缩包,并解压,以后分别简称SRC包和ZIP包: 2.在Ec ...

  6. 【Heritrix基础教程之4】开始一个爬虫抓取的全流程代码分析

    在创建一个job后,就要开始job的运行,运行的全流程如下: 1.在界面上启动job 2.index.jsp 查看上述页面对应的源代码 <a href='"+request.getCo ...

  7. OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务

    OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务   1.  OpenVAS基础知识 OpenVAS(Open Vulnerability Assessment Sys ...

  8. Python基础教程之List对象 转

    Python基础教程之List对象 时间:2014-01-19    来源:服务器之家    投稿:root   1.PyListObject对象typedef struct {    PyObjec ...

  9. Python基础教程之udp和tcp协议介绍

    Python基础教程之udp和tcp协议介绍 UDP介绍 UDP --- 用户数据报协议,是一个无连接的简单的面向数据报的运输层协议.UDP不提供可靠性,它只是把应用程序传给IP层的数据报发送出去,但 ...

随机推荐

  1. pyqt 托盘例子学习

    # -*- coding: utf-8 -*- # python:2.x __author__ = 'Administrator' from PyQt4.QtGui import * from PyQ ...

  2. 简单的Dao设计模式

    简单的DAO设计模式 这两天学习到了DAO(Data Access Object 数据存取对象)设计模式.想谈谈自己的感受,刚开始接触是感觉有点难,觉得自己逻辑理不清,主要是以前学的知识比较零散没有很 ...

  3. 期望dp-hdu-4336-Card Collector

    题目链接: http://acm.hdu.edu.cn/showproblem.php?pid=4336 题目大意: 有n种卡片,每包中至多有一种卡片,概率分别为p1,p2,...pn,可能有的没有卡 ...

  4. iOS:UI系列之UINavigationController

    又到了总结的时间了,突然间感觉时间过得好快啊, 总觉的时间不够用,但是这也没办法啊, 只有自己挤时间了,虽然是零基础,但是这并不能代表什么啦,只要努力,收获总还是有的, 同时我也相信广大的博友肯定也有 ...

  5. ASP.NET内核几大对象、ASP.NET核心知识(6)--转载

    这篇博文主要介绍一下几个对象. 1)HttpContext 2)HttpRequest 3)HttpResponse 4)context. Server 5)context.Session HttpC ...

  6. Linq GroupJoin 使用

    备忘: var data = BoshccEntities.Current.TB_MB_1 .GroupJoin(BoshccEntities.Current.TB_MB_2, o => o.H ...

  7. BackgroundWorker 后台进程控制窗体label、richtextbook内容刷新

    昨天写了一个从文章中提取关键词的程序,写完处理的逻辑后又花了好几个小时在用户友好性上.加了几个progressBar,有显示总进度的.有显示分布进度的..因为程序要跑好几个小时才能执行好,只加个总进度 ...

  8. 使用DataReader读取数据

    List<User> allUsers = new List<User>(); SqlConnection conn = new SqlConnection(连接字符串); S ...

  9. android 中使用缓存加载数据

    最近app快完工了,但是很多列表加载,新闻咨询等数据一直从网络请求,速度很慢,影响用户体验,所以寻思用缓存来加载一些更新要求不太高的数据 废话不多说,上代码 欢迎转载,但请保留文章原始出处:)  博客 ...

  10. nginx+tomcat的集群和session复制

    前端服务器采用nginx,后端应用服务器采用tomcat.nginx负责负载均衡,session复制在tomcat上处理. 1.nginx安装(略) 2.nginx配置负载均衡 http { incl ...