原文地址：http://docs.pyspider.org/en/latest/Architecture/

Architecture

This document describes the reason why I made pyspider and the architecture.

Why

Two years ago, I was working on a vertical search engine. We are facing following needs on crawling:

collect 100-200 websites, they may on/offline or change their templates at any time

We need a really powerful monitor to find out which website is changing. And a good tool to help us write script/template for each website.
data should be collected in 5min when website updated

We solve this problem by check index page frequently, and use something like 'last update time' or 'last reply time' to determine which page is changed. In addition to this, we recheck pages after X days in case to prevent the omission.
pyspider will never stop as WWW is changing all the time

Furthermore, we have some APIs from our cooperators, the API may need POST, proxy, request signature etc. Full control from script is more convenient than some global parameters of components.

Overview

The following diagram shows an overview of the pyspider architecture with its components and an outline of the data flow that takes place inside the system.

Components are connected by message queue. Every component, including message queue, is running in their own process/thread, and replaceable. That means, when process is slow, you can have many instances of processor and make full use of multiple CPUs, or deploy to multiple machines. This architecture makes pyspider really fast. benchmarking.

Components

Scheduler

The Scheduler receives tasks from newtask_queue from processor. Decide whether the task is new or requires re-crawl. Sort tasks according to priority and feeding them to fetcher with traffic control (token bucket algorithm). Take care of periodic tasks, lost tasks and failed tasks and retry later.

All of above can be set via self.crawl API.

Note that in current implement of scheduler, only one scheduler is allowed.

Fetcher

The Fetcher is responsible for fetching web pages then send results to processor. For flexible, fetcher support Data URI and pages that rendered by JavaScript (via phantomjs). Fetch method, headers, cookies, proxy, etag etc can be controlled by script via API.

Phantomjs Fetcher

Phantomjs Fetcher works like a proxy. It's connected to general Fetcher, fetch and render pages with JavaScript enabled, output a general HTML back to Fetcher:

scheduler -> fetcher -> processor

                |

            phantomjs

                |

             internet

Processor

The Processor is responsible for running the script written by users to parse and extract information. Your script is running in an unlimited environment. Although we have various tools(like PyQuery) for you to extract information and links, you can use anything you want to deal with the response. You may refer to Script Environment and API Reference to get more information about script.

Processor will capture the exceptions and logs, send status(task track) and new tasks to scheduler, send results to Result Worker.

Result Worker (optional)

Result worker receives results from Processor. Pyspider has a built-in result worker to save result to resultdb. Overwrite it to deal with result by your needs.

WebUI

WebUI is a web frontend for everything. It contains:

script editor, debugger
project manager
task monitor
result viewer, exporter

Maybe webui is the most attractive part of pyspider. With this powerful UI, you can debug your scripts step by step just as pyspider do. Starting or stop a project. Finding which project is going wrong and what request is failed and try it again with debugger.

Data flow

The data flow in pyspider is just as your seen in diagram above:

Each script has a callback named on_start, when you press the Run button on WebUI. A new task of on_start is submitted to Scheduler as the entries of project.
Scheduler dispatches this on_start task with a Data URI as a normal task to Fetcher.
Fetcher makes a request and a response to it (for Data URI, it's a fake request and response, but has no difference with other normal tasks), then feeds to Processor.
Processor calls the on_start method and generated some new URL to crawl. Processor send a message to Scheduler that this task is finished and new tasks via message queue to Scheduler (here is no results for on_start in most case. If has results, Processor send them to result_queue).
Scheduler receives the new tasks, looking up in the database, determine whether the task is new or requires re-crawl, if so, put them into task queue. Dispatch tasks in order.
The process repeats (from step 3) and wouldn't stop till WWW is dead ;-). Scheduler will check periodic tasks to crawl latest data.

pyspider architecture--官方文档的更多相关文章

Spring 4 官方文档学习 Spring与Java EE技术的集成
本部分覆盖了以下内容: Chapter 28, Remoting and web services using Spring -- 使用Spring进行远程和web服务 Chapter 29, Ent ...
OGR 官方文档
OGR 官方文档 http://www.gdal.org/ogr/index.html The OGR Simple Features Library is a C++ open source lib ...
cassandra 3.x官方文档(5)---探测器
写在前面 cassandra3.x官方文档的非官方翻译.翻译内容水平全依赖本人英文水平和对cassandra的理解.所以强烈建议阅读英文版cassandra 3.x 官方文档.此文档一半是翻译,一半是 ...
Cuda 9.2 CuDnn7.0 官方文档解读
目录 Cuda 9.2 CuDnn7.0 官方文档解读准备工作(下载) 显卡驱动重装 CUDA安装系统要求处理之前安装的cuda文件下载的deb安装过程下载的runfile的安装过程安装完 ...
HBase官方文档
HBase官方文档目录序 1. 入门 1.1. 介绍 1.2. 快速开始 2. Apache HBase (TM)配置 2.1. 基础条件 2.2. HBase 运行模式: 独立和分布式 2.3. ...
Spring Cloud官方文档中文版-服务发现：Eureka客户端
官方文档地址为:http://cloud.spring.io/spring-cloud-static/Dalston.SR2/#_spring_cloud_netflix 文中例子我做了一些测试在:h ...
Akka Typed 官方文档之随手记
️ 引言近两年,一直在折腾用FP与OO共存的编程语言Scala,采取以函数式编程为主的方式,结合TDD和BDD的手段,采用Domain Driven Design的方法学,去构造DDDD应用(Dom ...
Lagom 官方文档之随手记
引言 Lagom是出品Akka的Lightbend公司推出的一个微服务框架,目前最新版本为1.6.2.Lagom一词出自瑞典语,意为"适量". https://www.lagomf ...
【AutoMapper官方文档】DTO与Domin Model相互转换（上）
写在前面 AutoMapper目录: [AutoMapper官方文档]DTO与Domin Model相互转换(上) [AutoMapper官方文档]DTO与Domin Model相互转换(中) [Au ...

随机推荐

二分图的最大独立集最大匹配解题 Hopcroft-Karp算法
二分图模型中的最大独立集问题:在二分图G=(X,Y;E)中求取最小的顶点集V* ⊂ {X,Y},使得边 V*任意两点之间没有边相连. 公式: 最大独立集顶点个数 = 总的顶点数(|X|+|Y|)- 最 ...
【Oracle】删除undo表空间时，表空间被占用：ORA-30042: Cannot offline the undo tablespace
特别注意:此办法只用于实在没有办法的时候,因为需要加入oracle中的隐含参数,慎用!!! 1. 先查一下是什么在占用undo SYS@ENMOEDU>select segment_name,o ...
Python中OpenCV2. VS. CV1
OpenCV的2.4.7.版本生成了python的CV2模块,可以直接载入: 有兴趣的可以参考这个教程:http://blog.csdn.net/sunny2038/article/details/9 ...
3 Python+Selenium的元素定位方法（id、class name、name、tag name）
[环境] Python3.6+selenium3.0.2+IE11+Win7 [定位方法] 1.通过ID定位方法:find_element_by_id('xx') 2.通过name定位方法:fin ...
bootstrap初用新得1
## 基本准备 1. 首先把相关软件窗口规划好,对于我的喜好,我喜欢把除了浏览器外的其他软件分为左右两个半屏.左边和右边很多软件之间是需要配合使用的: * 左边: scss文件,ps的guid ...
MindManager 2019新版上市，了解一下！
所有的等待都是值得的!MindManager在蓄力一年后,给各位思维导图爱好者带来了全新的MindManager 2019 for Windows.全新的版本包含英语.德语.法语.俄语.中文.日语,新 ...
vc++如何创建程序-设置断点-函数的覆盖，c++的多态性
---恢复内容开始--- 如何设置断点小笔记将光标移动到你想设置断点的地方,按一下F9键即可,或者你可以用鼠标左键点击小手图标. CommentOut多行注释函数的覆盖是在父类与子类之间的,函数的 ...
洛谷 P2365 任务安排_代价提前计算 + 好题
最开始,笔者将状态 fif_{i}fi 定义为1到i的最小花费 ,我们不难得到这样的一个状态转移方程,即 fi=(sumti−sumtj+S+Costj)∗(sumfi−sumfj)f_{i}=(s ...
滚动效果--marquee的使用
1. <marquee></marquee>标签,默认从最右侧往左滚动: 2. marquee 支持的属性 (1)behavior设置滚动方式: <marquee beh ...
用div布局，页面copyright部分始终居于
<!DOCTYPE HTML><html><head><meta http-equiv="Content-Type" content=&q ...

pyspider architecture--官方文档