pyspider architecture--官方文档
原文地址:http://docs.pyspider.org/en/latest/Architecture/
Architecture
This document describes the reason why I made pyspider and the architecture.
Why
Two years ago, I was working on a vertical search engine. We are facing following needs on crawling:
collect 100-200 websites, they may on/offline or change their templates at any time
We need a really powerful monitor to find out which website is changing. And a good tool to help us write script/template for each website.
data should be collected in 5min when website updated
We solve this problem by check index page frequently, and use something like 'last update time' or 'last reply time' to determine which page is changed. In addition to this, we recheck pages after X days in case to prevent the omission.
pyspider will never stop as WWW is changing all the time
Furthermore, we have some APIs from our cooperators, the API may need POST, proxy, request signature etc. Full control from script is more convenient than some global parameters of components.
Overview
The following diagram shows an overview of the pyspider architecture with its components and an outline of the data flow that takes place inside the system.
Components are connected by message queue. Every component, including message queue, is running in their own process/thread, and replaceable. That means, when process is slow, you can have many instances of processor and make full use of multiple CPUs, or deploy to multiple machines. This architecture makes pyspider really fast. benchmarking.
Components
Scheduler
The Scheduler receives tasks from newtask_queue from processor. Decide whether the task is new or requires re-crawl. Sort tasks according to priority and feeding them to fetcher with traffic control (token bucket algorithm). Take care of periodic tasks, lost tasks and failed tasks and retry later.
All of above can be set via self.crawl
API.
Note that in current implement of scheduler, only one scheduler is allowed.
Fetcher
The Fetcher is responsible for fetching web pages then send results to processor. For flexible, fetcher support Data URI and pages that rendered by JavaScript (via phantomjs). Fetch method, headers, cookies, proxy, etag etc can be controlled by script via API.
Phantomjs Fetcher
Phantomjs Fetcher works like a proxy. It's connected to general Fetcher, fetch and render pages with JavaScript enabled, output a general HTML back to Fetcher:
scheduler -> fetcher -> processor
|
phantomjs
|
internet
Processor
The Processor is responsible for running the script written by users to parse and extract information. Your script is running in an unlimited environment. Although we have various tools(like PyQuery) for you to extract information and links, you can use anything you want to deal with the response. You may refer to Script Environment and API Reference to get more information about script.
Processor will capture the exceptions and logs, send status(task track) and new tasks to scheduler
, send results to Result Worker
.
Result Worker (optional)
Result worker receives results from Processor
. Pyspider has a built-in result worker to save result to resultdb
. Overwrite it to deal with result by your needs.
WebUI
WebUI is a web frontend for everything. It contains:
- script editor, debugger
- project manager
- task monitor
- result viewer, exporter
Maybe webui is the most attractive part of pyspider. With this powerful UI, you can debug your scripts step by step just as pyspider do. Starting or stop a project. Finding which project is going wrong and what request is failed and try it again with debugger.
Data flow
The data flow in pyspider is just as your seen in diagram above:
- Each script has a callback named
on_start
, when you press theRun
button on WebUI. A new task ofon_start
is submitted to Scheduler as the entries of project. - Scheduler dispatches this
on_start
task with a Data URI as a normal task to Fetcher. - Fetcher makes a request and a response to it (for Data URI, it's a fake request and response, but has no difference with other normal tasks), then feeds to Processor.
- Processor calls the
on_start
method and generated some new URL to crawl. Processor send a message to Scheduler that this task is finished and new tasks via message queue to Scheduler (here is no results foron_start
in most case. If has results, Processor send them toresult_queue
). - Scheduler receives the new tasks, looking up in the database, determine whether the task is new or requires re-crawl, if so, put them into task queue. Dispatch tasks in order.
- The process repeats (from step 3) and wouldn't stop till WWW is dead ;-). Scheduler will check periodic tasks to crawl latest data.
pyspider architecture--官方文档的更多相关文章
- Spring 4 官方文档学习 Spring与Java EE技术的集成
本部分覆盖了以下内容: Chapter 28, Remoting and web services using Spring -- 使用Spring进行远程和web服务 Chapter 29, Ent ...
- OGR 官方文档
OGR 官方文档 http://www.gdal.org/ogr/index.html The OGR Simple Features Library is a C++ open source lib ...
- cassandra 3.x官方文档(5)---探测器
写在前面 cassandra3.x官方文档的非官方翻译.翻译内容水平全依赖本人英文水平和对cassandra的理解.所以强烈建议阅读英文版cassandra 3.x 官方文档.此文档一半是翻译,一半是 ...
- Cuda 9.2 CuDnn7.0 官方文档解读
目录 Cuda 9.2 CuDnn7.0 官方文档解读 准备工作(下载) 显卡驱动重装 CUDA安装 系统要求 处理之前安装的cuda文件 下载的deb安装过程 下载的runfile的安装过程 安装完 ...
- hbase官方文档(转)
FROM:http://www.just4e.com/hbase.html Apache HBase™ 参考指南 HBase 官方文档中文版 Copyright © 2012 Apache Soft ...
- HBase官方文档
HBase官方文档 目录 序 1. 入门 1.1. 介绍 1.2. 快速开始 2. Apache HBase (TM)配置 2.1. 基础条件 2.2. HBase 运行模式: 独立和分布式 2.3. ...
- Spring Cloud官方文档中文版-服务发现:Eureka客户端
官方文档地址为:http://cloud.spring.io/spring-cloud-static/Dalston.SR2/#_spring_cloud_netflix 文中例子我做了一些测试在:h ...
- Akka Typed 官方文档之随手记
️ 引言 近两年,一直在折腾用FP与OO共存的编程语言Scala,采取以函数式编程为主的方式,结合TDD和BDD的手段,采用Domain Driven Design的方法学,去构造DDDD应用(Dom ...
- Lagom 官方文档之随手记
引言 Lagom是出品Akka的Lightbend公司推出的一个微服务框架,目前最新版本为1.6.2.Lagom一词出自瑞典语,意为"适量". https://www.lagomf ...
- 【AutoMapper官方文档】DTO与Domin Model相互转换(上)
写在前面 AutoMapper目录: [AutoMapper官方文档]DTO与Domin Model相互转换(上) [AutoMapper官方文档]DTO与Domin Model相互转换(中) [Au ...
随机推荐
- ASP.NET CORE读取appsettings.json的配置
如何在appsettings.json配置应用程序设置,微软给出的方法:https://docs.microsoft.com/en-us/aspnet/core/fundamentals/config ...
- Oracle"TNS监听程序找不到符合协议堆栈要求的可用处理程序"解决方案
问题描述:在使用ETL工具通过odbc方式连接Oracle进行数据抽取的过程中,Oracle 监听日志报错如下: 根本原因就是Oracle的process和session已经达到了甚至超过了最大值,解 ...
- hiho1804 - 整数分解、组合数、乘法逆元
题目链接 题目叙述很啰嗦,可以简化为:n个球[1-1e5],放到m个不同的桶里,一共多少种不同的放法.[桶里可以不放] ---------------------------------------- ...
- 2 Selenium3.0+Python3.6环境搭建
[说明] 再次搭建一次环境,是因为遇到怎么都打不开IE的问题了,环境信息为:Selenium3.0+Python3.6+win7+ie10 [搭建步骤] 1.下载Python3.6,并点击安装和配置环 ...
- XML文件操作之dom4j
能够操作xml的api还是挺多的,DOM也是可以的,不过在此记录下dom4j的使用,感觉确实挺方便的 所需jar包官网地址:http://www.dom4j.org/dom4j-1.6.1/ dom4 ...
- Go 语言一本通
什么是GO语言? Go 是一个开源的编程语言,它能让构造简单.可靠且高效的软件变得容易. Go是从2007年末由Robert Griesemer, Rob Pike, Ken Thompson主持开发 ...
- 路飞学城Python-Day77
11-DIY一个web框架3 web框架 yuan功能总结 main.py: 启动文件,封装了socket 1 urls.py: 路径与视图函数映射关系 ---- url控制器 2 views.py ...
- 小白学习Spark系列四:RDD踩坑总结(scala+spark2.1 sql常用方法)
初次尝试用 Spark+scala 完成项目的重构,由于两者之前都没接触过,所以边学边用的过程大多艰难.首先面临的是如何快速上手,然后是代码调优.性能调优.本章主要记录自己在项目中遇到的问题以及解决方 ...
- int rc = -EINVAL是什么意思
rc应该是return code的意思,将函数返回值rc初始化为-EINVAL,EINVAL由POSIX.1规范中的一个宏,一般通过包含C标准头文件errno.h,表示参数无效(invalid arg ...
- BZOJ 1367 [Baltic2004]sequence (可并堆)
题面:BZOJ传送门 题目大意:给你一个序列$a$,让你构造一个递增序列$b$,使得$\sum |a_{i}-b_{i}|$最小,$a_{i},b_{i}$均为整数 神仙题.. 我们先考虑b不递减的情 ...