The data flow in Scrapy is controlled by the execution engine, and goes like this:
1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see
process_request()).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the
Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing
through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine,
passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks
for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.

scrapy Data flow的更多相关文章

  1. SSIS Data Flow优化

    一,数据流设计优化 数据流有两个特性:流和在内存缓冲区中处理数据,根据数据流的这两个特性,对数据流进行优化. 1,流,同时对数据进行提取,转换和加载操作 流,就是在source提取数据时,转换组件处理 ...

  2. Data Flow的Error Output

    一,在Data Flow Task中,对于Error Row的处理通过Error Output Tab配置的. 1,操作失败的类型:Error(Conversion) 和 Truncation. 2, ...

  3. SSIS Data Flow 的 Execution Tree 和 Data Pipeline

    一,Execution Tree 执行树是数据流组件(转换和适配器)基于同步关系所建立的逻辑分组,每一个分组都是一个执行树的开始和结束,也可以将执行树理解为一个缓冲区的开始和结束,即缓冲区的整个生命周 ...

  4. SSIS的 Data Flow 和 Control Flow

    Control Flow 和 Data Flow,是SSIS Design中主要用到的两个Tab,理解这两个Tab的作用,对设计更高效的package十分重要. 一,Control Flow 在Con ...

  5. Intel® Threading Building Blocks (Intel® TBB) Developer Guide 中文 Parallelizing Data Flow and Dependence Graphs并行化data flow和依赖图

    https://www.threadingbuildingblocks.org/docs/help/index.htm Parallelizing Data Flow and Dependency G ...

  6. SSIS ->> Data Flow Design And Tuning

    Requirements: Source and destination system impact Processing time windows and performance Destinati ...

  7. SSIS ->> Control Flow And Data Flow

    In the Control Flow, the task is the smallest unit of work, and a task requires completion (success, ...

  8. Data Flow ->> Union All

    Wrox的<Professional Microsoft SQL Server 2012 Integration Services>一书中再讲Merge的时候有这样一段解释: This t ...

  9. Data Flow ->> Import Column & Export Column

    这两个transformation的作用是把DT_TEXT, DT_NTEXT, DT_IMAGE类型的数据在文件系统和数据库间导出或者导入.比如把某个数据库表的image类型的字段导出到文件系统成为 ...

随机推荐

  1. window、location、location.href、self、top简单介绍

    1.self:当前窗口对象(如果是在iframe里,则为该框架的窗口对象) 2.top:父窗口对象 3.window:典型情况下,浏览器会为每一个打开的html创建对应的window对象,如果这个文档 ...

  2. win10版office365激活序列码

    win10版office365激活序列码(在别的地方找到一个) : NKGG6-WBPCC-HXWMY-6DQGJ-CPQVG 1.在线安装Office2016预览版后它是不会自动激活的,需在Offi ...

  3. 小程序微信支付java

    https://blog.csdn.net/qq_33452819/article/details/70314204#

  4. OpenStack 安装:glance 安装

    接上一篇keystone, 这一篇介绍glance服务: 在开始操作之前,先用source环境变量,然后创建glance 用户,并设置密码为glance [root@linux-node1 ~]#op ...

  5. Java框架spring Boot学习笔记(一):开始第一个项目

    新建一个项目,选择Spring initializr 修改group和项目名 添加依赖包Web,MongoDB 设置保存位置和工程名 新建一个test的文件 输入代码: package com.xxx ...

  6. hdoj2089(入门数位dp)

    题目链接:https://vjudge.net/problem/HDU-2089 题意:给定一段区间求出该区间中不含4且不含连续的62的数的个数. 思路:这周开始做数位dp专题,给自己加油^_^,一直 ...

  7. Handler实现消息的定时发送

    话不多说,直接上代码 private Handler mHandler = new Handler() { @Override public void handleMessage(Message ms ...

  8. python note 06 编码方式

    1.有如下值li= [11,22,33,44,55,66,77,88,99,90],将所有大于 66 的值保存至字典的第一个key中,将小于 66 的值保存至第二个key的值中.即: {'k1': 大 ...

  9. ABP .NET corej 版本 第一篇

    ABP是“ASP.NET Boilerplate Project (ASP.NET样板项目)”的简称. ABP使用以下技术: 服务器端: l ASP.NET MVC 5.Web API 2.C# 5. ...

  10. 1.6Eigen中系数运算Reductions, visitors and broadcasting

    Eigen::Matrix2d mat; mat<<,, ,; cout<<"矩阵所有系数之和:"<<mat.sum();//1+2+3+4=1 ...