The data flow in Scrapy is controlled by the execution engine, and goes like this:
1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see
process_request()).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the
Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing
through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine,
passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks
for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.

scrapy Data flow的更多相关文章

  1. SSIS Data Flow优化

    一,数据流设计优化 数据流有两个特性:流和在内存缓冲区中处理数据,根据数据流的这两个特性,对数据流进行优化. 1,流,同时对数据进行提取,转换和加载操作 流,就是在source提取数据时,转换组件处理 ...

  2. Data Flow的Error Output

    一,在Data Flow Task中,对于Error Row的处理通过Error Output Tab配置的. 1,操作失败的类型:Error(Conversion) 和 Truncation. 2, ...

  3. SSIS Data Flow 的 Execution Tree 和 Data Pipeline

    一,Execution Tree 执行树是数据流组件(转换和适配器)基于同步关系所建立的逻辑分组,每一个分组都是一个执行树的开始和结束,也可以将执行树理解为一个缓冲区的开始和结束,即缓冲区的整个生命周 ...

  4. SSIS的 Data Flow 和 Control Flow

    Control Flow 和 Data Flow,是SSIS Design中主要用到的两个Tab,理解这两个Tab的作用,对设计更高效的package十分重要. 一,Control Flow 在Con ...

  5. Intel® Threading Building Blocks (Intel® TBB) Developer Guide 中文 Parallelizing Data Flow and Dependence Graphs并行化data flow和依赖图

    https://www.threadingbuildingblocks.org/docs/help/index.htm Parallelizing Data Flow and Dependency G ...

  6. SSIS ->> Data Flow Design And Tuning

    Requirements: Source and destination system impact Processing time windows and performance Destinati ...

  7. SSIS ->> Control Flow And Data Flow

    In the Control Flow, the task is the smallest unit of work, and a task requires completion (success, ...

  8. Data Flow ->> Union All

    Wrox的<Professional Microsoft SQL Server 2012 Integration Services>一书中再讲Merge的时候有这样一段解释: This t ...

  9. Data Flow ->> Import Column & Export Column

    这两个transformation的作用是把DT_TEXT, DT_NTEXT, DT_IMAGE类型的数据在文件系统和数据库间导出或者导入.比如把某个数据库表的image类型的字段导出到文件系统成为 ...

随机推荐

  1. 解决no module named ipykernel_launcher

    解决no module named ipykernel_launcher 最近开hydrogen的时候,提示no module named ipykernel_launcher. 记得以前解决过这个问 ...

  2. Shapley值的一个应用

    看书有这样一个问题,某互联网公司今天需要加班,需要编写一个500行的程序代码,产品经理找了三个程序员来完成.按照完成量发奖金:1号普通程序员独立能写100行,2号大神程序员独立能写125行,3号美女程 ...

  3. JQuery的可见性选择器

    1. <div id="test" style="width:400px;height:200; background:#0000ff;display:block; ...

  4. Gson的入门使用

    Java对象和Json之间的互转,一般用的比较多的两个类库是Jackson和Gson,下面记录一下Gson的学习使用. 基础概念:  Serialization:序列化,使Java对象到Json字符串 ...

  5. POJ-2253.Frogger.(求每条路径中最大值的最小值,最短路变形)

    做到了这个题,感觉网上的博客是真的水,只有kuangbin大神一句话就点醒了我,所以我写这篇博客是为了让最短路的入门者尽快脱坑...... 本题思路:本题是最短路的变形,要求出最短路中的最大跳跃距离, ...

  6. 结构体指offsetof宏详细解析

    1.#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE*)0)->MEMBER)     (include/linux/stddef.h) ...

  7. 725. Split Linked List in Parts把链表分成长度不超过1的若干部分

    [抄题]: Given a (singly) linked list with head node root, write a function to split the linked list in ...

  8. 拿来主义:treeview插件父子节点问题

    鄙人公司没有专门的前端,所以项目开发中都是前后端一起抡.最近用bootstrap用的比较频繁,发现bootstrap除了框架本身的样式组件外,还提供了多种插件供开发者选择.本篇博文讲的就是bootst ...

  9. 微信小程序开发 - 用户授权登陆

    准备:微信开发者工具下载地址:https://developers.weixin.qq.com/miniprogram/dev/devtools/download.html 微信小程序开发文档:htt ...

  10. CodeForces - 589D

    题目链接:http://codeforces.com/problemset/problem/589/D 思路:将每个人的信息转化为自变量为时间因变量为位置的一元方程.再一个个判断是否相遇. 若两人同向 ...