通过核心ＡＰＩ启动单个或多个scrapy爬虫

1. 可以使用API从脚本运行Scrapy，而不是运行Scrapy的典型方法scrapy crawl；Scrapy是基于Twisted异步网络库构建的，因此需要在Twisted容器内运行它，可以通过两个API来运行单个或多个爬虫scrapy.crawler.CrawlerProcess、scrapy.crawler.CrawlerRunner。

2. 启动爬虫的的第一个实用程序是scrapy.crawler.CrawlerProcess 。该类将为您启动Twisted reactor，配置日志记录并设置关闭处理程序，此类是所有Scrapy命令使用的类。

示例运行单个爬虫：

交流群：1029344413 源码、素材学习资料
import scrapy

from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):

    # Your spider definition

    ...

process = CrawlerProcess({

    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'

})

process.crawl(MySpider)

process.start() # the script will block here until the crawling is finished

通过CrawlerProcess传入参数，并使用get_project_settings获取Settings 项目设置的实例。

from scrapy.crawler import CrawlerProcess

from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.

process.crawl('followall', domain='scrapinghub.com')

process.start() # the script will block here until the crawling is finished

还有另一个Scrapy实例方式可以更好地控制爬虫运行过程：scrapy.crawler.CrawlerRunner。此类封装了一些简单的帮助程序来运行多个爬虫程序，但它不会以任何方式启动或干扰现有的爬虫。
使用此类，显式运行reactor。如果已有爬虫在运行想在同一个进程中开启另一个Scrapy，建议您使用CrawlerRunner 而不是CrawlerProcess。
注意，爬虫结束后需要手动关闭Twisted reactor，通过向CrawlerRunner.crawl方法返回的延迟添加回调来实现。

下面是它的用法示例，在MySpider完成运行后手动停止容器的回调。

from twisted.internet import reactor

import scrapy

from scrapy.crawler import CrawlerRunner

from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):

    # Your spider definition

    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

runner = CrawlerRunner()

d = runner.crawl(MySpider)

d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until the crawling is finished

在同一个进程中运行多个蜘蛛

默认情况下，Scrapy在您运行时为每个进程运行一个蜘蛛。但是，Scrapy支持使用内部API为每个进程运行多个蜘蛛。

这是一个同时运行多个蜘蛛的示例：

import scrapy

from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):

    # Your first spider definition

    ...

class MySpider2(scrapy.Spider):

    # Your second spider definition

    ...

process = CrawlerProcess()

process.crawl(MySpider1)

process.crawl(MySpider2)

process.start() # the script will block here until all crawling jobs are finished

使用CrawlerRunner示例：

import scrapy

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner

from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):

    # Your first spider definition

    ...

class MySpider2(scrapy.Spider):

    # Your second spider definition

    ...

configure_logging()

runner = CrawlerRunner()

runner.crawl(MySpider1)

runner.crawl(MySpider2)

d = runner.join()

d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

相同的示例，但通过异步运行爬虫蛛：

from twisted.internet import reactor, defer

from scrapy.crawler import CrawlerRunner

from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):

    # Your first spider definition

    ...

class MySpider2(scrapy.Spider):

    # Your second spider definition

    ...

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks

def crawl():

    yield runner.crawl(MySpider1)

    yield runner.crawl(MySpider2)

    reactor.stop()

crawl()

reactor.run() # the script will block here until the last crawl call is finished

通过核心ＡＰＩ启动单个或多个scrapy爬虫的更多相关文章

spark 入门学习核心api
spark入门教程(3)--Spark 核心API开发原创 2016年04月13日 20:52:28 标签: spark / 分布式 / 大数据 / 教程 / 应用 4999 本教程源于2016年3 ...
SDL 开发实战（二）：SDL 2.0 核心 API 解析
在上一篇文章 SDL 开发实战(一):SDL介绍及开发环境配置中,我们配置好了SDL的开发环境,并成功运行了SDL的Hello World 代码.但是可能大部分人还是读不太明白具体Hello Wol ...
Spark2.0学习（三）--------核心API
Spark核心API----------------- [SparkContext] 连接到spark集群,入口点. [HadoopRDD] 读取hadoop上的数据, [MapPartitionsR ...
hibernate框架学习第二天：核心API、工具类、事务、查询、方言、主键生成策略等
核心API Configuration 描述的是一个封装所有配置信息的对象 1.加载hibernate.properties(非主流,早期) Configuration conf = new Conf ...
Lucene系列六：Lucene搜索详解（Lucene搜索流程详解、搜索核心API详解、基本查询详解、QueryParser详解）
一.搜索流程详解 1. 先看一下Lucene的架构图由图可知搜索的过程如下: 用户输入搜索的关键字.对关键字进行分词.根据分词结果去索引库里面找到对应的文章id.根据文章id找到对应的文章 2. L ...
配置文件详解和核心api讲解
一.配置文件详解 1.映射文件详解 1.映射配置文件的位置和名称没有限制. -建议:位置:和实体类放在统一目录下. 名称:实体类名称.hbm.xml. 2.在映射配置文件中,标签内的name属 ...
Activiti入门 -- 环境搭建和核心API简介
相关文章: <史上最权威的Activiti框架学习指南> <Activiti入门 -- 轻松解读数据库> 本章内容,主要讲解Activiti框架环境的搭建,能够使用Activi ...
bootparam - 介绍Linux核心的启动参数
描叙 Linux 核心在启动的时候可以接受指定的"命令行参数"或"启动参数"．在通常情况下,由于核心有可能无法识别某些硬件,或可能将某些硬件识别为不正确的配置, ...
java多线程核心api以及相关概念(一)
这篇博客总结了对线程核心api以及相关概念的学习,黑体字可以理解为重点,其他的都是我对它的理解个人认为这些是学习java多线程的基础,不理解熟悉这些,后面的也不可能学好滴目录 1.什么是线程以及优 ...

随机推荐

HTML提供的6种空格
HTML提供了6种空格(space entity),它们拥有不同的宽度. 非断行空格( )是常规空格的宽度,可运行于所有主流浏览器.其它几种空格( . . .‌.‍)在不同浏览器中宽度各异. ...
jQuery签名插件jSignature
1.引入jSignature.min.js和jquery.min.js文件2.代码 <div id="signature"></div> 3.js 初始化 ...
Python--day46--用户管理设计方案介绍
1,基于用户权限管理: 2,基于角色的权限管理: 开始一个项目如果要100天的,可能70天都在设计,比如设计数据库表结构,最后30天才是写代码.设计是最难的,写代码是最简单的. 还有一个重要的一点,写 ...
java 泛型的上限与下限
设置泛型对象的上限使用extends,表示参数类型只能是该类型或该类型的子类: 声明对象:类名<? extends 类> 对象名定义类:类名<泛型标签 extends 类>{ ...
vue-上传文件
<label for="exampleInputFile">头像</label> <img :src=" imgsrc != '' ? im ...
H3C 链路聚合显示及维护
【20.95%】【UESTC 94】Bracket Squence
Time Limit: 3000/1000MS (Java/Others) Memory Limit: 65535/65535KB (Java/Others) Submit Status T ...
Spring Security 学习笔记-securityContext过滤器
位于过滤器顶端,第一个起作用的过滤器.SecurityContextPersistenceFilter 在执行其他过滤器之前,率先判断用户的session中是否已经存在一个SecurityContex ...
P3521 [POI2011]ROT-Tree Rotations （线段树合并）
P3521 [POI2011]ROT-Tree Rotations 题意: 给你一颗树,只有叶子节点有权值,你可以交换一个点的左右子树,问你最小的逆序对数题解: 线段树维护权值个个数即可然后左右子 ...
CITRIX ADC配置SSL卸载
如上图,将ssl的加密解密放在前端的负载均衡设备上,客户端到VPX的访问都是加密的,VPX到后端的服务器都是http的 Step1:上传证书到VPX,如下图: Step2:创建SSL的虚拟服务器并且绑 ...

通过核心ＡＰＩ启动单个或多个scrapy爬虫

在同一个进程中运行多个蜘蛛

通过核心ＡＰＩ启动单个或多个scrapy爬虫的更多相关文章

随机推荐

热门专题