Scrapy Tutorial (11): Starting Spiders via the API

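Besides the scrapy crawl <spider name> command, Scrapy also lets you start spiders from a script through its API.

Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor.

Two APIs can run spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.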

scrapy.crawler.CrawlerProcess

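This class starts the Twisted reactor for you, configures logging, and arranges for the reactor to shut down automatically once crawling ends. It is the class that all Scrapy commands use.

Running a single spider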

import scrapy


class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        # Request only the first start URL and route the response to parse().
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)  # pass the spider class
    process.start()  # blocks until the crawl is finished

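The argument to process.crawl() can be either the spider class QiushispiderSpider or the spider name 'qiushiSpider'. A name is resolved through the project's spider loader, so it only works when the project settings are loaded.

Run this way, the script does not load the project's settings file, as the log shows: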

2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}

Loading the project settings

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() reads the settings module of the enclosing project.
process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider)  # or the spider name, 'qiushiSpider'
process.start()

Running multiple spiders

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider1(scrapy.Spider):
    ...


class MySpider2(scrapy.Spider):
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()

scrapy.crawler.CrawlerRunner

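1. Gives finer control over the crawling process.

2. The Twisted reactor must be started and stopped explicitly.

3. Callbacks must be added to the Deferred that CrawlerRunner.crawl returns.

Running a single spider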

import scrapy


class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        # Request only the first start URL and route the response to parse().
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

configure_logging sets the log output format.

addBoth adds a callback that stops the Twisted reactor, whether the crawl succeeds or fails.

Running multiple spiders

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    ...


class MySpider2(scrapy.Spider):
    ...


configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished

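The same can be written in asynchronous style with defer.inlineCallbacks, which runs the spiders one after another: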

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    ...


class MySpider2(scrapy.Spider):
    ...


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before continuing.
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished

References:

https://blog.csdn.net/weixin_33857230/article/details/89571872