Scrapy Source Code Analysis Series, Part 1: spider, spidermanager, crawler, cmdline, command

The source version analyzed here is 0.24.6; URL: https://github.com/DiamondStudio/scrapy/blob/0.24.6

As the Scrapy source tree on GitHub shows, the package contains the following subpackages:

commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib

It also contains the following modules:

_monkeypatches.py, cmdline.py, command.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py,
extension.py, interfaces.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py,
middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py,
spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py

We start with the most important modules.

0. Third-party libraries and frameworks that Scrapy depends on

twisted

1. Modules: spider, spidermanager, crawler, cmdline, command

1.1 spider.py spidermanager.py crawler.py

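spider.py defines the base class for spiders. In 0.24.6 that class is Spider, and BaseSpider remains only as a deprecated alias. Each spider instance can have only one crawler attribute. So what does a crawler actually do?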

crawler.py defines the classes Crawler and CrawlerProcess.

Class Crawler depends on SignalManager, ExtensionManager, and ExecutionEngine, as well as the settings STATS_CLASS, SPIDER_MANAGER_CLASS, and LOG_FORMATTER.
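Settings such as STATS_CLASS, SPIDER_MANAGER_CLASS, and LOG_FORMATTER hold dotted import paths rather than classes, and Scrapy resolves them at runtime (through scrapy.utils.misc.load_object). The following is a minimal sketch of that idea using only the standard library. The function name load_object_sketch and the sample settings values are assumptions made for illustration; they are not Scrapy's code or defaults.

```python
from importlib import import_module


def load_object_sketch(path):
    """Resolve a dotted path such as 'package.module.Name' to the object it names."""
    module_path, _, name = path.rpartition(".")
    if not module_path:
        raise ValueError("not a full dotted path: %r" % path)
    module = import_module(module_path)
    return getattr(module, name)


# Hypothetical settings: the keys mirror Scrapy's, but the values point at
# standard-library classes so the sketch runs without Scrapy installed.
settings = {
    "STATS_CLASS": "collections.Counter",
    "LOG_FORMATTER": "logging.Formatter",
}

stats_cls = load_object_sketch(settings["STATS_CLASS"])
stats = stats_cls()
stats["pages_crawled"] += 1
```

Keeping the class behind a string setting is what lets a project swap in its own stats collector or spider manager without touching Crawler.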

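Class CrawlerProcess runs multiple Crawlers sequentially within a single process and starts the crawl. It depends on twisted.internet.reactor and twisted.internet.defer. This class comes up again in section 1.2, under cmdline.py.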

spidermanager.py defines class SpiderManager, which creates and manages all website-specific spiders.

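 from zope.interface import implements

 from scrapy.interfaces import ISpiderManager
 from scrapy.utils.misc import walk_modules
 from scrapy.utils.spider import iter_spider_classes


 class SpiderManager(object):

     implements(ISpiderManager)

     def __init__(self, spider_modules):
         self.spider_modules = spider_modules
         # maps each spider's name to its class
         self._spiders = {}
         for name in self.spider_modules:
             for module in walk_modules(name):
                 self._load_spiders(module)

     def _load_spiders(self, module):
         for spcls in iter_spider_classes(module):
             self._spiders[spcls.name] = spcls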
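The loading loop above boils down to building a dictionary from spider name to spider class. Here is a minimal, Scrapy-free sketch of that registry pattern. Every name in it (Spider, SpiderRegistry, iter_spider_classes_sketch, the demo module and its two spiders) is hypothetical and exists only for illustration.

```python
import inspect
import types


class Spider(object):
    """Stand-in for a spider base class (hypothetical, not Scrapy's)."""
    name = None


def iter_spider_classes_sketch(module):
    """Yield Spider subclasses defined in `module` that declare a name."""
    for obj in list(vars(module).values()):
        if (inspect.isclass(obj)
                and issubclass(obj, Spider)
                and obj.__module__ == module.__name__
                and getattr(obj, "name", None)):
            yield obj


class SpiderRegistry(object):
    """Maps spider names to spider classes."""

    def __init__(self, modules):
        self._spiders = {}
        for module in modules:
            for spcls in iter_spider_classes_sketch(module):
                self._spiders[spcls.name] = spcls

    def list(self):
        return sorted(self._spiders)

    def create(self, name, **kwargs):
        return self._spiders[name](**kwargs)


class BlogSpider(Spider):
    name = "blog"


class NewsSpider(Spider):
    name = "news"


# Build a throwaway module so the sketch needs no files on disk.
demo = types.ModuleType("demo_spiders")
for cls in (BlogSpider, NewsSpider):
    cls.__module__ = demo.__name__
    setattr(demo, cls.__name__, cls)
demo.Spider = Spider  # an imported base class; skipped because it has no name

registry = SpiderRegistry([demo])
```

Because the registry is keyed by the name attribute, two spiders that share a name silently overwrite each other; the later one wins.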

1.2 cmdline.py command.py

cmdline.py defines one public function: execute(argv=None, settings=None).

execute() is the entry point of the scrapy command-line tool, as the launcher script shows:

 XiaoKL$ cat `which scrapy`
 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 import re
 import sys

 from scrapy.cmdline import execute

 if __name__ == '__main__':
     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
     sys.exit(execute())
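The only non-obvious line in the launcher is the re.sub() call, which normalizes the program name before execute() sees it. It strips a trailing "-script.pyw" or ".exe", suffixes that most likely come from Windows-style launcher wrappers (that origin is an assumption; the script itself does not say). A small sketch of its effect, with normalize_argv0 as a hypothetical helper name:

```python
import re

# Same pattern as in the launcher script shown above.
SUFFIX = re.compile(r'(-script\.pyw|\.exe)?$')


def normalize_argv0(argv0):
    """Strip a trailing '-script.pyw' or '.exe' from the program name."""
    return SUFFIX.sub('', argv0)


names = [normalize_argv0(n) for n in
         ("scrapy", "scrapy.exe", "scrapy-script.pyw", "/usr/bin/scrapy")]
# names == ['scrapy', 'scrapy', 'scrapy', '/usr/bin/scrapy']
```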

execute() is therefore a natural starting point for analyzing the Scrapy source. Here is the function:

 def execute(argv=None, settings=None):
     if argv is None:
         argv = sys.argv

     # --- backwards compatibility for scrapy.conf.settings singleton ---
     if settings is None and 'scrapy.conf' in sys.modules:
         from scrapy import conf
         if hasattr(conf, 'settings'):
             settings = conf.settings
     # ------------------------------------------------------------------

     if settings is None:
         settings = get_project_settings()
     check_deprecated_settings(settings)

     # --- backwards compatibility for scrapy.conf.settings singleton ---
     import warnings
     from scrapy.exceptions import ScrapyDeprecationWarning
     with warnings.catch_warnings():
         warnings.simplefilter("ignore", ScrapyDeprecationWarning)
         from scrapy import conf
         conf.settings = settings
     # ------------------------------------------------------------------

     inproject = inside_project()
     cmds = _get_commands_dict(settings, inproject)
     cmdname = _pop_command_name(argv)
     parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
         conflict_handler='resolve')
     if not cmdname:
         _print_commands(settings, inproject)
         sys.exit(0)
     elif cmdname not in cmds:
         _print_unknown_command(settings, cmdname, inproject)
         sys.exit(2)

     cmd = cmds[cmdname]
     parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
     parser.description = cmd.long_desc()
     settings.setdict(cmd.default_settings, priority='command')
     cmd.settings = settings
     cmd.add_options(parser)
     opts, args = parser.parse_args(args=argv[1:])
     _run_print_help(parser, cmd.process_options, args, opts)

     cmd.crawler_process = CrawlerProcess(settings)
     _run_print_help(parser, _run_command, cmd, args, opts)
     sys.exit(cmd.exitcode)

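execute() mainly does four things: it works out the command name from the command line and loads the Scrapy command modules; it parses the command-line arguments; it obtains the settings; and it creates a CrawlerProcess object.

The settings and the CrawlerProcess object are assigned to attributes of the ScrapyCommand (or subclass) object, and the parsed command-line arguments are passed to its process_options() and run() methods.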
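The flow just described (look up a command by name, configure a parser from it, inject the settings, then run it) can be sketched without Scrapy. This is a simplified illustration under stated assumptions: Command, Version, pop_command_name, and execute_sketch are hypothetical names, the settings object is a plain dict, and the return codes stand in for the sys.exit() calls in the real function.

```python
import optparse


class Command(object):
    """Hypothetical stand-in for a command base class."""
    default_settings = {}
    exitcode = 0

    def __init__(self):
        self.settings = None  # filled in by the dispatcher

    def syntax(self):
        return "[options]"

    def add_options(self, parser):
        parser.add_option("-q", "--quiet", action="store_true", default=False)

    def run(self, args, opts):
        raise NotImplementedError


class Version(Command):
    def run(self, args, opts):
        self.output = "demo %s" % self.settings["VERSION"]


def pop_command_name(argv):
    """Remove and return the first non-option argument after argv[0]."""
    for i, arg in enumerate(argv[1:], 1):
        if not arg.startswith("-"):
            del argv[i]
            return arg
    return None


def execute_sketch(argv, settings, commands):
    argv = list(argv)
    cmdname = pop_command_name(argv)
    if cmdname is None:
        return 0  # nothing to run; the real function prints the command list
    if cmdname not in commands:
        return 2  # unknown command
    cmd = commands[cmdname]
    parser = optparse.OptionParser(conflict_handler="resolve")
    parser.usage = "demo %s %s" % (cmdname, cmd.syntax())
    # Simplification: command defaults never override explicit settings here,
    # whereas Scrapy handles this with settings priorities.
    for key, value in cmd.default_settings.items():
        settings.setdefault(key, value)
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    cmd.run(args, opts)
    return cmd.exitcode


commands = {"version": Version()}
code = execute_sketch(["demo", "version", "-q"], {"VERSION": "0.1"}, commands)
```

The dispatcher never names a concrete command class; it relies only on the small interface (syntax, add_options, run) that every command shares.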

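Naturally, the next thing to look at is the module that defines class ScrapyCommand: command.py.

The subclasses of ScrapyCommand are defined in the subpackage scrapy.commands.

The second call to _run_print_help() ultimately invokes cmd.run() to execute the command: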

 def _run_print_help(parser, func, *a, **kw):
     try:
         func(*a, **kw)
     except UsageError as e:
         if str(e):
             parser.error(str(e))
         if e.print_help:
             parser.print_help()
         sys.exit(2)

Here func is the argument _run_command, whose implementation essentially just calls cmd.run():

 def _run_command(cmd, args, opts):
     if opts.profile or opts.lsprof:
         _run_command_profiled(cmd, args, opts)
     else:
         cmd.run(args, opts)
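The wrapper pattern used by _run_print_help() can be tried in isolation: any callable is run, and a usage error raised inside it becomes a usage message plus exit status 2. The sketch below assumes a locally defined UsageError that mimics the shape of scrapy.exceptions.UsageError; check_args and the parser usage string are hypothetical.

```python
import optparse
import sys


class UsageError(Exception):
    """Local stand-in for scrapy.exceptions.UsageError (illustration only)."""

    def __init__(self, *args, **kwargs):
        self.print_help = kwargs.pop("print_help", True)
        super(UsageError, self).__init__(*args, **kwargs)


def run_print_help(parser, func, *args, **kwargs):
    """Call func; turn a UsageError into a usage message and exit status 2."""
    try:
        func(*args, **kwargs)
    except UsageError as e:
        if str(e):
            parser.error(str(e))  # prints usage and the message, then exits
        if e.print_help:
            parser.print_help()
        sys.exit(2)


def check_args(args):
    """Hypothetical validation step, comparable to process_options()."""
    if len(args) != 1:
        raise UsageError("exactly one argument is required")


parser = optparse.OptionParser(usage="demo crawl <spider>")

run_print_help(parser, check_args, ["myspider"])  # valid input: no exit

try:
    run_print_help(parser, check_args, [])
except SystemExit as exc:
    exit_status = exc.code
```

Note that optparse's parser.error() exits on its own, so the print_help branch is only reached for a UsageError raised with an empty message.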

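This design, in which cmdline knows nothing about the individual commands, is worth borrowing in our own work.

command.py defines class ScrapyCommand, the base class of all Scrapy commands. Here is a quick look at the interface it provides: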

 import warnings

 from scrapy.exceptions import ScrapyDeprecationWarning


 class ScrapyCommand(object):

     requires_project = False
     crawler_process = None

     # default settings to be used for this command instead of global defaults
     default_settings = {}

     exitcode = 0

     def __init__(self):
         self.settings = None  # set in scrapy.cmdline

     def set_crawler(self, crawler):
         assert not hasattr(self, '_crawler'), "crawler already set"
         self._crawler = crawler

     @property
     def crawler(self):
         warnings.warn("Command's default `crawler` is deprecated and will be removed. "
             "Use `create_crawler` method to instatiate crawlers.",
             ScrapyDeprecationWarning)

         if not hasattr(self, '_crawler'):
             crawler = self.crawler_process.create_crawler()

             old_start = crawler.start
             self.crawler_process.started = False

             def wrapped_start():
                 if self.crawler_process.started:
                     old_start()
                 else:
                     self.crawler_process.started = True
                     self.crawler_process.start()

             crawler.start = wrapped_start

             self.set_crawler(crawler)

         return self._crawler

     # The bodies of the methods below are omitted in this listing.

     def syntax(self):
         pass

     def short_desc(self):
         pass

     def long_desc(self):
         pass

     def help(self):
         pass

     def add_options(self, parser):
         pass

     def process_options(self, args, opts):
         pass

     def run(self, args, opts):
         pass

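Class attributes of ScrapyCommand:

requires_project: whether the command must be run inside a Scrapy project.

crawler_process: the CrawlerProcess object, set by execute() in cmdline.py.

Methods of ScrapyCommand worth noting:

def crawler(self): creates the Crawler object lazily, on first access.

def run(self, args, opts): must be overridden by subclasses.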
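Both points can be illustrated with a small stand-alone sketch: a property that creates its object on first access and caches it, and a subclass whose only job is to supply run(). CommandBase, FakeCrawlerProcess, and ListArgs are hypothetical names, and the fake process returns a plain dict instead of a real Crawler.

```python
class CommandBase(object):
    """Hypothetical stand-in mirroring the shape of ScrapyCommand."""
    requires_project = False
    crawler_process = None  # injected by the dispatcher before run()
    exitcode = 0

    @property
    def crawler(self):
        # Create the crawler on first access and reuse it afterwards.
        if not hasattr(self, "_crawler"):
            self._crawler = self.crawler_process.create_crawler()
        return self._crawler

    def run(self, args, opts):
        raise NotImplementedError


class FakeCrawlerProcess(object):
    """Counts how many crawlers were created (illustration only)."""

    def __init__(self):
        self.created = 0

    def create_crawler(self):
        self.created += 1
        return {"id": self.created}


class ListArgs(CommandBase):
    """A subclass only has to supply run()."""

    def run(self, args, opts):
        if not args:
            self.exitcode = 1
            return
        self.seen = list(args)


cmd = ListArgs()
cmd.crawler_process = FakeCrawlerProcess()
first = cmd.crawler
second = cmd.crawler  # served from the cache; no second crawler is created
cmd.run(["a", "b"], opts=None)
```

Lazy creation means a command that never touches its crawler never pays for building one.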

Next, let us look at a concrete subclass of ScrapyCommand (see Python.Scrapy.14-scrapy-source-code-analysis-part-4).

To Be Continued:

The next part analyzes the modules signals.py, signalmanager.py, project.py, and conf.py; see Python.Scrapy.12-scrapy-source-code-analysis-part-2.

Python.Scrapy.11-scrapy-source-code-analysis-part-1