Scrapy Source Code Analysis Series, Part 1: spider, spidermanager, crawler, cmdline, command

The source version analyzed is 0.24.6, url: https://github.com/DiamondStudio/scrapy/blob/0.24.6

As the Scrapy source tree on GitHub shows, the package contains these sub-packages:

commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib

and these modules:

_monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py,

extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py,

middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py,

spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py

We start with the most important modules.

0. Third-party libraries and frameworks that Scrapy depends on

twisted

1. Modules: spider, spidermanager, crawler, cmdline, command

1.1 spider.py spidermanager.py crawler.py

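spider.py defines the base class for spiders. In 0.24.6 that class is named Spider; BaseSpider survives only as a deprecated alias for it. A spider instance can have its crawler attribute set only once (set_crawler asserts that no crawler has been bound yet). So what does a crawler actually do?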

crawler.py defines the classes Crawler and CrawlerProcess.

Class Crawler depends on: SignalManager, ExtensionManager, ExecutionEngine, and the settings STATS_CLASS, SPIDER_MANAGER_CLASS and LOG_FORMATTER.

Class CrawlerProcess: runs multiple Crawlers sequentially in a single process and starts the crawl. Depends on: twisted.internet.reactor, twisted.internet.defer. This class comes up again in section 1.2, where cmdline.py creates it.

spidermanager.py defines the class SpiderManager, which creates and manages all website-specific spiders.

    class SpiderManager(object):

        implements(ISpiderManager)

        def __init__(self, spider_modules):
            self.spider_modules = spider_modules
            self._spiders = {}
            for name in self.spider_modules:
                for module in walk_modules(name):
                    self._load_spiders(module)

        def _load_spiders(self, module):
            for spcls in iter_spider_classes(module):
                self._spiders[spcls.name] = spcls
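The excerpt above boils down to a name-to-class registry filled by scanning modules. The following is a minimal, standard-library-only sketch of that idea, not Scrapy's code: `Spider`, `SpiderRegistry`, the simplified `iter_spider_classes` and the in-memory module `demo_spiders` are all stand-ins invented for illustration.

```python
import inspect
import types


class Spider(object):
    """Stand-in for Scrapy's spider base class; only `name` matters here."""
    name = None


def iter_spider_classes(module):
    """Yield Spider subclasses defined in `module` that declare a name.

    Simplified assumption: a class counts as a spider if it subclasses
    Spider, was defined in this module, and has a non-empty `name`.
    """
    for obj in vars(module).values():
        if (inspect.isclass(obj)
                and issubclass(obj, Spider)
                and obj.__module__ == module.__name__
                and getattr(obj, 'name', None)):
            yield obj


class SpiderRegistry(object):
    """Maps spider names to spider classes, like SpiderManager._spiders."""

    def __init__(self, modules):
        self._spiders = {}
        for module in modules:
            for spcls in iter_spider_classes(module):
                self._spiders[spcls.name] = spcls

    def list(self):
        return sorted(self._spiders)

    def create(self, name):
        return self._spiders[name]()


# Build a throwaway module in memory instead of importing a real package.
demo = types.ModuleType('demo_spiders')
demo.Spider = Spider
exec(
    "class BlogSpider(Spider):\n"
    "    name = 'blog'\n"
    "class Helper(Spider):\n"
    "    pass\n",
    demo.__dict__,
)

registry = SpiderRegistry([demo])
print(registry.list())
```

`Helper` is skipped because it never sets `name`, and the imported base class is skipped because it was not defined in `demo_spiders`; only `BlogSpider` ends up in the registry under the key `'blog'`.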

1.2 cmdline.py command.py

cmdline.py defines the public function execute(argv=None, settings=None).

execute is the entry point of the scrapy command-line tool, as the launcher script shows:

    XiaoKL$ cat `which scrapy`
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re
    import sys

    from scrapy.cmdline import execute

    if __name__ == '__main__':
        sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
        sys.exit(execute())
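The only non-obvious line in the launcher is the `re.sub` call on `sys.argv[0]`. A small sketch of what that pattern does to a few program names follows; the helper name `clean_argv0` is made up for this example, and the sample names are assumptions about what a Windows-style wrapper might pass in.

```python
import re

# Same pattern as in the launcher script.
SUFFIX = r'(-script\.pyw|\.exe)?$'


def clean_argv0(argv0):
    """Strip a trailing '-script.pyw' or '.exe' from the program name."""
    return re.sub(SUFFIX, '', argv0)


for name in ('scrapy', 'scrapy.exe', 'scrapy-script.pyw'):
    print(name, '->', clean_argv0(name))
```

All three inputs come out as plain `scrapy`, so usage and help messages show the same program name whichever wrapper started the process.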

This makes execute() a natural starting point for reading the Scrapy source. Here is the function:

    def execute(argv=None, settings=None):
        if argv is None:
            argv = sys.argv

        # --- backwards compatibility for scrapy.conf.settings singleton ---
        if settings is None and 'scrapy.conf' in sys.modules:
            from scrapy import conf
            if hasattr(conf, 'settings'):
                settings = conf.settings
        # ------------------------------------------------------------------

        if settings is None:
            settings = get_project_settings()
        check_deprecated_settings(settings)

        # --- backwards compatibility for scrapy.conf.settings singleton ---
        import warnings
        from scrapy.exceptions import ScrapyDeprecationWarning
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", ScrapyDeprecationWarning)
            from scrapy import conf
            conf.settings = settings
        # ------------------------------------------------------------------

        inproject = inside_project()
        cmds = _get_commands_dict(settings, inproject)
        cmdname = _pop_command_name(argv)
        parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
            conflict_handler='resolve')
        if not cmdname:
            _print_commands(settings, inproject)
            sys.exit(0)
        elif cmdname not in cmds:
            _print_unknown_command(settings, cmdname, inproject)
            sys.exit(2)

        cmd = cmds[cmdname]
        parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
        parser.description = cmd.long_desc()
        settings.setdict(cmd.default_settings, priority='command')
        cmd.settings = settings
        cmd.add_options(parser)
        opts, args = parser.parse_args(args=argv[1:])
        _run_print_help(parser, cmd.process_options, args, opts)

        cmd.crawler_process = CrawlerProcess(settings)
        _run_print_help(parser, _run_command, cmd, args, opts)
        sys.exit(cmd.exitcode)

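execute() does four main things: it obtains the settings, loads the scrapy command modules and picks the command named on the command line, parses the remaining command-line arguments, and creates a CrawlerProcess object.

The CrawlerProcess object, the settings and the parsed command-line arguments are all handed to an instance of ScrapyCommand (or one of its subclasses).

So the next module to look at is the one that defines ScrapyCommand: command.py.

The subclasses of ScrapyCommand are defined in the sub-package scrapy.commands.

The call that finally reaches cmd.run() and executes the command goes through _run_print_help():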

    def _run_print_help(parser, func, *a, **kw):
        try:
            func(*a, **kw)
        except UsageError as e:
            if str(e):
                parser.error(str(e))
            if e.print_help:
                parser.print_help()
            sys.exit(2)

In the second call inside execute(), func is _run_command, whose job is essentially to call cmd.run():

    def _run_command(cmd, args, opts):
        if opts.profile or opts.lsprof:
            _run_command_profiled(cmd, args, opts)
        else:
            cmd.run(args, opts)

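cmdline.py knows nothing about any individual command; it only relies on the ScrapyCommand interface. This decoupling is worth borrowing in our own designs.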
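The decoupling described here can be sketched in a few lines using only the standard library. Everything below is illustrative rather than Scrapy's implementation: `Command`, `EchoCommand`, the `COMMANDS` dict and the local `UsageError` class are hypothetical names, and this `execute` returns the exit code instead of calling `sys.exit` so that it is easy to try out.

```python
import optparse


class UsageError(Exception):
    """Raised by a command when it is called with bad arguments."""


class Command(object):
    """Hypothetical base class playing the role of ScrapyCommand."""
    exitcode = 0

    def syntax(self):
        return ""

    def add_options(self, parser):
        pass

    def run(self, args, opts):
        raise NotImplementedError


class EchoCommand(Command):
    """A toy command; the dispatcher below never refers to it by name."""

    def syntax(self):
        return "<word> [word ...]"

    def add_options(self, parser):
        parser.add_option("--upper", action="store_true", default=False)

    def run(self, args, opts):
        if not args:
            raise UsageError("echo needs at least one word")
        text = " ".join(args)
        self.output = text.upper() if opts.upper else text


# The dispatcher only sees this name -> class mapping.
COMMANDS = {"echo": EchoCommand}


def execute(argv):
    """Look up argv[1] in COMMANDS, parse the rest, run the command."""
    if len(argv) < 2 or argv[1] not in COMMANDS:
        return None, 2
    cmdname = argv[1]
    cmd = COMMANDS[cmdname]()
    parser = optparse.OptionParser(
        usage="demo %s %s" % (cmdname, cmd.syntax()))
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[2:])
    try:
        cmd.run(args, opts)
    except UsageError:
        cmd.exitcode = 2
    return cmd, cmd.exitcode


cmd, code = execute(["demo", "echo", "--upper", "hi", "there"])
print(code, cmd.output)
```

Adding a new command means adding one class and one dictionary entry; `execute` itself does not change, which is the property the original design gets from its command lookup.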

command.py defines the class ScrapyCommand, the base class of all Scrapy commands. A quick look at the interface it provides:

    class ScrapyCommand(object):

        requires_project = False
        crawler_process = None

        # default settings to be used for this command instead of global defaults
        default_settings = {}

        exitcode = 0

        def __init__(self):
            self.settings = None  # set in scrapy.cmdline

        def set_crawler(self, crawler):
            assert not hasattr(self, '_crawler'), "crawler already set"
            self._crawler = crawler

        @property
        def crawler(self):
            warnings.warn("Command's default `crawler` is deprecated and will be removed. "
                "Use `create_crawler` method to instatiate crawlers.",
                ScrapyDeprecationWarning)

            if not hasattr(self, '_crawler'):
                crawler = self.crawler_process.create_crawler()

                old_start = crawler.start
                self.crawler_process.started = False

                def wrapped_start():
                    if self.crawler_process.started:
                        old_start()
                    else:
                        self.crawler_process.started = True
                        self.crawler_process.start()

                crawler.start = wrapped_start

                self.set_crawler(crawler)

            return self._crawler

        # Remaining methods, signatures only (bodies omitted in this excerpt):
        def syntax(self):
        def short_desc(self):
        def long_desc(self):
        def help(self):
        def add_options(self, parser):
        def process_options(self, args, opts):
        def run(self, args, opts):

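Class attributes of ScrapyCommand:

requires_project: whether the command must be run inside a Scrapy project
crawler_process: the CrawlerProcess object, set by execute() in cmdline.py

Methods of ScrapyCommand worth particular attention:

def crawler(self): a property that creates the Crawler object lazily, on first access; the code marks it as deprecated in favor of create_crawler.
def run(self, args, opts): must be overridden by subclasses.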
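The lazy creation behind the `crawler` property, together with the set-once guard in `set_crawler`, can be isolated in a short sketch. `FakeCrawlerProcess` is a made-up stand-in that only counts how many crawlers it was asked to create, and the deprecation warning and the `start` wrapping from the real property are left out on purpose.

```python
class FakeCrawlerProcess(object):
    """Hypothetical stand-in for CrawlerProcess; counts created crawlers."""

    def __init__(self):
        self.created = 0

    def create_crawler(self):
        self.created += 1
        return object()


class Command(object):
    crawler_process = None

    def set_crawler(self, crawler):
        # Set-once guard: a second call fails instead of silently rebinding.
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    @property
    def crawler(self):
        # Nothing is created until the attribute is first read.
        if not hasattr(self, '_crawler'):
            self.set_crawler(self.crawler_process.create_crawler())
        return self._crawler


cmd = Command()
cmd.crawler_process = FakeCrawlerProcess()
first = cmd.crawler
second = cmd.crawler
print(first is second, cmd.crawler_process.created)
```

Reading the property twice returns the same object and creates only one crawler; calling `set_crawler` again afterwards trips the assertion.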
 

For a concrete subclass of ScrapyCommand, see Python.Scrapy.14-scrapy-source-code-analysis-part-4.

To Be Continued:

The next part analyzes the modules signals.py, signalmanager.py, project.py and conf.py: Python.Scrapy.12-scrapy-source-code-analysis-part-2
