When you run scrapy crawl <spider>, Scrapy builds a crawl command object and executes it through the execute function in cmdline.py. execute attaches a crawler_process attribute to the command object (cmd.crawler_process = CrawlerProcess(settings)). When CrawlerProcess.crawl is called, it builds a Crawler for the requested spider and ends up calling that Crawler's crawl method.
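
The same machinery can be driven from a script instead of the command line. A minimal sketch (not taken from the source being read here; it assumes a Scrapy project that defines a spider named "quotes"):

# Sketch: the programmatic equivalent of `scrapy crawl quotes`.
# "quotes" is a hypothetical spider name registered in the project.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # what execute() assigns to cmd.crawler_process
process.crawl("quotes")   # the string is resolved by _create_crawler() via the spider loader
process.start()           # starts the Twisted reactor and blocks until the crawl finishes

The pieces of scrapy/crawler.py involved in that call look like this: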

def _create_crawler(self, spidercls):
    if isinstance(spidercls, six.string_types):
        spidercls = self.spider_loader.load(spidercls)
    return Crawler(spidercls, self.settings)  # build a Crawler for this spider class

class Crawler(object):  # the crawler object that wraps a spider class

    def __init__(self, spidercls, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        self.spidercls = spidercls  # the spider class this crawler will run
        self.settings = settings.copy()
        self.spidercls.update_settings(self.settings)

        self.signals = SignalManager(self)
        self.stats = load_object(self.settings['STATS_CLASS'])(self)

        handler = LogCounterHandler(self, level=self.settings.get('LOG_LEVEL'))
        logging.root.addHandler(handler)
        if get_scrapy_root_handler() is not None:
            # scrapy root handler already installed: update it with new settings
            install_scrapy_root_handler(self.settings)
        # lambda is assigned to Crawler attribute because this way it is not
        # garbage collected after leaving __init__ scope
        self.__remove_handler = lambda: logging.root.removeHandler(handler)
        self.signals.connect(self.__remove_handler, signals.engine_stopped)

        lf_cls = load_object(self.settings['LOG_FORMATTER'])
        self.logformatter = lf_cls.from_crawler(self)
        self.extensions = ExtensionManager.from_crawler(self)

        self.settings.freeze()
        self.crawling = False  # whether this crawler is currently crawling
        self.spider = None
        self.engine = None  # the engine, which drives the spider, scheduler and downloader

    @property
    def spiders(self):
        if not hasattr(self, '_spiders'):
            warnings.warn("Crawler.spiders is deprecated, use "
                          "CrawlerRunner.spider_loader or instantiate "
                          "scrapy.spiderloader.SpiderLoader with your "
                          "settings.",
                          category=ScrapyDeprecationWarning, stacklevel=2)
            self._spiders = _get_spider_loader(self.settings.frozencopy())
        return self._spiders

    @defer.inlineCallbacks
    def crawl(self, *args, **kwargs):  # crawl() creates the spider instance and the engine
        assert not self.crawling, "Crawling already taking place"
        self.crawling = True

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            start_requests = iter(self.spider.start_requests())
            yield self.engine.open_spider(self.spider, start_requests)  # the engine opens the spider
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            # In Python 2 reraising an exception after yield discards
            # the original traceback (see http://bugs.python.org/issue7563),
            # so sys.exc_info() workaround is used.
            # This workaround also works in Python 3, but it is not needed,
            # and it is slower, so in Python 3 we use native `raise`.
            if six.PY2:
                exc_info = sys.exc_info()

            self.crawling = False
            if self.engine is not None:
                yield self.engine.close()

            if six.PY2:
                six.reraise(*exc_info)

            raise
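
Because crawl() is decorated with defer.inlineCallbacks, calling it returns a Deferred rather than blocking. A small sketch of waiting on that Deferred through CrawlerRunner (MySpider is a throwaway spider defined only to make the example self-contained):

# Sketch: run one crawl via CrawlerRunner and react to the Deferred
# returned by Crawler.crawl(). MySpider is a hypothetical example spider.
from twisted.internet import reactor
from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(Spider):
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        self.logger.info("got %d bytes", len(response.body))

configure_logging()
runner = CrawlerRunner()
d = runner.crawl(MySpider)           # _create_crawler(MySpider) -> crawler.crawl()
d.addBoth(lambda _: reactor.stop())  # stop the reactor whether the crawl succeeds or fails
reactor.run()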

scrapy/core/engine.py

def _next_request_from_scheduler(self, spider):
    slot = self.slot
    request = slot.scheduler.next_request()  # pop the next request from the scheduler
    if not request:
        return
    d = self._download(request, spider)  # download it; returns a Deferred
    d.addBoth(self._handle_downloader_output, request, spider)
    d.addErrback(lambda f: logger.info('Error while handling downloader output',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    d.addBoth(lambda _: slot.remove_request(request))
    d.addErrback(lambda f: logger.info('Error while removing request from slot',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    d.addBoth(lambda _: slot.nextcall.schedule())
    d.addErrback(lambda f: logger.info('Error while scheduling new request',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    return d

def _handle_downloader_output(self, response, request, spider):
    assert isinstance(response, (Request, Response, Failure)), response
    # downloader middleware can return requests (for example, redirects)
    if isinstance(response, Request):  # a Request came back: feed it to crawl() so it is scheduled again
        self.crawl(response, spider)
        return
    # response is a Response or Failure
    d = self.scraper.enqueue_scrape(response, request, spider)  # hand the crawl result to the scraper
    d.addErrback(lambda f: logger.error('Error while enqueuing downloader output',
                                        exc_info=failure_to_exc_info(f),
                                        extra={'spider': spider}))
    return d
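
Note the addBoth/addErrback pairs above: each step in the chain runs whether or not the previous one failed, and a failure inside a step is only logged, so the chain always reaches slot.nextcall.schedule(). A standalone sketch of that pattern, with made-up step names (not Scrapy code):

# Sketch of the addBoth/addErrback chaining used by the engine, outside Scrapy.
from twisted.internet import defer

def log_and_continue(f, label):
    # Consume the Failure so the later links in the chain still run.
    print("step %r failed: %s" % (label, f.value))

d = defer.succeed(None)
d.addBoth(lambda _: 1 / 0)                        # a step that raises
d.addErrback(log_and_continue, "handle output")   # failure logged, chain recovers
d.addBoth(lambda _: print("remove request"))      # still executes
d.addBoth(lambda _: print("schedule next call"))  # still executes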

Scrapy's asynchronous control flow is largely built on Twisted's inlineCallbacks.
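
A minimal, Scrapy-free sketch of what inlineCallbacks does (fake_download and crawl_one are invented for illustration): each yield suspends the generator until the yielded Deferred fires, which is exactly how Crawler.crawl chains open_spider and engine.start above.

from twisted.internet import defer, reactor, task

def fake_download(url):
    # Pretend this is an asynchronous download that completes after 0.1 s.
    return task.deferLater(reactor, 0.1, lambda: "<html>%s</html>" % url)

@defer.inlineCallbacks
def crawl_one(url):
    body = yield fake_download(url)  # the generator resumes when the Deferred fires
    defer.returnValue(len(body))     # becomes the result of the Deferred crawl_one() returns

d = crawl_one("http://example.com")
d.addCallback(lambda n: print("body length: %d" % n))
d.addBoth(lambda _: reactor.stop())
reactor.run()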

scrapy/core/scraper.py

def _scrape2(self, request_result, request, spider):
    """Handle the different cases of request's result being a Response or a
    Failure"""
    if not isinstance(request_result, Failure):
        return self.spidermw.scrape_response(
            self.call_spider, request_result, request, spider)
    else:
        # FIXME: don't ignore errors in spider middleware
        dfd = self.call_spider(request_result, request, spider)  # the result is a Failure: pass it straight to call_spider
        return dfd.addErrback(
            self._log_download_errors, request_result, request, spider)

def call_spider(self, result, request, spider):
    result.request = request
    dfd = defer_result(result)
    dfd.addCallbacks(request.callback or spider.parse, request.errback)  # call the request's callback (or the spider's parse) on success, request.errback on failure
    return dfd.addCallback(iterate_spider_output)
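
Whether your own callback runs therefore depends on what was set on the Request; if nothing was set, spider.parse is used, and request.errback receives the Failure when the download fails. A hedged example spider wiring up both (parse_item and handle_error are names invented here):

# Sketch of how request.callback and request.errback become the functions
# that call_spider() invokes. parse_item/handle_error are made-up names.
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # The start requests carry no explicit callback, so spider.parse runs.
        for href in response.css("article.product_pod h3 a::attr(href)").extract():
            yield response.follow(href, callback=self.parse_item,
                                  errback=self.handle_error)

    def parse_item(self, response):
        yield {"title": response.css("h1::text").extract_first()}

    def handle_error(self, failure):
        # Receives the Failure when downloading the request fails.
        self.logger.warning("request failed: %s", failure.value)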
