The Scrapy Framework

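Scrapy is an application framework, written in pure Python, for crawling websites and extracting structured data. It has a wide range of uses.
That is the power of a framework: by customizing just a few modules you can easily build a crawler that fetches web page content and images of all kinds.
Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework (whose main rival is Tornado) to handle network communication. This speeds up downloads without requiring us to implement an asynchronous framework ourselves, and its various middleware interfaces make it flexible enough to meet all kinds of needs.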

Scrapy architecture diagram (the green lines show the data flow):

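Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.
Scheduler: accepts the Requests sent by the engine, orders and enqueues them in a defined way, and hands them back to the engine when the engine asks for them.
Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.
Spider: processes all Responses, parses them to extract the data needed for the Item fields, and submits any URLs that need to be followed to the engine, which sends them back into the Scheduler.
Item Pipeline: the place where Items obtained by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: components that let you customize and extend the download behavior.
Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider (for example, Responses entering the Spider and Requests leaving it).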

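The above is the Scrapy architecture diagram. The flow is quite clear, so I will only describe it briefly: starting from the Spider (the red box), requests are sent through the engine to the scheduler; the scheduler then hands the requests to the downloader, which returns the results to the Spider once they have been processed; finally, the results are passed to the pipeline for processing.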
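The flow described above can be mimicked in a few lines of plain Python. This is only a toy sketch that does not use Scrapy itself: `fake_download`, `parse`, and `run_engine` are hypothetical names, the "pages" are hard-coded, and a simple queue stands in for the Scheduler.

```python
from collections import deque


def fake_download(url):
    # Stand-in for the Downloader: returns a canned "response" instead of doing network I/O.
    pages = {
        "page1": {"title": "First", "links": ["page2"]},
        "page2": {"title": "Second", "links": []},
    }
    return {"url": url, **pages[url]}


def parse(response):
    # Stand-in for the Spider: yields one item, then the follow-up requests.
    yield {"type": "item", "title": response["title"]}
    for link in response["links"]:
        yield {"type": "request", "url": link}


def run_engine(start_urls):
    # Stand-in for the Engine: moves data between the other components.
    scheduler = deque(start_urls)   # queue of pending requests (Scheduler)
    seen = set(start_urls)          # avoid scheduling the same URL twice
    stored = []                     # what the "pipeline" has kept
    while scheduler:
        url = scheduler.popleft()
        response = fake_download(url)
        for result in parse(response):
            if result["type"] == "request":
                if result["url"] not in seen:
                    seen.add(result["url"])
                    scheduler.append(result["url"])
            else:
                stored.append(result["title"])  # stand-in for the Item Pipeline
    return stored


print(run_engine(["page1"]))  # prints ['First', 'Second']
```

In real Scrapy the same loop runs asynchronously on top of Twisted, and the middlewares sit on the paths between the engine and the downloader or spider.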

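That description may still sound a little convoluted. We will work through it piece by piece, and in the end you will see how simple it is to write a crawler with the Scrapy framework.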

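Installing Scrapy

Windows: pip install scrapy

Mac: sudo pip install scrapy

Upgrading pip: pip install --upgrade pip

I am currently using a Mac with Python 3. The content is much the same on other setups; if you run into system or version problems, feel free to get in touch so we can learn from each other!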

After installation, type scrapy in the terminal to check whether the installation succeeded:
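The same check can also be done from Python. The sketch below is one possible way, assuming only the standard library's `importlib`; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("scrapy"):
    import scrapy
    print("Scrapy version:", scrapy.__version__)
else:
    print("Scrapy is not installed in this environment")
```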

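Creating a New Project

Once Scrapy is installed, we can use it to develop our crawler project. Change into a directory of your choice and run the following command:

scrapy startproject scrapyDemo

Running this command generates the following directory structure under our project directory: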

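Here is a brief introduction to what each of the main files does:

scrapy.cfg: the project's configuration file

scrapyDemo/: the project's Python module; code is imported from here

scrapyDemo/items.py: the project's item definitions file

scrapyDemo/middlewares.py: the project's middleware file

scrapyDemo/pipelines.py: the project's pipeline file

scrapyDemo/settings.py: the project's settings file

scrapyDemo/spiders/: the directory that holds the spider code

Next we will briefly go through the contents of each file. The code in them is currently just the minimal generated boilerplate; we will explain the files in more detail when we work through a concrete example.

The __init__.py files are all empty, but they must not be deleted: they mark the directories as Python packages, and without them the project may fail to start.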

scrapyDemo/items.py

 # -*- coding: utf-8 -*-

 # Define here the models for your scraped items
 #
 # See documentation in:
 # https://docs.scrapy.org/en/latest/topics/items.html

 import scrapy


 class ScrapydemoItem(scrapy.Item):
     # define the fields for your item here like:
     # name = scrapy.Field()
     pass

This file defines the useful information that our crawler collects, i.e. the scrapy.Item classes.

scrapyDemo/middlewares.py

 # -*- coding: utf-8 -*-

 # Define here the models for your spider middleware
 #
 # See documentation in:
 # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

 from scrapy import signals


 class ScrapydemoSpiderMiddleware(object):
     # Not all methods need to be defined. If a method is not defined,
     # scrapy acts as if the spider middleware does not modify the
     # passed objects.

     @classmethod
     def from_crawler(cls, crawler):
         # This method is used by Scrapy to create your spiders.
         s = cls()
         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
         return s

     def process_spider_input(self, response, spider):
         # Called for each response that goes through the spider
         # middleware and into the spider.

         # Should return None or raise an exception.
         return None

     def process_spider_output(self, response, result, spider):
         # Called with the results returned from the Spider, after
         # it has processed the response.

         # Must return an iterable of Request, dict or Item objects.
         for i in result:
             yield i

     def process_spider_exception(self, response, exception, spider):
         # Called when a spider or process_spider_input() method
         # (from other spider middleware) raises an exception.

         # Should return either None or an iterable of Request, dict
         # or Item objects.
         pass

     def process_start_requests(self, start_requests, spider):
         # Called with the start requests of the spider, and works
         # similarly to the process_spider_output() method, except
         # that it doesn’t have a response associated.

         # Must return only requests (not items).
         for r in start_requests:
             yield r

     def spider_opened(self, spider):
         spider.logger.info('Spider opened: %s' % spider.name)


 class ScrapydemoDownloaderMiddleware(object):
     # Not all methods need to be defined. If a method is not defined,
     # scrapy acts as if the downloader middleware does not modify the
     # passed objects.

     @classmethod
     def from_crawler(cls, crawler):
         # This method is used by Scrapy to create your spiders.
         s = cls()
         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
         return s

     def process_request(self, request, spider):
         # Called for each request that goes through the downloader
         # middleware.

         # Must either:
         # - return None: continue processing this request
         # - or return a Response object
         # - or return a Request object
         # - or raise IgnoreRequest: process_exception() methods of
         #   installed downloader middleware will be called
         return None

     def process_response(self, request, response, spider):
         # Called with the response returned from the downloader.

         # Must either;
         # - return a Response object
         # - return a Request object
         # - or raise IgnoreRequest
         return response

     def process_exception(self, request, exception, spider):
         # Called when a download handler or a process_request()
         # (from other downloader middleware) raises an exception.

         # Must either:
         # - return None: continue processing this exception
         # - return a Response object: stops process_exception() chain
         # - return a Request object: stops process_exception() chain
         pass

     def spider_opened(self, spider):
         spider.logger.info('Spider opened: %s' % spider.name)

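This is the middleware file. The trailing s in the name marks the plural, meaning the file can hold many middlewares; any middleware we use can be defined here.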
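As an illustration of what a custom downloader middleware might look like, here is a minimal sketch that fills in a User-Agent header when a request does not already carry one. The class name `DemoUserAgentMiddleware` and the `FakeRequest` stand-in are hypothetical and exist only so the example runs without Scrapy installed; the sketch relies on the fact that `process_request` may return None to let processing continue.

```python
class DemoUserAgentMiddleware:
    """Hypothetical downloader middleware: sets a User-Agent when none is present."""

    def __init__(self, user_agent="scrapyDemo (+http://www.yourdomain.com)"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # setdefault leaves a header that was already set on the request untouched.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to continue processing this request.
        return None


class FakeRequest:
    """Stand-in for a Scrapy request, only for trying the sketch outside Scrapy."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


req = FakeRequest()
DemoUserAgentMiddleware().process_request(req, spider=None)
print(req.headers["User-Agent"])  # prints scrapyDemo (+http://www.yourdomain.com)
```

Inside a real project such a class would live in middlewares.py and would only take effect after being listed in DOWNLOADER_MIDDLEWARES in settings.py, in the same way as the generated ScrapydemoDownloaderMiddleware entry.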

scrapyDemo/pipelines.py

 # -*- coding: utf-8 -*-

 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


 class ScrapydemoPipeline(object):
     def process_item(self, item, spider):
         return item

This file is commonly called the pipeline file; it receives our Item data and processes it as needed.
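To make "processes it as needed" more concrete, here is a minimal sketch of a pipeline that trims whitespace from string fields and appends each item to a JSON Lines file. The class name `JsonLinesPipeline` and the default file name `items.jl` are assumptions made for this example; the `open_spider`, `close_spider`, and `process_item` method names follow Scrapy's pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: cleans string fields and writes one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Strip surrounding whitespace from string values; leave other values as they are.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        # Return the item so that any later pipeline can keep processing it.
        return cleaned
```

Like the generated ScrapydemoPipeline, a class of this kind only runs once it is registered in the ITEM_PIPELINES setting.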

scrapyDemo/settings.py

 # -*- coding: utf-8 -*-

 # Scrapy settings for scrapyDemo project
 #
 # For simplicity, this file contains only settings considered important or
 # commonly used. You can find more settings consulting the documentation:
 #
 #     https://docs.scrapy.org/en/latest/topics/settings.html
 #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
 #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

 BOT_NAME = 'scrapyDemo'

 SPIDER_MODULES = ['scrapyDemo.spiders']
 NEWSPIDER_MODULE = 'scrapyDemo.spiders'

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 #USER_AGENT = 'scrapyDemo (+http://www.yourdomain.com)'

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = True

 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 #CONCURRENT_REQUESTS = 32

 # Configure a delay for requests for the same website (default: 0)
 # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
 # See also autothrottle settings and docs
 #DOWNLOAD_DELAY = 3
 # The download delay setting will honor only one of:
 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 #CONCURRENT_REQUESTS_PER_IP = 16

 # Disable cookies (enabled by default)
 #COOKIES_ENABLED = False

 # Disable Telnet Console (enabled by default)
 #TELNETCONSOLE_ENABLED = False

 # Override the default request headers:
 #DEFAULT_REQUEST_HEADERS = {
 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 #   'Accept-Language': 'en',
 #}

 # Enable or disable spider middlewares
 # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 #SPIDER_MIDDLEWARES = {
 #    'scrapyDemo.middlewares.ScrapydemoSpiderMiddleware': 543,
 #}

 # Enable or disable downloader middlewares
 # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
 #DOWNLOADER_MIDDLEWARES = {
 #    'scrapyDemo.middlewares.ScrapydemoDownloaderMiddleware': 543,
 #}

 # Enable or disable extensions
 # See https://docs.scrapy.org/en/latest/topics/extensions.html
 #EXTENSIONS = {
 #    'scrapy.extensions.telnet.TelnetConsole': None,
 #}

 # Configure item pipelines
 # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
 #ITEM_PIPELINES = {
 #    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
 #}

 # Enable and configure the AutoThrottle extension (disabled by default)
 # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
 #AUTOTHROTTLE_ENABLED = True
 # The initial download delay
 #AUTOTHROTTLE_START_DELAY = 5
 # The maximum download delay to be set in case of high latencies
 #AUTOTHROTTLE_MAX_DELAY = 60
 # The average number of requests Scrapy should be sending in parallel to
 # each remote server
 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

 # Enable and configure HTTP caching (disabled by default)
 # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 #HTTPCACHE_ENABLED = True
 #HTTPCACHE_EXPIRATION_SECS = 0
 #HTTPCACHE_DIR = 'httpcache'
 #HTTPCACHE_IGNORE_HTTP_CODES = []
 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

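This is our settings file, where the basic settings are configured. For example, the two classes from our middleware file, ScrapydemoSpiderMiddleware and ScrapydemoDownloaderMiddleware, can both be found referenced in settings.py.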

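In the settings file we often configure fields such as those shown above: ITEM_PIPELINES (item pipelines), DEFAULT_REQUEST_HEADERS (request headers), DOWNLOAD_DELAY (download delay), ROBOTSTXT_OBEY (whether to obey the robots.txt protocol), and so on.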
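As a rough illustration, a settings.py fragment with these four fields switched on might look like the sketch below. The values are examples only, not recommendations, and the pipeline path simply reuses the class generated by the project template.

```python
# settings.py (fragment) with example values only

# Respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait about 2 seconds between requests to the same website.
DOWNLOAD_DELAY = 2

# Headers sent with every request unless a request overrides them.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable the generated pipeline; the number sets its order relative to other
# pipelines (lower values run first).
ITEM_PIPELINES = {
    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
}
```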

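In this chapter we only briefly introduced Scrapy's basic directory layout; in the next chapter we will build a crawler example with the Scrapy framework.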
