The environment is a Python 3.6 environment created with Anaconda.
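
If the environment does not exist yet, it can be created first (a minimal sketch; the name python36 simply matches the environment activated below):

conda create -n python36 python=3.6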

On macOS, activate it with:

source activate python36

mac@macdeMacBook-Pro:~$     source activate python36
(python36) mac@macdeMacBook-Pro:~$ cd /www
(python36) mac@macdeMacBook-Pro:/www$ scrapy startproject testMiddlewile
New Scrapy project 'testMiddlewile', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /www/testMiddlewile

You can start your first spider with:
    cd testMiddlewile
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:/www$ cd testMiddlewile/
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$ scrapy genspider -t crawl yeves yeves.cn
Created spider 'yeves' using template 'crawl' in module:
testMiddlewile.spiders.yeves
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$
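
The spiders/yeves.py generated by the crawl template looks roughly like this (reconstructed from the default template, so minor details such as the placeholder LinkExtractor rule may differ in your version):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YevesSpider(CrawlSpider):
    name = 'yeves'
    allowed_domains = ['yeves.cn']
    start_urls = ['http://yeves.cn/']

    # the template's placeholder rule: follow links matching the pattern
    # and hand matching pages to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # the template returns an empty item; fill in the fields you need
        item = {}
        return item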

  

Start the spider:

scrapy crawl yeves

 

(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$     scrapy crawl yeves
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: testMiddlewile)
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.7, Platform Darwin-17.7.0-x86_64-i386-64bit
2019-11-10 09:10:27 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'testMiddlewile', 'NEWSPIDER_MODULE': 'testMiddlewile.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testMiddlewile.spiders']}
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet Password: 29995a24067c48f8
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-10 09:10:27 [scrapy.core.engine] INFO: Spider opened
2019-11-10 09:10:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-10 09:10:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/robots.txt> from <GET http://yeves.cn/robots.txt>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 15 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 21 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 22 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
2019-11-10 09:10:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/> from <GET http://yeves.cn/>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at l

  

From the log output above we can see that Scrapy enables five spider middlewares by default:

2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
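
These defaults come from the SPIDER_MIDDLEWARES_BASE setting. Custom spider middlewares are registered (and built-in ones disabled) through SPIDER_MIDDLEWARES in settings.py; the class path and order value below are only illustrative:

# settings.py
SPIDER_MIDDLEWARES = {
    # hypothetical custom middleware; the number decides its position in the chain
    'testMiddlewile.middlewares.MySpiderMiddleware': 543,
    # setting a built-in middleware to None disables it
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}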

 

To read their source in PyCharm, first import them:

from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.spidermiddlewares.referer import RefererMiddleware
from scrapy.spidermiddlewares.urllength import UrlLengthMiddleware
from scrapy.spidermiddlewares.depth import DepthMiddleware

  

The offsite middleware

Hold Option and click the class name to jump into the OffsiteMiddleware source:

"""
Offsite Spider Middleware See documentation in docs/topics/spider-middleware.rst
"""
import re
import logging
import warnings from scrapy import signals
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached logger = logging.getLogger(__name__) class OffsiteMiddleware(object): def __init__(self, stats):
self.stats = stats @classmethod
def from_crawler(cls, crawler):
o = cls(crawler.stats)
crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
return o def process_spider_output(self, response, result, spider):
for x in result:
if isinstance(x, Request):
if x.dont_filter or self.should_follow(x, spider):
yield x
else:
domain = urlparse_cached(x).hostname
if domain and domain not in self.domains_seen:
self.domains_seen.add(domain)
logger.debug(
"Filtered offsite request to %(domain)r: %(request)s",
{'domain': domain, 'request': x}, extra={'spider': spider})
self.stats.inc_value('offsite/domains', spider=spider)
self.stats.inc_value('offsite/filtered', spider=spider)
else:
yield x def should_follow(self, request, spider):
regex = self.host_regex
# hostname can be None for wrong urls (like javascript links)
host = urlparse_cached(request).hostname or ''
return bool(regex.search(host)) def get_host_regex(self, spider):
"""Override this method to implement a different offsite policy"""
allowed_domains = getattr(spider, 'allowed_domains', None)
if not allowed_domains:
return re.compile('') # allow all by default
url_pattern = re.compile("^https?://.*$")
for domain in allowed_domains:
if url_pattern.match(domain):
message = ("allowed_domains accepts only domains, not URLs. "
"Ignoring URL entry %s in allowed_domains." % domain)
warnings.warn(message, URLWarning)
domains = [re.escape(d) for d in allowed_domains if d is not None]
regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
return re.compile(regex) def spider_opened(self, spider):
self.host_regex = self.get_host_regex(spider)
self.domains_seen = set() class URLWarning(Warning):
pass

__init__: initializes the middleware and stores the stats collector.

from_crawler: called by Scrapy's middleware manager to build the middleware instance; it also connects spider_opened to the spider_opened signal.

process_spider_output: processes the requests/items yielded by the spider and filters offsite requests.

should_follow: decides whether a request should be followed (its host matches the allowed-domains regex).

get_host_regex: builds the regex from allowed_domains.

spider_opened: signal handler that builds host_regex and resets domains_seen when the spider opens.

Call flow: from_crawler → __init__ → spider_opened → get_host_regex

The offsite middleware checks whether each requested URL belongs to the domains the spider is allowed to crawl, preventing the crawl from wandering off to other domains:

allowed_domains = ['yeves.cn']
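
With allowed_domains = ['yeves.cn'], get_host_regex builds the pattern ^(.*\.)?(yeves\.cn)$, so subdomains stay on-site and everything else is filtered. A quick illustrative check:

import re

host_regex = re.compile(r'^(.*\.)?(yeves\.cn)$')
print(bool(host_regex.search('yeves.cn')))      # True
print(bool(host_regex.search('www.yeves.cn')))  # True, subdomains are allowed
print(bool(host_regex.search('example.com')))   # False, filtered as offsite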

The Referer middleware exists mainly because some resources (images, for example) are only served when the request carries a Referer header identifying where it came from, such as Aliyun OSS buckets configured with hotlink protection.

It works by setting the URL of the parent response as the Referer of each new request.

The source code is as follows:

class RefererMiddleware(object):

    def __init__(self, settings=None):
        self.default_policy = DefaultReferrerPolicy
        if settings is not None:
            self.default_policy = _load_policy_class(
                settings.get('REFERRER_POLICY'))

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('REFERER_ENABLED'):
            raise NotConfigured
        mw = cls(crawler.settings)

        # Note: this hook is a bit of a hack to intercept redirections
        crawler.signals.connect(mw.request_scheduled, signal=signals.request_scheduled)

        return mw

    def policy(self, resp_or_url, request):
        """
        Determine Referrer-Policy to use from a parent Response (or URL),
        and a Request to be sent.

        - if a valid policy is set in Request meta, it is used.
        - if the policy is set in meta but is wrong (e.g. a typo error),
          the policy from settings is used
        - if the policy is not set in Request meta,
          but there is a Referrer-policy header in the parent response,
          it is used if valid
        - otherwise, the policy from settings is used.
        """
        policy_name = request.meta.get('referrer_policy')
        if policy_name is None:
            if isinstance(resp_or_url, Response):
                policy_header = resp_or_url.headers.get('Referrer-Policy')
                if policy_header is not None:
                    policy_name = to_native_str(policy_header.decode('latin1'))
        if policy_name is None:
            return self.default_policy()
        cls = _load_policy_class(policy_name, warning_only=True)
        return cls() if cls else self.default_policy()

    def process_spider_output(self, response, result, spider):
        def _set_referer(r):
            if isinstance(r, Request):
                referrer = self.policy(response, r).referrer(response.url, r.url)
                if referrer is not None:
                    r.headers.setdefault('Referer', referrer)
            return r
        return (_set_referer(r) for r in result or ())

    def request_scheduled(self, request, spider):
        # check redirected request to patch "Referer" header if necessary
        redirected_urls = request.meta.get('redirect_urls', [])
        if redirected_urls:
            request_referrer = request.headers.get('Referer')
            # we don't patch the referrer value if there is none
            if request_referrer is not None:
                # the request's referrer header value acts as a surrogate
                # for the parent response URL
                #
                # Note: if the 3xx response contained a Referrer-Policy header,
                # the information is not available using this hook
                parent_url = safe_url_string(request_referrer)
                policy_referrer = self.policy(parent_url, request).referrer(
                    parent_url, request.url)
                if policy_referrer != request_referrer:
                    if policy_referrer is None:
                        request.headers.pop('Referer')
                    else:
                        request.headers['Referer'] = policy_referrer
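
The policy the middleware applies can be changed with the REFERRER_POLICY setting (W3C policy names such as "same-origin" are accepted; the value below is only an example). Because process_spider_output uses setdefault, a Referer set explicitly on a request is kept, which helps with the OSS hotlink-protection case above; img_url and parse_img are made-up names:

# settings.py
REFERRER_POLICY = 'same-origin'

# in a spider callback: set the Referer yourself for a single request
yield scrapy.Request(img_url, headers={'Referer': response.url}, callback=self.parse_img)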

  

The hooks available to a spider middleware (the offsite middleware only uses process_spider_output; a skeleton follows below):

process_spider_input (3)

process_spider_output (2)

process_start_requests (1)

process_spider_exception
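
A custom spider middleware implements whichever of these hooks it needs. A minimal sketch (the method names follow the spider-middleware interface shown above; the class name is made up):

from scrapy import signals


class DemoSpiderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # build the instance and hook up signals, the same pattern OffsiteMiddleware uses
        mw = cls()
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def process_start_requests(self, start_requests, spider):
        # called with the spider's start requests, before they reach the engine
        for request in start_requests:
            yield request

    def process_spider_input(self, response, spider):
        # called for each response on its way into the spider; return None to continue
        return None

    def process_spider_output(self, response, result, spider):
        # called with the requests/items the spider yields for a response
        for item_or_request in result:
            yield item_or_request

    def process_spider_exception(self, response, exception, spider):
        # called when the spider (or a later middleware) raises an exception
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)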
