Scrapy custom middleware: retrying through a proxy on specific response status codes

0. References

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy

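1. Main idea

In practice, when a crawler sends requests too frequently, it is often temporarily redirected to a login page (302) or even refused outright (403). One way to handle this is to retry such responses once through a proxy:

(1) Following Scrapy's built-in redirect.py module, pass the response through unchanged when dont_redirect, handle_httpstatus_list, or a similar condition applies.

(2) If condition (1) does not apply and the response status code is 302 or 403, re-issue the request through a proxy.

(3) If the status code is still 302 or 403 after going through the proxy, drop the request.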
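
The three rules above can be sketched as a small pure function, independent of Scrapy. This is only an illustration: the function name decide, its parameters, and the returned labels are hypothetical and are not part of Scrapy or of the middleware in the next section.

```python
def decide(status, meta, spider_statuses=(), proxy_status=(302, 403)):
    """Return 'pass', 'retry_with_proxy', or 'drop' for one response.

    Hypothetical helper: `meta` stands in for request.meta and
    `spider_statuses` for the spider's handle_httpstatus_list.
    """
    # (1) Same opt-out conditions that the built-in redirect module honours
    if (meta.get('dont_redirect', False)
            or status in spider_statuses
            or status in meta.get('handle_httpstatus_list', [])
            or meta.get('handle_httpstatus_all', False)):
        return 'pass'
    # Status codes outside the configured list are not touched
    if status not in proxy_status:
        return 'pass'
    # (2) First offence: retry once through the proxy
    if not meta.get('auto_proxy'):
        return 'retry_with_proxy'
    # (3) Still blocked after the proxy retry: give up
    return 'drop'


print(decide(302, {}))                    # retry_with_proxy
print(decide(403, {'auto_proxy': True}))  # drop
print(decide(302, {'dont_redirect': True}))  # pass
```

The auto_proxy flag is what limits the retry to a single attempt: once it is set on the request, a second 302 or 403 falls through to rule (3).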

2. Implementation

Save the following as /site-packages/my_middlewares.py

from urllib.parse import urljoin

from scrapy.exceptions import IgnoreRequest
from w3lib.url import safe_url_string


class MyAutoProxyDownloaderMiddleware(object):

    def __init__(self, settings):
        # Status codes that trigger one retry through the proxy
        self.proxy_status = settings.get('PROXY_STATUS', [302, 403])
        # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
        self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(settings=crawler.settings)

    # See /site-packages/scrapy/downloadermiddlewares/redirect.py
    def process_response(self, request, response, spider):
        # Pass the response through untouched when redirects are disabled
        # or the status code is explicitly handled by the spider or request
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        if response.status in self.proxy_status:
            # Resolve the redirect target; it is used only in the log message
            if 'Location' in response.headers:
                location = safe_url_string(response.headers['location'])
                redirected_url = urljoin(request.url, location)
            else:
                redirected_url = ''

            # AutoProxy for first time
            if not request.meta.get('auto_proxy'):
                request.meta.update({'auto_proxy': True, 'proxy': self.proxy_config})
                new_request = request.replace(meta=request.meta, dont_filter=True)
                new_request.priority = request.priority + 2
                spider.log('Will AutoProxy for <{} {}> {}'.format(
                    response.status, request.url, redirected_url))
                return new_request
            # IgnoreRequest for second time
            else:
                spider.logger.warning('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                    response.status, request.url, self.proxy_status))
                raise IgnoreRequest

        return response
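
The middleware resolves the Location header against the request URL with urljoin before logging it. As a small standalone illustration of that step (standard library only; the example.com URL is a made-up placeholder), the following shows how relative and absolute Location values are resolved:

```python
from urllib.parse import urljoin

request_url = 'http://httpbin.org/status/302'

# A relative Location is resolved against the request URL
relative = urljoin(request_url, '/redirect/1')
print(relative)   # http://httpbin.org/redirect/1

# An absolute Location replaces the request URL entirely
absolute = urljoin(request_url, 'https://example.com/login')
print(absolute)   # https://example.com/login
```

The first result matches the redirect target that appears in the AutoProxy log line for http://httpbin.org/status/302 in the results section.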

3. Usage

(1) Add the following to the project's settings.py. Note that the order value of this middleware must fall between those of the default RedirectMiddleware and HttpProxyMiddleware.

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

PROXY_STATUS = [302, 403]
PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
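
To see why the order value matters, recall that Scrapy calls process_request in increasing order and process_response in decreasing order. The sketch below uses a plain dictionary, not Scrapy itself, with the order values from the settings above; the variable names are illustrative only.

```python
# Order values as listed in the settings above (600 and 750 are the defaults
# shown in the commented-out lines; 601 is this middleware)
middlewares = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# process_request runs in increasing order (engine -> downloader)
request_order = sorted(middlewares, key=middlewares.get)
# process_response runs in decreasing order (downloader -> engine)
response_order = list(reversed(request_order))

for name in response_order:
    print(middlewares[name], name)
```

On the response path the custom middleware (601) therefore sees a 302 before RedirectMiddleware (600) gets a chance to follow it, which is presumably why the value has to be greater than 600.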

4. Results

2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)

2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for < http://httpbin.org/status/302> http://httpbin.org/redirect/1

2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for < https://httpbin.org/status/403>

2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy

2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy

Proxy server log:

squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT

squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT
