Python: Scrapy downloader middlewares
Key points
Usage:
A downloader middleware is written the same way as a pipeline: define a class, then enable it in settings.py. The default methods of a downloader middleware are:
process_request(self, request, spider):
Called for each request as it passes through the downloader middleware.
process_response(self, request, response, spider):
Called when the downloader has finished the HTTP request and is passing the response back to the engine.
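A minimal skeleton illustrating the two hooks (the class name is a placeholder, not from the original project): returning None from process_request lets the request continue on to the downloader, while process_response must return a Response (or a Request to be rescheduled):

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Returning None means: continue processing this request.
        # Returning a Response here would skip the download entirely.
        return None

    def process_response(self, request, response, spider):
        # Must return a Response object (or a Request to reschedule).
        return response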
1. Official documentation
https://docs.scrapy.org/
2. settings.py: the USER_AGENTS pool
# -*- coding: utf-8 -*-

# Scrapy settings for zjh project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zjh'

SPIDER_MODULES = ['zjh.spiders']
NEWSPIDER_MODULE = 'zjh.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zjh (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zjh.middlewares.ZjhSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Enable the downloader middlewares defined in middlewares.py
DOWNLOADER_MIDDLEWARES = {
    'zjh.middlewares.RandomUserAgentMiddleware': 543,
    'zjh.middlewares.CheckUserAgent': 544,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'zjh.pipelines.ZjhPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# User-Agent pool used by RandomUserAgentMiddleware
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]
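A note on the priority numbers in DOWNLOADER_MIDDLEWARES: middlewares with lower values sit closer to the engine, so process_request hooks run in ascending order (RandomUserAgentMiddleware at 543 sets the header first) and process_response hooks run in descending order (CheckUserAgent at 544 sees the response before anything at 543 would).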
3. middlewares.py: the middleware classes that use the pool
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals  # unused here; left over from the generated template
import random


class RandomUserAgentMiddleware:
    """
    Downloader middlewares are typically used for anti-bot work:
    rotating proxy IPs and custom User-Agent pools.
    """
    def process_request(self, request, spider):
        # Pick a random User-Agent from the USER_AGENTS pool in settings.py
        ua = random.choice(spider.settings.get("USER_AGENTS"))
        request.headers["User-Agent"] = ua


class CheckUserAgent:
    def process_response(self, request, response, spider):
        # Inspect what is available on the request that produced this response
        print(dir(response.request))
        # Print the User-Agent header that was actually sent
        print(request.headers["User-Agent"])
        # The return statement is required: it hands the response back
        # through the engine to the spider.
        return response
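To confirm that the rotated User-Agent is actually sent on the wire, a throwaway spider can fetch an echo endpoint. This is a sketch only; the spider name and the httpbin.org URL are assumptions for illustration, not part of the original project:

# ua_check.py -- illustrative spider for verifying User-Agent rotation
import scrapy

class UACheckSpider(scrapy.Spider):
    name = "ua_check"
    # httpbin echoes back the User-Agent header it received;
    # start_urls requests are not dupe-filtered, so repeats are fine
    start_urls = ["https://httpbin.org/user-agent"] * 5

    def parse(self, response):
        self.logger.warning(response.text)

Run it with scrapy crawl ua_check; each response body should show a User-Agent drawn from the pool.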
4. Further reading
a) Rotating User-Agents
b) Rotating proxy IPs (a minimal sketch follows below)
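For b), the same process_request hook applies: Scrapy's built-in HttpProxyMiddleware honours the "proxy" key in request.meta. A minimal sketch, assuming a PROXIES list is defined in settings.py (the setting name and values are hypothetical):

import random

class RandomProxyMiddleware:
    # Assumes settings.py defines something like:
    # PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]
    def process_request(self, request, spider):
        proxy = random.choice(spider.settings.get("PROXIES"))
        # The downloader will route this request through the chosen proxy
        request.meta["proxy"] = proxy

Like RandomUserAgentMiddleware, it would be enabled by adding it to DOWNLOADER_MIDDLEWARES with a suitable priority.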