Key points

How to use:
Writing a Downloader Middleware is just like writing a pipeline: define a class, then enable it in settings. The default methods of a Downloader Middleware are:

    process_request(self, request, spider):
        Called for each request as it passes through the downloader middleware.

    process_response(self, request, response, spider):
        Called when the downloader has finished the HTTP request and is passing the response back to the engine.
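As a minimal sketch (the class name is a placeholder, not part of the zjh project), a downloader middleware only needs to implement these hooks:

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        # Returning None lets the request continue through the chain.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the engine.
        # Must return a Response (or a Request, or raise IgnoreRequest).
        return response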

1、Official documentation

https://docs.scrapy.org/

2、The settings file, with the USER_AGENTS pool

# -*- coding: utf-8 -*-

# Scrapy settings for zjh project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zjh'

SPIDER_MODULES = ['zjh.spiders']
NEWSPIDER_MODULE = 'zjh.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zjh (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zjh.middlewares.ZjhSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Enable the downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'zjh.middlewares.RandomUserAgentMiddleware': 543,
    'zjh.middlewares.CheckUserAgent': 544,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'zjh.pipelines.ZjhPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]
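A note on the numbers in DOWNLOADER_MIDDLEWARES: lower values sit closer to the engine, and process_request hooks run in ascending order of that value, so RandomUserAgentMiddleware (543) sets the User-Agent before CheckUserAgent (544) sees the request; process_response hooks run in the reverse order.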

3、middlewares.py, the middleware code that applies the pool

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random

from scrapy import signals


class RandomUserAgentMiddleware:
    """
    Downloader middlewares are typically used to work around anti-scraping
    measures, e.g. rotating proxy IPs or custom USER_AGENTS.
    """
    def process_request(self, request, spider):
        # Pick a random User-Agent from the pool defined in settings.py
        ua = random.choice(spider.settings.get("USER_AGENTS"))
        request.headers["User-Agent"] = ua


class CheckUserAgent:
    def process_response(self, request, response, spider):
        print(dir(response.request))
        print(request.headers["User-Agent"])
        # The return is required: the response goes back through the engine
        # to the spider.
        return response
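With both middlewares enabled in DOWNLOADER_MIDDLEWARES, every request from any spider in the project goes out with a random User-Agent. As a quick check (the spider name and URL below are illustrative, not from the original project), httpbin echoes back the User-Agent header it received:

# spiders/demo.py -- hypothetical spider used only to verify the middlewares
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://httpbin.org/user-agent"]

    def parse(self, response):
        # With LOG_LEVEL = "WARNING", warning-level output is still visible;
        # the body should show one of the strings from USER_AGENTS.
        self.logger.warning(response.text)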

4、Further reference

  a) Rotating the User-Agent (covered by RandomUserAgentMiddleware above)

  b) Rotating proxy IPs (see the sketch below)
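As a minimal sketch of the proxy-IP case (the middleware name and the PROXIES setting are assumptions, not part of the zjh project), a downloader middleware can route a request through a proxy by setting request.meta["proxy"], which Scrapy's built-in HttpProxyMiddleware honours:

# middlewares.py -- hypothetical random-proxy middleware, analogous to RandomUserAgentMiddleware
import random


class RandomProxyMiddleware:
    def process_request(self, request, spider):
        # PROXIES would be a list defined in settings.py,
        # e.g. PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]
        proxies = spider.settings.get("PROXIES")
        if proxies:
            request.meta["proxy"] = random.choice(proxies)

It would be enabled in DOWNLOADER_MIDDLEWARES in the same way as the two middlewares above.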

  
