Python: Scrapy downloader middlewares
Key points
Usage:
A downloader middleware is written the same way as a pipeline: define a class, then enable it in settings.py. The default methods of a downloader middleware are:
process_request(self, request, spider):
Called for each request as it passes through the downloader middleware.
process_response(self, request, response, spider):
Called when the downloader has finished the HTTP request and is passing the response back to the engine.
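A minimal skeleton illustrating the two hooks (the class name is a placeholder, not from the original project): returning None from process_request lets the request continue on to the downloader, while process_response must return a Response (or a Request to be rescheduled):

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Returning None means: continue processing this request.
        # Returning a Response here would skip the download entirely.
        return None

    def process_response(self, request, response, spider):
        # Must return a Response object (or a Request to reschedule).
        return response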
1. Official documentation
https://docs.scrapy.org/
2. settings.py: the USER_AGENTS pool
# -*- coding: utf-8 -*-

# Scrapy settings for zjh project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zjh'

SPIDER_MODULES = ['zjh.spiders']
NEWSPIDER_MODULE = 'zjh.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zjh (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zjh.middlewares.ZjhSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Enable the downloader middlewares defined in middlewares.py
DOWNLOADER_MIDDLEWARES = {
    'zjh.middlewares.RandomUserAgentMiddleware': 543,
    'zjh.middlewares.CheckUserAgent': 544,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'zjh.pipelines.ZjhPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# User-Agent pool used by RandomUserAgentMiddleware
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]
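A note on the priority numbers in DOWNLOADER_MIDDLEWARES: middlewares with lower values sit closer to the engine, so process_request hooks run in ascending order (RandomUserAgentMiddleware at 543 sets the header first) and process_response hooks run in descending order (CheckUserAgent at 544 sees the response before anything at 543 would).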
3. middlewares.py: the middleware classes that use the pool
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals  # unused here; left over from the generated template
import random


class RandomUserAgentMiddleware:
    """
    Downloader middlewares are typically used for anti-bot work:
    rotating proxy IPs and custom User-Agent pools.
    """
    def process_request(self, request, spider):
        # Pick a random User-Agent from the USER_AGENTS pool in settings.py
        ua = random.choice(spider.settings.get("USER_AGENTS"))
        request.headers["User-Agent"] = ua


class CheckUserAgent:
    def process_response(self, request, response, spider):
        # Inspect what is available on the request that produced this response
        print(dir(response.request))
        # Print the User-Agent header that was actually sent
        print(request.headers["User-Agent"])
        # The return statement is required: it hands the response back
        # through the engine to the spider.
        return response
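To confirm that the rotated User-Agent is actually sent on the wire, a throwaway spider can fetch an echo endpoint. This is a sketch only; the spider name and the httpbin.org URL are assumptions for illustration, not part of the original project:

# ua_check.py -- illustrative spider for verifying User-Agent rotation
import scrapy

class UACheckSpider(scrapy.Spider):
    name = "ua_check"
    # httpbin echoes back the User-Agent header it received;
    # start_urls requests are not dupe-filtered, so repeats are fine
    start_urls = ["https://httpbin.org/user-agent"] * 5

    def parse(self, response):
        self.logger.warning(response.text)

Run it with scrapy crawl ua_check; each response body should show a User-Agent drawn from the pool.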
4. Further reading
a) Rotating User-Agents
b) Rotating proxy IPs (a minimal sketch follows below)
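For b), the same process_request hook applies: Scrapy's built-in HttpProxyMiddleware honours the "proxy" key in request.meta. A minimal sketch, assuming a PROXIES list is defined in settings.py (the setting name and values are hypothetical):

import random

class RandomProxyMiddleware:
    # Assumes settings.py defines something like:
    # PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]
    def process_request(self, request, spider):
        proxy = random.choice(spider.settings.get("PROXIES"))
        # The downloader will route this request through the chosen proxy
        request.meta["proxy"] = proxy

Like RandomUserAgentMiddleware, it would be enabled by adding it to DOWNLOADER_MIDDLEWARES with a suitable priority.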