Error: Filtered offsite request

This error came up while crawling pages iteratively with the Scrapy framework.
Scrapy log:

Set the log level in settings.py:

LOG_LEVEL = 'DEBUG'

LOG_FILE = 'log.txt'

Then inspect the Scrapy log:

2017-08-15 21:58:05 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sou.zhaopin.com': <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p=2>

2017-08-15 21:58:05 [scrapy.core.engine] INFO: Closing spider (finished)

2017-08-15 21:58:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 782,

 'downloader/request_count': 3,

 'downloader/request_method_count/GET': 3,

 'downloader/response_bytes': 58273,

 'downloader/response_count': 3,

 'downloader/response_status_count/200': 2,

 'downloader/response_status_count/302': 1,

 'finish_reason': 'finished',

 'finish_time': datetime.datetime(2017, 8, 15, 13, 58, 5, 915565),

 'item_scraped_count': 59,

 'log_count/DEBUG': 64,

 'log_count/INFO': 7,

 'memusage/max': 52699136,

 'memusage/startup': 52699136,

 'offsite/domains': 1,

 'offsite/filtered': 1,

 'request_depth_max': 1,

 'response_received_count': 2,

 'scheduler/dequeued': 1,

 'scheduler/dequeued/memory': 1,

 'scheduler/enqueued': 1,

 'scheduler/enqueued/memory': 1,

 'start_time': datetime.datetime(2017, 8, 15, 13, 58, 5, 98357)}

2017-08-15 21:58:05 [scrapy.core.engine] INFO: Spider closed (finished)

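The important line is the first one. When I started, I did not realize it was an error at all: it is only logged as a DEBUG message, so the program never raises anything.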

2017-08-15 21:58:05 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sou.zhaopin.com': <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p=2>
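Because the message is logged at DEBUG level and the spider still finishes normally, it is easy to miss. A minimal sketch of scanning the log for such lines is shown here. The helper name `find_offsite_requests` and the `SAMPLE_LOG` constant are hypothetical, and the pattern is only assumed to fit the message format shown in this log; other Scrapy versions may word it differently. The `'offsite/filtered': 1` entry in the stats dump is another quick hint that something was dropped.

```python
import re
from typing import Iterable, List, Tuple

# Matches the message format seen in the log above (an assumption, not a Scrapy API).
OFFSITE_RE = re.compile(r"Filtered offsite request to '([^']+)': <\w+ ([^>]+)>")

# Two lines copied from the log in this post.
SAMPLE_LOG = """\
2017-08-15 21:58:05 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sou.zhaopin.com': <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p=2>
2017-08-15 21:58:05 [scrapy.core.engine] INFO: Closing spider (finished)
"""


def find_offsite_requests(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (host, url) pairs for every filtered offsite request in the log lines."""
    hits = []
    for line in lines:
        match = OFFSITE_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


for host, url in find_offsite_requests(SAMPLE_LOG.splitlines()):
    print(host, url)
```

To check a real run, the same function can be fed the lines of log.txt, for example `find_offsite_requests(open('log.txt', encoding='utf-8'))`.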

DEBUG: Filtered offsite request to
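This happens because the URL in the Request does not match the domains defined in allowed_domains, so the request is filtered out and never sent. Note that the allowed_domains entry in the spider below includes the http:// scheme; Scrapy compares these entries against the host name of each request only, so an entry written as a full URL can never match.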

name = 'zhilianspider'

allowed_domains = ['http://sou.zhaopin.com']

page = 1

url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p='

start_urls = [url+str(page)]
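The mismatch can be illustrated with a simplified host check. This is only a sketch of the idea, not Scrapy's actual implementation; `host_regex` and `is_allowed` are hypothetical names. It assumes the filter compares each allowed entry (and its subdomains) against the request's host name, which is why an entry that carries the `http://` prefix never matches, while a bare domain does.

```python
import re
from urllib.parse import urlparse


def host_regex(allowed_domains):
    """Build a regex that matches each allowed domain and any of its subdomains."""
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(rf"^(.*\.)?({domains})$")


def is_allowed(url, allowed_domains):
    """Return True if the URL's host name matches one of the allowed domains."""
    if not allowed_domains:
        return True  # no restriction configured
    host = urlparse(url).hostname or ""
    return bool(host_regex(allowed_domains).search(host))


URL = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p=2"

print(is_allowed(URL, ["http://sou.zhaopin.com"]))  # False: the entry is a URL, not a domain
print(is_allowed(URL, ["sou.zhaopin.com"]))         # True
print(is_allowed(URL, ["zhaopin.com"]))             # True: subdomains are covered
```

Under that assumption, writing `allowed_domains = ['sou.zhaopin.com']` in the spider would be an alternative to the workaround described next, and it keeps the offsite filter active.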

Setting dont_filter=True in the Request arguments lets the requested URL bypass the allowed_domains filter.

if self.page <= 10:

    self.page += 1

    yield scrapy.Request(self.url + str(self.page), callback=self.parse, dont_filter=True)

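Because dont_filter=True turns filtering off (it bypasses the allowed_domains check and also the scheduler's duplicate filter), the yield has to go inside the if block. At first I wrote it outside, and the spider kept iterating and never stopped, which was embarrassing.
I used to write the yield at the same level as the if; filtering was still on back then, so it caused no problems.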
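The stopping behavior can be reproduced without Scrapy. The sketch below is a toy model, not the real engine: `Pager`, `crawl`, and `limit` are hypothetical names, and `crawl` simply calls the callback once per fetched URL with no duplicate filter, mimicking dont_filter=True. With the yield inside the if, the chain ends once the page counter passes 10. With the yield outside, the same last URL is requested again and again until the artificial limit cuts it off.

```python
from typing import Callable, Iterator, List

BASE_URL = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p="
MAX_PAGE = 10


class Pager:
    """Holds the page counter, like the spider's `page` attribute."""

    def __init__(self) -> None:
        self.page = 1

    def parse_guarded(self) -> Iterator[str]:
        # yield inside the if: nothing is produced once page > MAX_PAGE
        if self.page <= MAX_PAGE:
            self.page += 1
            yield BASE_URL + str(self.page)

    def parse_unguarded(self) -> Iterator[str]:
        # yield outside the if: a request is produced on every call
        if self.page <= MAX_PAGE:
            self.page += 1
        yield BASE_URL + str(self.page)


def crawl(callback: Callable[[], Iterator[str]], limit: int = 50) -> List[str]:
    """Toy engine: every fetched URL triggers the callback again, with no duplicate filter."""
    fetched: List[str] = []
    pending = list(callback())  # callback for the start URL's response
    while pending and len(fetched) < limit:
        fetched.append(pending.pop(0))
        pending.extend(callback())
    return fetched


print(len(crawl(Pager().parse_guarded)))    # 10: stops on its own
print(len(crawl(Pager().parse_unguarded)))  # 50: only the limit stops it
```

In the unguarded run, every request after the first nine targets the same final page, which is exactly the kind of repeat the duplicate filter would normally drop.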

Reader comment:

SPIDER_MIDDLEWARES = {

'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,

}

Setting this also seems to work.

Link: https://www.jianshu.com/p/c31e53fd45f6
Source: Jianshu
