Scrapy shell debugging returns a 403 error
1. Problem description
The scrapy shell is very convenient for debugging, but some sites have anti-crawler measures in place, so a scrapy shell request may come back with a 403, as in the example below:
C:\Users\fendo>scrapy shell https://book.douban.com/subject/26805083/
2017-04-17 15:18:53 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-17 15:18:53 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-04-17 15:18:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-17 15:18:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-17 15:18:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-17 15:18:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-17 15:18:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-17 15:18:54 [scrapy.core.engine] INFO: Spider opened
2017-04-17 15:18:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/subject/26805083/> (referer: None)
2017-04-17 15:18:54 [traitlets] DEBUG: Using default logger
2017-04-17 15:18:54 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001E696FBAD68>
[s] item {}
[s] request <GET https://book.douban.com/subject/26805083/>
[s] response <403 https://book.douban.com/subject/26805083/>
[s] settings <scrapy.settings.Settings object at 0x000001E6993C7B70>
[s] spider <DefaultSpider 'default' at 0x1e69964d1d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
It comes back 403 straight away!
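Inside the shell you can confirm which User-Agent was actually sent, which makes the cause obvious. A quick check (the exact string depends on your Scrapy version):

request.headers.get('User-Agent')
# => something like b'Scrapy/1.3.3 (+http://scrapy.org)', the default from
#    default_settings.py, which is exactly the kind of identifier such sites block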
2. Solutions
There are two ways to fix this:
(1) The first method is to add -s USER_AGENT='Mozilla/5.0' to the command:
C:\Users\fendo>scrapy shell -s USER_AGENT='Mozilla/5.0' https://book.douban.com/subject/26805083/
2017-04-17 15:21:37 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-17 15:21:37 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': "'Mozilla/5.0'"}
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-17 15:21:37 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-17 15:21:37 [scrapy.core.engine] INFO: Spider opened
2017-04-17 15:21:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/subject/26805083/> (referer: None)
2017-04-17 15:21:38 [traitlets] DEBUG: Using default logger
2017-04-17 15:21:38 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001D2DC68AD68>
[s] item {}
[s] request <GET https://book.douban.com/subject/26805083/>
[s] response <200 https://book.douban.com/subject/26805083/>
[s] settings <scrapy.settings.Settings object at 0x000001D2DEAB6B38>
[s] spider <DefaultSpider 'default' at 0x1d2ded3d208>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
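If you are already sitting in a shell that returned the 403, you do not need to restart it: the fetch(req) shortcut listed in the banner above accepts a scrapy.Request, so you can retry with a browser-like User-Agent directly. A minimal sketch (the UA string is just an example):

from scrapy import Request
req = Request('https://book.douban.com/subject/26805083/',
              headers={'User-Agent': 'Mozilla/5.0'})
fetch(req)            # updates the shell's request/response objects
response.status       # 200 if the User-Agent was the only thing being blocked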
The first method is the simplest, but adding the flag to every invocation is tedious; the second method is more convenient.
(2) The second method is to change Scrapy's default user-agent value.
Find the default_settings.py file under your Python installation; mine, for example, is at F:\Software\Python36\Lib\site-packages\scrapy\settings\default_settings.py.
Change
USER_AGENT = 'Scrapy/%s (+http://scrapy.org)' % import_module('scrapy').__version__
to
USER_AGENT = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'
Run scrapy shell again and the HTML is now fetched normally; the 403 error no longer appears.
C:\Users\fendo>scrapy shell "https://book.douban.com/subject/26805083/"
2017-04-17 15:34:13 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-17 15:34:13 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-17 15:34:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-17 15:34:14 [scrapy.core.engine] INFO: Spider opened
2017-04-17 15:34:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/subject/26805083/> (referer: None)
2017-04-17 15:34:15 [traitlets] DEBUG: Using default logger
2017-04-17 15:34:15 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001476886AD68>
[s] item {}
[s] request <GET https://book.douban.com/subject/26805083/>
[s] response <200 https://book.douban.com/subject/26805083/>
[s] settings <scrapy.settings.Settings object at 0x000001476AC97B70>
[s] spider <DefaultSpider 'default' at 0x1476af1d198>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
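One caveat about the second method: default_settings.py lives inside site-packages, so editing it changes the default for every project on the machine, and the edit is lost when Scrapy is upgraded. In a real project it is usually cleaner to override USER_AGENT in the project's own settings.py, or per spider via custom_settings. A sketch (the project and spider names are made up):

# myproject/settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'

# or for a single spider only:
# class BookSpider(scrapy.Spider):
#     name = 'book'
#     custom_settings = {'USER_AGENT': 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'}

If you run scrapy shell from inside that project directory, it picks up the project's settings.py, so the shell uses the same User-Agent without any extra flags.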
---------------------
Author: lfendo
Source: CSDN
Original: https://blog.csdn.net/u011781521/article/details/70211474
Copyright notice: this is the blogger's original article; please include a link to the original when reposting.