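1. Problem description

scrapy shell is a convenient debugging tool, but some sites have anti-crawler protection, so a request made from scrapy shell comes back with a 403, as in the session below.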

C:\Users\fendo>scrapy shell https://book.douban.com/subject/26805083/
2017-04-17 15:18:53 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-17 15:18:53 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-04-17 15:18:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-17 15:18:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-17 15:18:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-17 15:18:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-17 15:18:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-17 15:18:54 [scrapy.core.engine] INFO: Spider opened
2017-04-17 15:18:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/subject/26805083/> (referer: None)
2017-04-17 15:18:54 [traitlets] DEBUG: Using default logger
2017-04-17 15:18:54 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001E696FBAD68>
[s] item {}
[s] request <GET https://book.douban.com/subject/26805083/>
[s] response <403 https://book.douban.com/subject/26805083/>
[s] settings <scrapy.settings.Settings object at 0x000001E6993C7B70>
[s] spider <DefaultSpider 'default' at 0x1e69964d1d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

The site returns 403 straight away.
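The log does not say why the server refused the request. A common cause is that the server filters on the User-Agent header, and Scrapy by default identifies itself as Scrapy/<version>. One way to check this outside Scrapy is to request the same URL with two different User-Agent values and compare the status codes. A minimal sketch using only the standard library; build_request and status_for are hypothetical helper names, and which status a given site returns depends on the site and may change over time:

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server answers with for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx answers are raised as HTTPError; the code is still the status.
        return exc.code


# Example calls (they need network access, so they are left commented out):
#   url = "https://book.douban.com/subject/26805083/"
#   status_for(url, "Scrapy/1.3.3 (+http://scrapy.org)")
#   status_for(url, "Mozilla/5.0")
```

If the two calls return different codes, the User-Agent header is what the server is reacting to.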

2. Solutions

There are two ways to fix this:

(1) The first method is to add -s USER_AGENT='Mozilla/5.0' to the command:

C:\Users\fendo>scrapy shell -s USER_AGENT='Mozilla/5.0' https://book.douban.com/subject/26805083/
2017-04-17 15:21:37 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-17 15:21:37 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': "'Mozilla/5.0'"}
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-17 15:21:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-17 15:21:37 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-17 15:21:37 [scrapy.core.engine] INFO: Spider opened
2017-04-17 15:21:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/subject/26805083/> (referer: None)
2017-04-17 15:21:38 [traitlets] DEBUG: Using default logger
2017-04-17 15:21:38 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001D2DC68AD68>
[s] item {}
[s] request <GET https://book.douban.com/subject/26805083/>
[s] response <200 https://book.douban.com/subject/26805083/>
[s] settings <scrapy.settings.Settings object at 0x000001D2DEAB6B38>
[s] spider <DefaultSpider 'default' at 0x1d2ded3d208>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
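Note the "Overridden settings" line in the log above: the value is recorded as "'Mozilla/5.0'", with the single quotes included, because the Windows command prompt does not strip single quotes. The request still succeeded here, but if you want the exact string sent, one option is to launch the command from Python with an argument list, which needs no shell quoting at all. A minimal sketch; scrapy_shell_command and open_shell are hypothetical helper names, and it assumes the scrapy executable is on PATH:

```python
import subprocess


def scrapy_shell_command(url, user_agent="Mozilla/5.0"):
    """Build the argument list for `scrapy shell` with a USER_AGENT override."""
    return ["scrapy", "shell", "-s", f"USER_AGENT={user_agent}", url]


def open_shell(url, user_agent="Mozilla/5.0"):
    """Start an interactive scrapy shell for `url` and return its exit code."""
    # A list of arguments (no shell=True) avoids shell quoting, so no stray
    # quote characters end up inside the setting value.
    return subprocess.call(scrapy_shell_command(url, user_agent))


# Example (starts an interactive session, so it is left commented out):
#   open_shell("https://book.douban.com/subject/26805083/")
```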

The first method is the simplest, but adding the option on every invocation is tedious, so the second method is more convenient.

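(2) The second method is to change Scrapy's default user agent.

Find the default_settings.py file under your Python installation directory; on my machine it is F:\Software\Python36\Lib\site-packages\scrapy\settings\default_settings.py. In that file, change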

USER_AGENT = 'Scrapy/%s (+http://scrapy.org)' % import_module('scrapy').__version__

to

USER_AGENT = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'
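The path given above is specific to one machine. To locate the file for whichever Python interpreter runs your scrapy command, a small sketch using only the standard library may help; find_default_settings is a hypothetical helper name, and it returns None when Scrapy is not installed for that interpreter:

```python
import importlib.util
from pathlib import Path


def find_default_settings():
    """Return the path of Scrapy's default_settings.py, or None if Scrapy is absent."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("scrapy")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at scrapy/__init__.py; the settings package sits next to it.
    return Path(spec.origin).parent / "settings" / "default_settings.py"


print(find_default_settings())
```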

Run the shell again: the HTML now loads normally and the 403 error no longer appears.

C:\Users\fendo>scrapy shell "https://book.douban.com/subject/26805083/"
2017-04-17 15:34:13 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-17 15:34:13 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-17 15:34:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-17 15:34:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-17 15:34:14 [scrapy.core.engine] INFO: Spider opened
2017-04-17 15:34:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/subject/26805083/> (referer: None)
2017-04-17 15:34:15 [traitlets] DEBUG: Using default logger
2017-04-17 15:34:15 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001476886AD68>
[s] item {}
[s] request <GET https://book.douban.com/subject/26805083/>
[s] response <200 https://book.douban.com/subject/26805083/>
[s] settings <scrapy.settings.Settings object at 0x000001476AC97B70>
[s] spider <DefaultSpider 'default' at 0x1476af1d198>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
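Editing default_settings.py under site-packages changes the default for every project on the machine, and reinstalling or upgrading Scrapy will most likely restore the original file. A less invasive sketch, assuming you run scrapy shell from inside a Scrapy project (a directory tree containing scrapy.cfg), is to put the same assignment in that project's settings.py, which takes precedence over the built-in default:

```python
# settings.py of a Scrapy project.
# Assumption: a project exists; the sessions in this article were started
# outside any project (the log shows the default bot name "scrapybot").

# Same browser-like value as used in the article.
USER_AGENT = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'
```

With this in place, a scrapy shell started from the project directory should report USER_AGENT in its "Overridden settings" line, and the installed Scrapy files stay untouched.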

---------------------
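Author: lfendo
Source: CSDN
Original article: https://blog.csdn.net/u011781521/article/details/70211474
Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.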
