Scrapy Learning 10: Request & Response Objects
URL request flow
Scrapy uses Request and Response objects to crawl websites.
Request object
"""
This module implements the Request class which is used to represent HTTP
requests in Scrapy. See documentation in docs/topics/request-response.rst
"""
import six
from w3lib.url import safe_url_string

from scrapy.http.headers import Headers
from scrapy.utils.python import to_bytes
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import escape_ajax
from scrapy.http.common import obsolete_setter


class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        if callback is not None and not callable(callback):
            raise TypeError('callback must be a callable, got %s' % type(callback).__name__)
        if errback is not None and not callable(errback):
            raise TypeError('errback must be a callable, got %s' % type(errback).__name__)
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

    def _get_url(self):
        return self._url

    def _set_url(self, url):
        if not isinstance(url, six.string_types):
            raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)

        s = safe_url_string(url, self.encoding)
        self._url = escape_ajax(s)

        if ':' not in self._url:
            raise ValueError('Missing scheme in request url: %s' % self._url)

    url = property(_get_url, obsolete_setter(_set_url, 'url'))

    def _get_body(self):
        return self._body

    def _set_body(self, body):
        if body is None:
            self._body = b''
        else:
            self._body = to_bytes(body, self.encoding)

    body = property(_get_body, obsolete_setter(_set_body, 'body'))

    @property
    def encoding(self):
        return self._encoding

    def __str__(self):
        return "<%s %s>" % (self.method, self.url)

    __repr__ = __str__

    def copy(self):
        """Return a copy of this Request"""
        return self.replace()

    def replace(self, *args, **kwargs):
        """Create a new Request with the same attributes except for those
        given new values.
        """
        for x in ['url', 'method', 'headers', 'body', 'cookies', 'meta',
                  'encoding', 'priority', 'dont_filter', 'callback', 'errback']:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)
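The copy() and replace() methods above follow a simple pattern: collect the current attributes, let the caller override some of them, and build a new object. The sketch below reproduces that pattern with a hypothetical MiniRequest class that is not part of Scrapy. It keeps only three attributes, so it is a simplified illustration of the idea rather than the real Request.

```python
class MiniRequest:
    """Hypothetical stand-in for Request; it only illustrates copy()/replace()."""

    # Attributes carried over to the new object unless the caller overrides them.
    _fields = ('url', 'method', 'priority')

    def __init__(self, url, method='GET', priority=0):
        self.url = url
        self.method = str(method).upper()
        self.priority = priority

    def copy(self):
        # A copy is a replace() in which nothing is overridden.
        return self.replace()

    def replace(self, *args, **kwargs):
        # Fill in every attribute the caller did not supply.
        for name in self._fields:
            kwargs.setdefault(name, getattr(self, name))
        # 'cls' lets the caller build a different (sub)class from the same data.
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)


original = MiniRequest('http://www.example.com/page/1', priority=5)
next_page = original.replace(url='http://www.example.com/page/2')
duplicate = original.copy()
```

Because replace() always returns a new object, the original request is left unchanged. With the real Request class the same call shape is typically used to re-issue a request with a different URL or priority.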
Explanation of selected parameters
url (string) – the URL of this request.
callback (callable) – the function that will be called with the response of this request (once it is downloaded) as its first parameter. For more information, see "Passing additional data to callback functions" in the official documentation. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
body (str or unicode) – the request body. If a unicode is passed, then it’s encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
headers (dict) – the headers of this request. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
cookies (dict or list) – the request cookies. These can be sent in two forms.
1. Using a dict:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
2. Using a list of dicts:
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
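The note that meta is "shallow copied" has a practical consequence that is easy to miss. The sketch below uses only plain dicts and the same dict(meta) expression that appears in Request.__init__. The keys 'page' and 'tags' are made-up sample data.

```python
meta = {'page': 1, 'tags': ['a']}        # hypothetical dict passed as meta=...

# Same expression as in Request.__init__: a shallow copy, or None if empty.
stored = dict(meta) if meta else None

stored['page'] = 2           # top-level keys are independent of the caller's dict
stored['tags'].append('b')   # nested objects are still shared with the caller
```

After these two lines, meta['page'] is still 1, but meta['tags'] now contains both 'a' and 'b'. If a nested list or dict in meta must stay independent between requests, it has to be copied explicitly before it is passed in.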
Response对象
"""
This module implements the Response class which is used to represent HTTP
responses in Scrapy. See documentation in docs/topics/request-response.rst
"""
from six.moves.urllib.parse import urljoin

from scrapy.http.request import Request
from scrapy.http.headers import Headers
from scrapy.link import Link
from scrapy.utils.trackref import object_ref
from scrapy.http.common import obsolete_setter
from scrapy.exceptions import NotSupported


class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError(
                "Response.meta not available, this response "
                "is not tied to any request"
            )

    def _get_url(self):
        return self._url

    def _set_url(self, url):
        if isinstance(url, str):
            self._url = url
        else:
            raise TypeError('%s url must be str, got %s:' % (type(self).__name__,
                                                             type(url).__name__))

    url = property(_get_url, obsolete_setter(_set_url, 'url'))

    def _get_body(self):
        return self._body

    def _set_body(self, body):
        if body is None:
            self._body = b''
        elif not isinstance(body, bytes):
            raise TypeError(
                "Response body must be bytes. "
                "If you want to pass unicode body use TextResponse "
                "or HtmlResponse.")
        else:
            self._body = body

    body = property(_get_body, obsolete_setter(_set_body, 'body'))

    def __str__(self):
        return "<%d %s>" % (self.status, self.url)

    __repr__ = __str__

    def copy(self):
        """Return a copy of this Response"""
        return self.replace()

    def replace(self, *args, **kwargs):
        """Create a new Response with the same attributes except for those
        given new values.
        """
        for x in ['url', 'status', 'headers', 'body', 'request', 'flags']:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)

    def urljoin(self, url):
        """Join this Response's url with a possible relative url to form an
        absolute interpretation of the latter."""
        return urljoin(self.url, url)

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def follow(self, url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None):
        # type: (...) -> Request
        """
        Return a :class:`~.Request` instance to follow a link ``url``.
        It accepts the same arguments as ``Request.__init__`` method,
        but ``url`` can be a relative URL or a ``scrapy.link.Link`` object,
        not only an absolute URL.

        :class:`~.TextResponse` provides a :meth:`~.TextResponse.follow`
        method which supports selectors in addition to absolute/relative URLs
        and Link objects.
        """
        if isinstance(url, Link):
            url = url.url
        url = self.urljoin(url)
        return Request(url, callback,
                       method=method,
                       headers=headers,
                       body=body,
                       cookies=cookies,
                       meta=meta,
                       encoding=encoding,
                       priority=priority,
                       dont_filter=dont_filter,
                       errback=errback)
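Both Response.urljoin() and Response.follow() in the source above rely on the standard library's urljoin to resolve a link against the URL of the response. The sketch below calls urllib.parse.urljoin directly (on Python 3 this is the function that six.moves.urllib.parse exposes). The base URL and the links are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that was just downloaded.
base = 'http://www.example.com/articles/list.html'

# A relative link is resolved against the directory of the base URL.
relative = urljoin(base, 'detail/1.html')

# A link starting with '/' is resolved against the site root.
rooted = urljoin(base, '/about')

# An absolute link is returned as it is.
absolute = urljoin(base, 'http://other.example.org/x')
```

Here relative becomes 'http://www.example.com/articles/detail/1.html', rooted becomes 'http://www.example.com/about', and absolute is unchanged. This is why a relative href extracted from a page can be passed to follow() without first building the full URL by hand.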
Reference: the official documentation, https://doc.scrapy.org