Request and Response

Request

A Request object is created whenever a spider needs to send a request. Its parameters are as follows:

url: the URL to request.
callback: the function that will handle the Response returned for this request.
method: the HTTP method. Defaults to GET; it can be set to "GET", "POST", "PUT" and so on, written in uppercase.
headers: the headers sent with the request. Usually there is no need to set them. Typical contents look like this:
- Host: media.readthedocs.org
- User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
- Accept: text/css,*/*;q=0.1
- Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
- Accept-Encoding: gzip, deflate
- Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
- Cookie: _ga=GA1.2.1612165614.1415584110;
- Connection: keep-alive
- If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
- Cache-Control: max-age=0

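meta: a dict used to pass data between parsing functions. The spider below uses it to carry a partially filled item from parse to parse_detail: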

# -*- coding: utf-8 -*-
import scrapy
from TencentHR.items import TencenthrItem

class HrSpider(scrapy.Spider):
    name = 'hr'
    # allowed_domains = ['ddd']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        trs = response.xpath('//table[@class="tablelist"]/tr[@class="odd"] | //table[@class="tablelist"]/tr[@class="even"]')
        # print(len(trs))
        for tr in trs:
            items = TencenthrItem()
            detail_url = tr.xpath('./td/a/@href').extract()[0]
            items['position_name'] = tr.xpath('./td/a/text()').extract()[0]
            try:
                items['position_type'] = tr.xpath('./td[2]/text()').extract()[0]
            except IndexError:
                # extract() returns an empty list when the type cell has no text
                print("Position {} has no type, url: {}".format(items['position_name'], "https://hr.tencent.com/" + detail_url))
                items['position_type'] = None
            items['position_num'] = tr.xpath('./td[3]/text()').extract()[0]
            items['publish_time'] = tr.xpath('./td[5]/text()').extract()[0]
            items['work_addr'] = tr.xpath('./td[4]/text()').extract()[0]

            detail_url = 'https://hr.tencent.com/' + detail_url
            # Pass the partially filled item on to parse_detail through meta
            yield scrapy.Request(detail_url,
                                 callback=self.parse_detail,
                                 meta={"items": items}
                                 )
            
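        next_url = response.xpath('//a[text()="下一页"]/@href').extract_first()
        # extract_first() returns None when no "next page" link is found,
        # so only follow the link when one exists
        if next_url:
            next_url = 'https://hr.tencent.com/' + next_url
            print(next_url)
            yield scrapy.Request(next_url,
                                 callback=self.parse
                                 )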

    def parse_detail(self, response):
        # Pick up the item that parse attached to the request via meta
        items = response.meta['items']
        items["work_duty"] = response.xpath('//table[@class="tablelist textl"]/tr[3]//li/text()').extract()
        items["work_require"] = response.xpath('//table[@class="tablelist textl"]/tr[4]//li/text()').extract()
        yield items
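
The spider above builds absolute URLs by string concatenation. A sketch of a more forgiving alternative is urljoin from the standard library, which resolves both relative and already-absolute links against the page URL; Scrapy's response.urljoin(href) does the same with the response URL as the base. The href values below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://hr.tencent.com/position.php"

# Hypothetical href values of the kind a listing page might contain
relative_href = "position_detail.php?id=1"
absolute_href = "https://hr.tencent.com/position_detail.php?id=2"

# A relative link is resolved against the directory of the base URL
detail_url = urljoin(base_url, relative_href)
# An absolute link is returned unchanged
same_url = urljoin(base_url, absolute_href)

print(detail_url)
print(same_url)
```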

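encoding: the default, 'utf-8', is fine.
dont_filter: tells the scheduler not to filter this request. Set it to True when you want to send the same request more than once and bypass the duplicate filter. Defaults to False.
errback: the function to call when an error occurs while processing the request.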

Response

Response attributes and the methods you can call:

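meta: the meta dict of the request that produced this response; it carries data from one parsing function to the next.
encoding: the character encoding of the response.
text: the response body as a Unicode string (str).
body: the response body as bytes.
xpath: parses the response with an XPath expression.
css: parses the response with a CSS selector.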

Sending POST requests

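To send a POST request, use FormRequest, a subclass of Request. If the spider has to send a POST request as its very first request, override the start_requests(self) method in the spider class; the URLs in start_urls are then no longer requested.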

Example: logging in to Douban

# -*- coding: utf-8 -*-
import scrapy

class TestSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['douban.com']  # also covers accounts.douban.com, which the login URL uses
    # start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        login_url = "https://accounts.douban.com/j/mobile/login/basic"
        headers = {
            'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
        }
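        formdata = {
            'ck': '',
            'name': 'your_username',      # placeholder: replace with your account name
            'password': 'your_password',  # placeholder: replace with your password
            'remember': 'true',
            'ticket': ''
        }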
        request = scrapy.FormRequest(login_url, callback=self.parse, formdata=formdata, headers=headers)
        yield request

    def parse(self, response):
        print(response.text)

The returned result shows that the login succeeded:

{"status":"success","message":"success","description":"处理成功","payload":{"account_info":{"name":"仅此而已","weixin_binded":false,"phone":"手机号","avatar":{"medium":"https://img3.doubanio.com\/icon\/user_large.jpg","median":"https://img1.doubanio.com\/icon\/user_normal.jpg","large":"https://img3.doubanio.com\/icon\/user_large.jpg","raw":"https://img3.doubanio.com\/icon\/user_large.jpg","small":"https://img1.doubanio.com\/icon\/user_normal.jpg","icon":"https://img3.doubanio.com\/pics\/icon\/user_icon.jpg"},"id":"193317985","uid":"193317985"}}}
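
Rather than reading the printed JSON by eye, the callback can check the status field programmatically. A minimal sketch, assuming a shortened, made-up stand-in for response.text that has the same top-level keys as the output above:

```python
import json

# Shortened, made-up stand-in for response.text
body_text = '{"status": "success", "message": "success", "payload": {"account_info": {"id": "0"}}}'

result = json.loads(body_text)

# Treat the login as successful only when the status field says so
logged_in = result.get("status") == "success"
# Walk the nested payload defensively in case a key is missing
account_id = result.get("payload", {}).get("account_info", {}).get("id")

print(logged_in, account_id)
```

In the spider, the same check would go at the top of parse, before any page that requires a login is requested.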

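After logging in, request the personal home page. Scrapy's cookie middleware, which is enabled by default, sends the session cookies from the login response with later requests, so the output shows that pages behind the login are now accessible: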

# -*- coding: utf-8 -*-
import scrapy

class TestSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['douban.com']  # covers both www.douban.com and accounts.douban.com
    # start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        login_url = "https://accounts.douban.com/j/mobile/login/basic"
        headers = {
            'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
        }
        formdata = {
            'ck': '',
            'name': 'your_username',      # placeholder: replace with your account name
            'password': 'your_password',  # placeholder: replace with your password
            'remember': 'true',
            'ticket': ''
        }
        request = scrapy.FormRequest(login_url, callback=self.parse, formdata=formdata, headers=headers)
        yield request

    def parse(self, response):
        print(response.text)
        # After a successful login, visit the personal home page
        url = "https://www.douban.com/people/193317985/"
        yield scrapy.Request(url=url, callback=self.parse_detail)

    def parse_detail(self, response):
        print(response.text)
