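Python Web Scraping from Beginner to Advanced (3): Using requests

Quickstart (official docs: http://www.python-requests.org/en/master/user/quickstart/)

Making a request

First, import the Requests module:

import requests

Now try fetching a web page:

r = requests.get('https://api.github.com/events')

The returned r is a Response object, and all the information about the response can be read from it.

Requests' simple API means that every HTTP request type is obvious. For example, this is how to send an HTTP POST request:

r = requests.post('https://httpbin.org/post', data={'key': 'value'})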

Passing parameters in URLs

You often want to send some data in the URL's query string. If you build the URL by hand, the data appears as key/value pairs after the question mark, for example: httpbin.org/get?key=value

Requests lets you supply these arguments as a dictionary of strings through the params keyword argument. For example, to pass key1=value1 and key2=value2 to httpbin.org/get, you can use the following code:

def url_params():
    payload = {'key1': 'value1', 'key2': 'value2'}
    r = requests.get("http://httpbin.org/get", params=payload)
    print(r.url)
    # http://httpbin.org/get?key1=value1&key2=value2

You can also pass a list as a value:

# 2. Passing URL parameters: a list can be passed as a value
def url_params_2():
    payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
    r = requests.get("http://httpbin.org/get", params=payload)
    print(r.url)
    # http://httpbin.org/get?key1=value1&key2=value2&key2=value3
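To check how Requests encodes params without sending anything, the request can be prepared locally. The following is a minimal sketch: the helper name build_url is made up for this illustration, and it assumes that requests.Request(...).prepare() applies the same URL encoding as requests.get does.

```python
import requests


def build_url(base_url, params):
    """Return the URL Requests would use, without sending the request."""
    # prepare() builds the final URL locally; no network access is needed
    prepared = requests.Request('GET', base_url, params=params).prepare()
    return prepared.url


# A list value repeats the key; a key whose value is None is left out.
url = build_url('http://httpbin.org/get',
                {'key1': 'value1', 'key2': ['value2', 'value3'], 'key3': None})
print(url)
```

This is handy for debugging a crawler, because the exact query string can be inspected before any request goes out.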

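Response content

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly, and the decoded text is available as r.text.

# 3. Response content
def response_text():
    r = requests.get('https://httpbin.org/get')
    print(r.text)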

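Binary response content

For non-text responses, r.content gives the response body as bytes. Requests automatically decodes gzip and deflate transfer-encodings for you.

# 4. Binary response content
def response_content():
    r = requests.get('https://httpbin.org/get')
    print(r.content)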

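JSON response content

# 5. JSON response content
def response_json():
    r = requests.get('https://httpbin.org/get')
    print(r.json())

If JSON decoding fails, r.json() raises an exception. For example, if the response is a 204 (No Content), or if it contains invalid JSON, calling r.json() raises a ValueError. The exact message depends on the Python and Requests versions: under Python 2 it reads "ValueError: No JSON object could be decoded", and recent Requests versions raise requests.exceptions.JSONDecodeError, which is a subclass of ValueError.

Note that a successful call to r.json() does not mean the request itself succeeded. Some servers return a JSON object in a failed response, such as error details. To verify that a request succeeded, call r.raise_for_status() or check r.status_code.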
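The two checks described above can be combined into one small helper. This is only a sketch under assumptions: the function name json_or_none is hypothetical, and returning None for every failure is a simplification that real code may want to replace with logging or retries.

```python
import requests


def json_or_none(response):
    """Return the decoded JSON body, or None for an error status or invalid JSON."""
    try:
        # Raises requests.exceptions.HTTPError for 4xx and 5xx responses
        response.raise_for_status()
        # Raises ValueError (or a subclass of it) when the body is not valid JSON
        return response.json()
    except (requests.exceptions.HTTPError, ValueError):
        return None
```

For example, json_or_none(requests.get('https://httpbin.org/get', timeout=10)) should return the decoded dict, while a response with a 404 status or a non-JSON body yields None.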

Custom headers

If you want to add HTTP headers to a request, simply pass a dict to the headers parameter.

# 6. Custom request headers
def add_header():
    url = 'https://httpbin.org/get'
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET '
    }
    r = requests.get(url=url, headers=headers)
    print(r.text)
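To confirm which headers would actually be sent, the request can be prepared locally and inspected. This is a small sketch; the User-Agent value 'my-crawler/0.1' is a made-up placeholder, not something from the original article.

```python
import requests

# Hypothetical User-Agent value used only for this illustration
headers = {'User-Agent': 'my-crawler/0.1'}

# prepare() builds the request locally; nothing is sent over the network
prepared = requests.Request('GET', 'https://httpbin.org/get', headers=headers).prepare()

# Header lookups on a prepared request are case-insensitive
print(prepared.headers['user-agent'])
```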

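More complicated POST requests

Typically, to send form-encoded data, you simply pass a dictionary to the data argument. It is form-encoded automatically when the request is made:

# 7. More complicated POST requests
def complicated_post():
    payload = {'key1': 'value1', 'key2': 'value2'}
    r = requests.post("https://httpbin.org/post", data=payload)
    print(r.text)

The data argument can also hold multiple values for each key. To do this, pass either a list of tuples or a dictionary with lists as values. This is typically useful when a form has several elements that share the same key.

def complicated_post_2():
    payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
    r1 = requests.post('https://httpbin.org/post', data=payload_tuples)
    print(r1.text)
    payload_dict = {'key1': ['value1', 'value2']}
    r2 = requests.post('https://httpbin.org/post', data=payload_dict)
    # The official docs show True here, but this printed False in my run, most
    # likely because httpbin echoes per-request details (such as a trace header)
    # in the body, so the two full texts differ.
    print(r1.text == r2.text)
    # Compare only the echoed form data instead
    print(r1.json()['form'] == r2.json()['form'])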
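Whether the two payload styles are equivalent can also be checked without contacting httpbin at all, by preparing both requests locally and comparing the encoded bodies. This is a sketch that assumes Request.prepare() encodes form data the same way requests.post does.

```python
import requests

url = 'https://httpbin.org/post'

# Prepare both requests locally; nothing is sent over the network
body_from_tuples = requests.Request(
    'POST', url, data=[('key1', 'value1'), ('key1', 'value2')]).prepare().body
body_from_dict = requests.Request(
    'POST', url, data={'key1': ['value1', 'value2']}).prepare().body

print(body_from_tuples)
print(body_from_tuples == body_from_dict)
```

Comparing the prepared bodies avoids the response-side noise, since only the form encoding itself is being compared.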

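Sometimes you may want to send data that is not form-encoded. If you pass a string instead of a dict, the data is posted as-is. For example:

import json

def complicated_post_3():
    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    r = requests.post(url, data=json.dumps(payload))
    print(r.text)

You can also pass the dict directly with the json parameter, and it is encoded automatically. For example:

def complicated_post_4():
    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    r = requests.post(url, json=payload)
    print(r.text)

Note: the json parameter is ignored if either data or files is passed.

Using the json parameter in a request sets the Content-Type header to application/json.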
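The difference between sending a JSON string through data and using the json parameter shows up in the headers of the prepared request. The sketch below prepares both variants locally; it assumes the behaviour of current Requests releases, where a plain string body gets no Content-Type header by default.

```python
import json
import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# Prepared locally; nothing is sent over the network
as_string = requests.Request('POST', url, data=json.dumps(payload)).prepare()
as_json = requests.Request('POST', url, json=payload).prepare()

# A plain string body gets no Content-Type header from Requests
print(as_string.headers.get('Content-Type'))
# The json parameter sets the header for you
print(as_json.headers.get('Content-Type'))
```

If a server insists on application/json, the json parameter is therefore the simpler choice; with data=json.dumps(...) the header would have to be added by hand.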

POST a Multipart-Encoded file

Requests makes it simple to upload Multipart-Encoded files:

# 8. Uploading a Multipart-Encoded file with POST
def post_multipart_file():
    url = 'https://httpbin.org/post'
    # Open in binary mode; the with block closes the file after the upload
    with open('report.xls', 'rb') as f:
        files = {'file': f}
        r = requests.post(url, files=files, timeout=10)
    print(r.text)

You can set the filename, content type, and headers explicitly:

def post_multipart_file_2():
    url = 'https://httpbin.org/post'
    with open('report.xls', 'rb') as f:
        files = {'file': ('report.xls', f, 'application/vnd.ms-excel', {'Expires': ''})}
        r = requests.post(url, files=files, timeout=10)
    print(r.text)

If you want, you can send a string that the server receives as a file:

def post_multipart_file_3():
    url = 'https://httpbin.org/post'
    files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
    r = requests.post(url, files=files, timeout=10)
    print(r.text)

Note: by default, Requests does not support streaming a very large file as a multipart/form-data request. The separate requests-toolbelt package provides that.
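What Requests builds for a files upload can be inspected locally as well. The sketch below prepares the string-as-file example without sending it, and looks at the generated Content-Type header and body.

```python
import requests

files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}

# Prepared locally; nothing is sent over the network
prepared = requests.Request('POST', 'https://httpbin.org/post', files=files).prepare()

# Requests picks the multipart content type and generates the boundary itself
print(prepared.headers['Content-Type'].split(';')[0])
# The body is bytes and carries the filename from the tuple
print(b'filename="report.csv"' in prepared.body)
```

Because the boundary is generated per request, it is usually best not to set the Content-Type header by hand when passing files.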

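Response status codes

# 9. Response status codes
def response_status_code():
    r = requests.get('https://httpbin.org/get', timeout=10)
    print(r.status_code)
    # requests.codes is a built-in lookup object for status codes
    print(r.status_code == requests.codes.ok)
    # r.raise_for_status() returns None when the request succeeded
    # print(r.raise_for_status())

# Status code of a failed request
def error_response_status_code():
    bad_r = requests.get('https://httpbin.org/status/404')
    print(bad_r.status_code)
    # raise_for_status() raises requests.exceptions.HTTPError for a 4xx or 5xx
    # response, so this print never outputs anything
    print(bad_r.raise_for_status())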
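Since raise_for_status() raises instead of returning a value, it is normally wrapped in try/except. A minimal sketch follows; the helper name check_status and its return strings are invented for this example.

```python
import requests


def check_status(response):
    """Return a short description of the outcome, based on the status code."""
    try:
        # Does nothing on success; raises HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return 'failed: {}'.format(exc)
    return 'ok: {}'.format(response.status_code)
```

Calling check_status(requests.get('https://httpbin.org/status/404', timeout=10)) should give a string starting with "failed: 404", while a 200 response gives "ok: 200".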

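Response headers

# 10. Response headers
def response_header():
    r = requests.get('https://httpbin.org/get', timeout=10)
    print(r.headers)
    # Header names are case-insensitive, so both lookups return the same value
    print(r.headers['Content-Type'])
    print(r.headers.get('content-type'))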
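Both lookups work because r.headers is a case-insensitive dictionary. The behaviour can be tried in isolation with the same class Requests uses, requests.structures.CaseInsensitiveDict; the header value below is just sample data.

```python
from requests.structures import CaseInsensitiveDict

# Sample header data for illustration
headers = CaseInsensitiveDict({'Content-Type': 'application/json'})

# Any capitalization finds the same entry
print(headers['content-type'])
print(headers.get('CONTENT-TYPE'))
# Iteration keeps the original spelling of the key
print(list(headers))
```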

Cookies

# 11. Cookies
def cookies():
    # Read a cookie from the response
    # url = 'http://example.com/some/cookie/setting/url'
    # r = requests.get(url)
    # print(r.cookies['example_cookie_name'])
    # Send cookies to the server
    url = 'https://httpbin.org/cookies'
    cookies = dict(cookies_are='working')
    r = requests.get(url, cookies=cookies, timeout=10)
    print(r.text)

Cookies are returned in a RequestsCookieJar, which acts like a dict but also offers a more complete interface, suitable for use over multiple domains or paths. Cookie jars can also be passed in to requests:

def cookiejar():
    jar = requests.cookies.RequestsCookieJar()
    jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
    jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
    url = 'https://httpbin.org/cookies'
    r = requests.get(url, cookies=jar)
    print(r.text)
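In the jar example, a request to /cookies is expected to carry only tasty_cookie, because gross_cookie is stored for a different path. The jar can be queried locally to see this split. Note that get_dict filters by exact domain and path, so this sketch only approximates the matching that happens when a request is actually sent.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Select only the cookies stored for this exact domain and path
for_cookies_path = jar.get_dict(domain='httpbin.org', path='/cookies')
print(for_cookies_path)

# A single cookie can be looked up with the same filters
print(jar.get('gross_cookie', domain='httpbin.org', path='/elsewhere'))
```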

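Redirection and history

By default, Requests performs location redirection for all verbs except HEAD.

You can use the history property of the Response object to track redirection.

The Response.history list contains the Response objects that were created in order to complete the request, sorted from the oldest response to the most recent.

# 12. Redirection and history
def redirection_history():
    r = requests.get('http://github.com/')
    print(r.url)
    print(r.status_code)
    print(r.history)

If you are using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter.

With HEAD, you can enable redirection in the same way:

def redirection_history_2():
    # Disable redirection
    r = requests.get('http://github.com/', allow_redirects=False)
    print(r.status_code)
    print(r.history)
    # Enable redirection for HEAD
    r = requests.head('http://github.com/', allow_redirects=True)
    print(r.status_code)
    print(r.history)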
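Printing r.history directly only shows the Response objects. A small helper can turn the whole redirect chain into readable pairs; the name redirect_chain is hypothetical and the code is only a sketch.

```python
import requests


def redirect_chain(response):
    """Return (status_code, url) pairs for every hop, oldest first."""
    # response.history holds the intermediate responses; the final one is response itself
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]
```

For instance, redirect_chain(requests.get('http://github.com/', timeout=10)) would typically list a redirect status for the http URL followed by the final https response, though the exact codes depend on the server.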

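Timeouts

You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests; without it, your program may hang indefinitely.

Note: timeout is not a time limit on the entire response download. Rather, an exception is raised if the server has not issued a response within timeout seconds. If no timeout is specified, requests do not time out.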
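A typical way to apply this is to pass timeout on every call and catch the timeout exception. The sketch below is illustrative only: fetch_or_none and its get parameter are hypothetical (the latter exists so the function can be tried without network access), and the default values are arbitrary.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10), get=requests.get):
    """Return the response, or None if the request timed out.

    timeout is a (connect, read) tuple in seconds; a single number applies to both.
    The get parameter is a hypothetical hook for trying the function offline.
    """
    try:
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout
        return None
```

A call such as fetch_or_none('https://httpbin.org/delay/20') should return None once the read timeout passes, rather than blocking for the full delay.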

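Errors and exceptions

In the event of a network problem (for example a DNS failure or a refused connection), Requests raises a ConnectionError exception.

If the HTTP request returns an unsuccessful status code, Response.raise_for_status() raises an HTTPError.

If a request times out, a Timeout exception is raised.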
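All of these exceptions inherit from requests.exceptions.RequestException, so a crawler can catch that base class and then tell the cases apart. The helper below is a hypothetical sketch; its name and labels are invented for this example.

```python
import requests


def describe_error(exc):
    """Map an exception to a short label."""
    # Timeout is checked first: ConnectTimeout is both a Timeout and a ConnectionError
    if isinstance(exc, requests.exceptions.Timeout):
        return 'timeout'
    if isinstance(exc, requests.exceptions.ConnectionError):
        return 'connection error'
    if isinstance(exc, requests.exceptions.HTTPError):
        return 'bad status'
    if isinstance(exc, requests.exceptions.RequestException):
        return 'other request error'
    return 'not a Requests error'
```

In practice this would sit inside an except requests.exceptions.RequestException as exc: block around the request and the raise_for_status() call.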
