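Using Basic Crawler Libraries: the requests Library

Using requests to handle Cookies, login authentication, proxy settings, and more

With urllib, handling web authentication and Cookies requires writing an Opener and a Handler. The more powerful requests library exists to make these operations more convenient.

Example: a simple use of the requests library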

 import requests

 r = requests.get('http://www.baidu.com/')

 print(type(r), r.status_code, r.text, r.cookies, sep='\n\n')

 # Output:

 <class 'requests.models.Response'>

 200

 <!DOCTYPE html>

 <!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible

 ......

 feedback>æè§åé¦</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

 <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

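GET requests

A GET request returns the corresponding request information.

requests.get(url, params=None, **kwargs)

url is the link of the page to fetch, params holds extra URL parameters (a dict or bytes), and **kwargs stands for the 12 optional arguments that control the request.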

 import requests

 r = requests.get('http://httpbin.org/get')

 print(r.text)

 # Output:

 {

   "args": {},

   "headers": {

     "Accept": "*/*",

     "Accept-Encoding": "gzip, deflate",

     "Host": "httpbin.org",

     "User-Agent": "python-requests/2.21.0"

   },

   "origin": "120.85.108.192, 120.85.108.192",

   "url": "https://httpbin.org/get"

 }

 # The result contains the request headers, URL, IP, and other information

 import requests

 data = {

     'name': 'LiYihua',

     'age': '21'

 }

 r = requests.get('http://httpbin.org/get', params=data)

 print(r.text)

 # Output:

 {

   "args": {

     "age": "21",

     "name": "LiYihua"

   },

   "headers": {

     "Accept": "*/*",

     "Accept-Encoding": "gzip, deflate",

     "Host": "httpbin.org",

     "User-Agent": "python-requests/2.21.0"

   },

   "origin": "120.85.108.92, 120.85.108.92",

   "url": "https://httpbin.org/get?name=LiYihua&age=21"

 }
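How params end up in the query string can also be checked without sending anything. The sketch below builds the request locally with requests.Request and prepare(), then prints the resulting URL; the parameter values are the same sample values used above.

```python
import requests

# Build the request locally without sending it; prepare() encodes params
# into the query string of the final URL.
params = {'name': 'LiYihua', 'age': 21}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=params).prepare()

print(prepared.url)
```

Non-string values such as the integer 21 are converted to text when the URL is built.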

 import requests

 r = requests.get('http://httpbin.org/get')

 print(type(r.text), r.json(), type(r.json()), sep='\n\n')

 # Output:

 <class 'str'>

 {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '120.85.108.92, 120.85.108.92', 'url': 'https://httpbin.org/get'}

 <class 'dict'>

 # The json() method converts a response whose body is a JSON string into a dict
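As a rough mental model, r.json() behaves much like calling json.loads() on the response text. The sketch below shows that conversion with the standard library only; the sample string is a shortened, hypothetical stand-in for a real httpbin response.

```python
import json

# A shortened, hypothetical sample of the JSON text a server might return.
text = '{"args": {}, "url": "https://httpbin.org/get"}'

data = json.loads(text)          # JSON string -> Python dict

print(type(text), type(data))
print(data['url'])
```

If the body is not valid JSON, parsing fails with an exception, so r.json() is best called only on responses that are expected to contain JSON.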

Fetching binary data

 import requests

 r = requests.get('https://github.com/favicon.ico')

 print(r.text, r.content, sep='\n\n')

 # r.content returns the data as bytes.

 # To get an image or a file, use r.content.

 # r.text returns the data as a Unicode string.

 # To get text, use r.text.

 # Output:

 :�������OL��......

 b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x0......
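The difference between the two attributes comes down to decoding: r.text is r.content decoded with the encoding stored in r.encoding. When that encoding is wrong, the text comes out garbled, which is the likely cause of the unreadable characters in the Baidu output earlier. The sketch below reproduces the effect with the standard library only; the sample string is an assumption chosen for illustration.

```python
# Decoding UTF-8 bytes with the wrong codec produces garbled text.
# The sample text is an assumption chosen for illustration.
raw = '意见反馈'.encode('utf-8')       # bytes, as a server would send them

garbled = raw.decode('iso-8859-1')    # wrong codec: unreadable characters
correct = raw.decode('utf-8')         # right codec: the original text

print(repr(garbled))
print(correct)
```

With a real response, setting r.encoding = 'utf-8' (or r.encoding = r.apparent_encoding) before reading r.text should have the same corrective effect.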

Saving the fetched image

 import requests

 r = requests.get('https://github.com/favicon.ico')

 with open('favicon.ico', 'wb') as f:

     f.write(r.content)

 # After running, a file named favicon.ico is created

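The open() function and the with ... as statement used in the previous example

# The open() function

# def open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

# Common parameters:

file: the file to open

mode: the mode in which the file is opened (read-only, write, append, and so on)

buffering: 0 turns buffering off (allowed only in binary mode); 1 selects line buffering (text mode only); an integer greater than 1 sets the buffer size in bytes; a negative value, the default, uses the system default buffer size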

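# Values of the mode parameter

    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated; removed in Python 3.11)
    ========= ===============================================================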
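The mode characters can be combined, as in the 'wb' used to save the icon. The following sketch exercises a few combinations inside a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

# Work in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')   # hypothetical file name

    with open(path, 'wb') as f:            # 'w' truncates, 'b' means binary
        f.write(b'\x00\x01')

    with open(path, 'ab') as f:            # 'a' appends to the end
        f.write(b'\x02')

    with open(path, 'rb') as f:            # 'r' reads; 'b' returns bytes
        content = f.read()

    try:
        open(path, 'xb')                   # 'x' refuses to overwrite
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(content, exclusive_failed)
```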

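# The with ... as statement

Some tasks need setup beforehand and cleanup afterwards. Python's with statement handles this pattern conveniently.

The object that the expression after with evaluates to must have an __enter__() method and an __exit__() method. After the expression is evaluated, the __enter__() method of the resulting object is called, and its return value is assigned to the variable after as. When the code block under with has finished executing, even if it raised an exception, the object's __exit__() method is called.

Code illustration: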

class Sample:

    def __enter__(self):

        print("In __enter__()")

        return "Foo"

    def __exit__(self, type, value, trace):

        print("In __exit__()")

def get_sample():

    return Sample()

with get_sample() as sample:

    print("sample:", sample)
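One property worth seeing in action is that __exit__() also runs when the body of the with block raises. The sketch below records the order of calls; the class name Recorder is made up for this illustration.

```python
class Recorder:
    """Context manager that records the order of calls (illustrative name)."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on normal exit and on error; exc_type is None without an error.
        if exc_type is None:
            self.events.append('exit')
        else:
            self.events.append('exit after ' + exc_type.__name__)
        return False   # do not suppress the exception


rec = Recorder()
try:
    with rec:
        rec.events.append('body')
        raise ValueError('boom')
except ValueError:
    rec.events.append('handled')

print(rec.events)
```

This is the same mechanism that lets with open(...) close the file even when writing fails partway through.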

Adding headers

 import requests

 r = requests.get('https://www.zhihu.com/explore')

 print(r.text)

 # Output:

 <html>

 <head><title>400 Bad Request</title></head>

 <body bgcolor="white">

 <center><h1>400 Bad Request</h1></center>

 <hr><center>openresty</center>

 </body>

 </html>

 # Some sites require headers to be sent; without them the request fails

 import requests

 headers = {

     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) '

                   'Chrome/52.0.2743.116 Safari/537.36'

 }

 r = requests.get('https://www.zhihu.com/explore', headers=headers)

 print(r.text)

 # Output:

 <!DOCTYPE html>

 <html lang="zh-CN" dropEffect="none" class="no-js no-auth ">

 <head>

 <meta charset="utf-8" />

 ......

 <script type="text/zscript" znonce="d78db0c15fa84270ac967503884baf11"></script>

 <input type="hidden" name="_xsrf" value="cdb6166e0dc5f38afc3ee95053d7ef55"/>

 </body>

 </html>

POST requests

POST is another common HTTP request method.

 import requests

 data = {

     'name': 'LiYihua',

     'age': 21

 }

 r = requests.post('http://httpbin.org/post', data=data)

 print(r.text)

 # Output:

 {

   "args": {},

   "data": "",

   "files": {},

   "form": {

     "age": "21",

     "name": "LiYihua"

   },

   "headers": {

     "Accept": "*/*",

     "Accept-Encoding": "gzip, deflate",

     "Content-Length": "19",

     "Content-Type": "application/x-www-form-urlencoded",

     "Host": "httpbin.org",

     "User-Agent": "python-requests/2.21.0"

   },

   "json": null,

   "origin": "120.85.108.90, 120.85.108.90",

   "url": "https://httpbin.org/post"

 }

 # The POST request succeeded; the form field of the result holds the submitted data

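Responses

text and content give the body of the response.

The status_code attribute gives the status code, headers gives the response headers, and cookies gives the Cookies.

The url attribute gives the URL, and history gives the request history.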

 import requests

 r = requests.get('https://www.cnblogs.com/liyihua/')

 print(type(r.status_code), r.status_code,

       type(r.headers), r.headers,

       type(r.cookies), r.cookies,

       type(r.url), r.url,

       type(r.history), r.history,

       sep='\n\n')

 # Output:

 <class 'int'>

 200

 <class 'requests.structures.CaseInsensitiveDict'>

 {'Date': 'Thu, 20 Jun 2019 08:18:00 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'private, max-age=10', 'Expires': 'Thu, 20 Jun 2019 08:18:10 GMT', 'Last-Modified': 'Thu, 20 Jun 2019 08:18:00 GMT', 'X-UA-Compatible': 'IE=10', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}

 <class 'requests.cookies.RequestsCookieJar'>

 <RequestsCookieJar[]>

 <class 'str'>

 https://www.cnblogs.com/liyihua/

 <class 'list'>

 []

The status code is commonly used to check whether a request succeeded

 import requests

 r = requests.get('http://www.baidu.com')

 if r.status_code == requests.codes.ok:

     print('Request Successfully')

 else:

     exit()

 # Output:

 Request Successfully

 # requests.codes.ok is the status code for success, 200

Status codes and their corresponding lookup names
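A few of the names that requests.codes exposes are shown in the sketch below. The helper is_success is a hypothetical convenience function, not part of requests.

```python
import requests

# requests.codes maps readable names to numeric status codes.
print(requests.codes.ok)                     # 200
print(requests.codes.not_found)              # 404
print(requests.codes.internal_server_error)  # 500


def is_success(status_code):
    """Return True when the status code equals 200 (hypothetical helper)."""
    return status_code == requests.codes.ok


print(is_success(200), is_success(404))
```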

Advanced usage

File upload

 import requests

 files = {

     'file': open('favicon.ico', 'rb')

 }

 r = requests.post('http://httpbin.org/post', files=files)

 print(r.text)

 # Output:

 {

   "args": {},

   "data": "",

   "files": {

     "file": "data:application/octetstream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAA...

   },

   "form": {},

   "headers": {

     "Accept": "*/*",

     "Accept-Encoding": "gzip, deflate",

     "Content-Length": "",

     "Content-Type": "multipart/form-data; boundary=c1b665273fc73e67e57ac97e78f49110",

     "Host": "httpbin.org",

     "User-Agent": "python-requests/2.21.0"

   },

   "json": null,

   "origin": "120.85.108.71, 120.85.108.71",

   "url": "https://httpbin.org/post"

 }
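What the files argument does to the request can be inspected locally. The sketch below prepares a multipart upload without sending it; the file name and content are made up, and a (filename, content) tuple is used so that no file on disk is needed.

```python
import requests

# A (filename, content) tuple avoids touching the disk; both values are made up.
files = {'file': ('hello.txt', b'hello world')}
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# The body is encoded as multipart/form-data with a generated boundary.
print(prepared.headers['Content-Type'])
print(b'hello world' in prepared.body)
```

When passing a real file object instead, opening it in a with block makes sure it is closed after the upload.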

Cookies

 import requests

 headers = {

     'Cookie': 'tgw_l7_route=66cb16bc7......ECLNu3tQ',

     'Host': 'www.zhihu.com',

     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'

 }

 r = requests.get('https://www.zhihu.com', headers=headers)

 print(r.text)

 # Output:

 <!doctype html>

 <html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">首页 - 知乎</title><meta name="viewport" ......

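 # This shows that the login succeeded

 # Cookies keep you logged in: first log in to Zhihu, copy the Cookie from the request headers, set it in headers, then send the request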

 from requests.cookies import RequestsCookieJar

 import requests

 cookies = 'tgw_l7_route=66cb16bc7f45da64562a07.......ALNI_MbNds66nlodoTCxp8EVE6ECLNu3tQ'

 jar = RequestsCookieJar()

 headers = {

     'Host': 'www.zhihu.com',

     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'

 }

 for cookie in cookies.split(';'):

     key, value = cookie.strip().split('=', 1)

     jar.set(key, value)

 r = requests.get('https://www.zhihu.com', cookies=jar, headers=headers)

 print(r.text)

 # The output is the same as above

 # Split the copied cookie string with split()

 # Create a RequestsCookieJar object and use set() to store each Cookie's key and value
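The parsing step can be wrapped in a small function and tried offline. In the sketch below, jar_from_cookie_string is a hypothetical helper name and the cookie string is made up; the maxsplit of 1 keeps values that themselves contain '=' intact.

```python
from requests.cookies import RequestsCookieJar


def jar_from_cookie_string(cookie_string):
    """Build a RequestsCookieJar from a 'k1=v1; k2=v2' string (hypothetical helper)."""
    jar = RequestsCookieJar()
    for pair in cookie_string.split(';'):
        pair = pair.strip()              # drop the space that follows each ';'
        if '=' not in pair:
            continue                     # skip empty or malformed fragments
        key, value = pair.split('=', 1)  # split only on the first '='
        jar.set(key, value)
    return jar


jar = jar_from_cookie_string('token=abc=123; theme=light')   # made-up cookie
print(jar.get('token'), jar.get('theme'))
```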

Session persistence

A Session object makes it easy to maintain a session

 import requests

 requests.get('http://httpbin.org/cookies/set/number/123456789')

 r = requests.get('http://httpbin.org/cookies')

 print(r.text)

 # Output:

 {

   "cookies": {}

 }

 import requests

 s = requests.Session()

 s.get('http://httpbin.org/cookies/set/number/123456789')

 r = s.get('http://httpbin.org/cookies')

 print(r.text)

 # Output:

 {

   "cookies": {

     "number": "123456789"

   }

 }
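The difference between the two runs can be observed without a network connection. In the sketch below, the cookie is placed in the session by hand as a stand-in for one set by the server, and the prepared requests are compared.

```python
import requests

url = 'http://httpbin.org/cookies'

s = requests.Session()
s.cookies.set('number', '123456789')   # stand-in for a cookie set by the server

# A request prepared through the session carries the session's cookies.
with_session = s.prepare_request(requests.Request('GET', url))

# A request prepared on its own knows nothing about them.
standalone = requests.Request('GET', url).prepare()

print(with_session.headers.get('Cookie'))
print(standalone.headers.get('Cookie'))
```

Each bare requests.get() call behaves like the standalone case, which is why the first example above returned an empty cookies object.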

SSL certificate verification

 import requests

 r = requests.get('https://www.12306.cn')

 print(r.status_code)

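 # Prints 200 if no error occurs

 # Requesting an HTTPS site whose certificate fails verification raises an SSLError.

 # To avoid the error, the example can be modified slightly: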

 import requests

 import urllib3

 urllib3.disable_warnings()

 r = requests.get('https://www.12306.cn', verify=False)

 print(r.status_code)

Proxy settings

 import requests

 proxies = {

     'http': 'socks5://user:password@10.10.1.10:3128',

     'https': 'socks5://user:password@10.10.1.10:1080'

 }

 requests.get('https://www.taobao.com', proxies=proxies)

 # Uses a SOCKS proxy; SOCKS support is an optional extra (pip install requests[socks])

Timeout settings

 import requests

 r = requests.get('https://taobao.com', timeout=(0.1, 1))

 print(r.status_code)

 # Output: 200

 # timeout=(0.1, 1) means a 0.1 s connect timeout and a 1 s read timeout
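When the timeout is exceeded, requests raises an exception instead of returning a response. The sketch below shows one way to handle that; fetch_status is a hypothetical helper, and the default timeout values are assumptions rather than recommendations from the original text.

```python
import requests


def fetch_status(url, timeout=(3, 10)):
    """Return the status code, or None if the request times out (hypothetical helper)."""
    try:
        # timeout=(connect, read), both in seconds
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None
```

A call such as fetch_status('https://taobao.com', timeout=(0.1, 1)) would then return None rather than crash when the connection is too slow.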

Authentication

 import requests

 from requests.auth import HTTPBasicAuth

 r = requests.get('http://localhost', auth=HTTPBasicAuth('liyihua', 'woshiyihua134'))

 print(r.status_code)

 # Output: 200
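What HTTPBasicAuth adds to the request can be checked offline. The sketch below prepares a request with placeholder credentials and compares the resulting Authorization header with the Base64 form of 'user:pass'; it also shows that a plain (user, password) tuple is treated the same way.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

# The credentials are placeholders, not real accounts.
explicit = requests.Request('GET', 'http://localhost',
                            auth=HTTPBasicAuth('user', 'pass')).prepare()
shorthand = requests.Request('GET', 'http://localhost',
                             auth=('user', 'pass')).prepare()

# Basic auth is "Basic " followed by base64("user:password").
expected = 'Basic ' + base64.b64encode(b'user:pass').decode('ascii')

print(explicit.headers['Authorization'] == expected)
print(shorthand.headers['Authorization'] == expected)
```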

 # OAuth1 authentication can also be used (via the requests_oauthlib package)

 import requests

 from requests_oauthlib import OAuth1

 url = 'https://api.twitter.com/1.1/account/verify_credentials.json'

 auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',

               'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')

 requests.get(url, auth=auth)

Prepared Request

To get a Prepared Request that carries session state, use Session.prepare_request()

 from requests import Request, Session

 url = 'http://httpbin.org/post'

 data = {

     'name': 'LiYihua'

 }           # request data

 header = {

     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36'

 }           # pose as a browser

 s = Session()                       # keeps the session alive

 req = Request('POST', url, data=data, headers=header)

 prepped = s.prepare_request(req)            # prepare_request() turns req into a PreparedRequest object

 r = s.send(prepped)                 # send() sends the request

 print(r.text)

 # Output:

 {

   "args": {},

   "data": "",

   "files": {},

   "form": {

     "name": "LiYihua"

   },

   "headers": {

     "Accept": "*/*",

     "Accept-Encoding": "gzip, deflate",

     "Content-Length": "12",

     "Content-Type": "application/x-www-form-urlencoded",

     "Host": "httpbin.org",

     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36"

   },

   "json": null,

   "origin": "120.85.108.184, 120.85.108.184",

   "url": "https://httpbin.org/post"

 }
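A practical benefit of the prepared form is that the request can be inspected or adjusted before it goes out. The sketch below stops short of send() and looks at what would be transmitted; the extra header name X-Demo is made up for illustration.

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'LiYihua'})
prepped = s.prepare_request(req)

# Nothing has been sent yet; the prepared request can be inspected.
print(prepped.method, prepped.url)
print(prepped.body)
print(prepped.headers['Content-Type'], prepped.headers['Content-Length'])

# Example tweak before sending: add a custom header (the name is made up).
prepped.headers['X-Demo'] = 'inspected'
```

Calling s.send(prepped) afterwards would transmit the request including the added header.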
