Life Is Short: Python's urllib, urllib2, and requests

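In Python, URL requests involve three modules: urllib, urllib2, and requests. urllib and urllib2 are HTTP libraries that ship with Python's standard library (urllib2 exists only in Python 2; in Python 3 its functionality moved into urllib.request and urllib.error), while requests is a third-party library that has to be installed separately. As a purpose-built third-party library, requests is probably the most convenient of the three to use.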

urllib and urllib2

Both urllib and urllib2 deal with URL requests, but they offer different functionality. The usual way to make a request with urllib2:

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

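urlopen accepts either a URL string or a Request object. Passing a Request lets you set the request headers, so the request can pose as a browser (useful when the target site inspects incoming requests). urllib's urlopen only accepts a URL, which is one of the differences between the two: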

url = 'http://www.baidu.com'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
# Send a browser-like User-Agent instead of urllib2's default one
request = urllib2.Request(url, headers={
    'User-Agent': user_agent
})
response = urllib2.urlopen(request)

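However, some functions in urllib were never carried over into urllib2, so urllib is still needed as a helper at times. I have not dug into all of these yet and will look closer when I run into them. One example is URL encoding: urllib.urlencode, which turns a set of parameters into a query string or form body, has no counterpart in urllib2. This is another difference between the two.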
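To make the URL-encoding case concrete, here is a minimal sketch rather than anything from a real project. The /s path and the parameter names wd and page are made up for illustration, and the import fallback is only there so the snippet also runs on Python 3, where the same helpers live in urllib.parse.

```python
try:
    from urllib import urlencode, quote          # Python 2: provided by urllib, not urllib2
except ImportError:
    from urllib.parse import urlencode, quote    # Python 3 location of the same helpers

# Hypothetical search parameters; a list of pairs keeps the order predictable.
params = [('wd', 'python urllib'), ('page', 1)]
query = urlencode(params)                        # spaces become '+', pairs are joined with '&'
url = 'http://www.baidu.com/s?' + query          # '/s' is an assumed path, for illustration

# quote() escapes a single value or path piece; '/' is left alone by default.
escaped = quote('a b/c')

print(url)
print(escaped)
```

The resulting url string can then be handed to urllib2.urlopen or wrapped in a urllib2.Request.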

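urlopen in urllib2 also takes two other commonly used parameters, data and timeout. timeout is the limit, in seconds, for blocking operations such as the connection attempt. data has the same meaning as in the Request class, described next.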

The Request class takes five parameters: url, data, headers, origin_req_host, and unverifiable.

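url needs no explanation: it is the address being requested.
data is extra data to submit to the server, or None if there is none. A request that carries data is sent as a POST. The data must be encoded in the standard application/x-www-form-urlencoded format (urllib.urlencode does this) before it is passed to the Request object.
headers is a dict of request headers. It tells the server about the request: browser information, operating system, cookies, accepted response formats, caching, whether compression is supported, and so on. Sites with anti-crawler measures inspect these, so we need to pose as a browser rather than send a bare request, as with the User-Agent in the code above.
origin_req_host is the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.
unverifiable indicates whether the request is unverifiable, also as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user had no option to approve, for example an image that is fetched automatically while a page loads.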
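The effect of the data parameter can be checked without sending anything over the network. The sketch below is my own illustration: the form field and the shortened User-Agent value are assumptions, and the import fallback exists only so the same code also runs on Python 3, where urllib.request provides the same Request class.

```python
try:
    import urllib2                                  # Python 2, as used in this article
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2                # Python 3 equivalent
    from urllib.parse import urlencode

# Hypothetical form field, for illustration only.
url = 'http://www.baidu.com'
form = urlencode([('data', 'value')])
if not isinstance(form, bytes):
    form = form.encode('ascii')                     # Python 3 expects bytes for data

get_request = urllib2.Request(url)
post_request = urllib2.Request(url, data=form,
                               headers={'User-Agent': 'Mozilla/4.0'})

print(get_request.get_method())                     # no data, so GET
print(post_request.get_method())                    # data present, so POST
print(post_request.get_header('User-agent'))        # header names are stored capitalized
```

Nothing is sent until one of these objects is passed to urllib2.urlopen.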

Not every request comes back with a successful page, so failures caused by a bad URL or an error response need to be caught and handled:

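import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com')
except urllib2.HTTPError as e:
    # The server answered, but with an error status.
    # HTTPError is a subclass of URLError, so it must be caught first.
    print(e.code)
    print(e.reason)
except urllib2.URLError as e:
    # The server could not be reached at all
    print(e.reason)
else:
    html = response.read()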
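To try this pattern without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and requests a good path, a missing path, and finally the same port after the server has been closed. The handler, the /ok path, and the fetch helper are all made up for this illustration. The timeout argument is the one urlopen accepts, and the import fallback only exists so the code also runs on Python 3.

```python
import threading

try:
    import urllib2                                              # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:
    import urllib.request as urllib2                            # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: 200 for /ok, 404 for everything else."""

    def do_GET(self):
        if self.path == '/ok':
            body = b'hello'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, timeout=5):
    """Return (status_code, body); each part is None when it is unavailable."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
    except urllib2.HTTPError as e:
        return e.code, None          # server replied with an error status
    except urllib2.URLError:
        return None, None            # server could not be reached
    else:
        return response.getcode(), response.read()


server = HTTPServer(('127.0.0.1', 0), DemoHandler)   # port 0: the OS picks a free port
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

ok_result = fetch(base + '/ok')
missing_result = fetch(base + '/missing')

server.shutdown()
server.server_close()
unreachable_result = fetch(base + '/ok')             # nothing is listening any more

print(ok_result, missing_result, unreachable_result)
```

Returning a tuple from fetch is just one way to fold the three outcomes together. Real code may prefer to log e.reason or re-raise.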

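When an error raises an exception, we can catch it to inspect the cause and read the status code. For a response that was returned normally, the getcode() method gives the status code as well. Appendix, the table of status codes and their messages: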

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),
    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),
    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),
    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),
    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }
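There is no need to copy this table into your own code. The standard library's BaseHTTPRequestHandler.responses has the same {code: (shortmessage, longmessage)} shape, so a status code can be turned into a readable message with a small helper. The describe function below is a made-up name for this sketch, and the exact wording of the long messages can differ between Python versions.

```python
try:
    from http.server import BaseHTTPRequestHandler       # Python 3
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler    # Python 2


def describe(code):
    """Return 'code short-message', or 'code Unknown' for codes not in the table."""
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return '%d %s' % (code, short)


print(describe(200))
print(describe(404))
print(describe(799))   # not a registered status code
```

The code passed in can be e.code from a caught HTTPError or the value returned by getcode().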

Requests

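Requests is built on top of urllib3 and covers what urllib2 can do through a simpler API. It supports HTTP keep-alive and connection pooling, cookie-based sessions, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data.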

GET request

import requests

response = requests.get('http://www.baidu.com')
print(response.text)

POST request

response = requests.post('http://api.baidu.com', data={
    'data': 'value'
})

Custom headers

response = requests.get('http://www.baidu.com', headers={
    'User-Agent': user_agent
})
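To see what Requests would actually send for calls like the ones above, a request can be built and prepared without going over the network. This is only a sketch: it reuses the URL, form field, and User-Agent string from this article, and stopping at prepare() is my choice for the illustration, not something the examples above require.

```python
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Build the POST with a custom header, but stop before sending it.
request = requests.Request(
    'POST',
    'http://api.baidu.com',
    data={'data': 'value'},
    headers={'User-Agent': user_agent},
)
prepared = request.prepare()

print(prepared.method)
print(prepared.headers['User-Agent'])
print(prepared.headers['Content-Type'])   # set automatically for form data
print(prepared.body)                      # the form-encoded payload
```

A prepared request can later be sent with requests.Session().send(prepared).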

Working with the returned response:

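r.status_code # HTTP status code of the response
r.raw # the raw response, i.e. the underlying urllib3 response object; read it with r.raw.read() (the request must be made with stream=True)
r.content # response body as bytes; gzip and deflate compression are decoded automatically
r.text # response body as a string, decoded using the charset given in the response headers
r.headers # server response headers in a dict-like object whose keys are case-insensitive; r.headers.get(key) returns None if the key is missing
Special methods:
r.json() # Requests' built-in JSON decoder
r.raise_for_status() # raises an exception for a failed request (4xx or 5xx status)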
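The header lookup and raise_for_status() behavior can be checked offline by filling in a Response object by hand. That is done here purely for illustration (real code gets the object back from requests.get or requests.post), and the header value and status code are made-up inputs.

```python
import requests

# Hand-built Response, for illustration only.
r = requests.Response()
r.status_code = 404
r.headers['Content-Type'] = 'text/html; charset=utf-8'

print(r.headers['content-type'])     # the lookup ignores case
print(r.headers.get('X-Missing'))    # .get() gives None for a missing header
print(r.ok)                          # False for 4xx and 5xx responses

try:
    r.raise_for_status()
except requests.HTTPError as e:
    error_message = str(e)
    print(error_message)
```

For a 2xx or 3xx status, raise_for_status() returns without raising, so it can be called right after every request.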

----------------To be continued---------------
