A Brief Summary of the Python urllib2 Library

A brief introduction to urllib2
Reference: http://www.voidspace.org.uk/python/articles/urllib2.shtml

Fetching URLs
The simplest way to use urllib2 is as follows:
1、
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

2、
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
3、
# urllib2 uses the same Request interface for all URL schemes, e.g. FTP:
req = urllib2.Request('ftp://example.com/')
4、
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
5、
Data can also be passed in an HTTP GET request by encoding it in the URL itself.

>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
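
Items 4 and 5 differ only in where the encoded data goes. The following is a minimal offline sketch of that difference, not taken from the referenced article: it builds both kinds of Request without sending anything and asks each one which HTTP method it would use. The import fallback is an addition so that the snippet also runs on Python 3, where urllib2 was split into urllib.request and urllib.parse. The URL is the placeholder used above.

```python
try:  # Python 2, as used throughout this article
    from urllib import urlencode
    from urllib2 import Request
except ImportError:  # Python 3 equivalents
    from urllib.parse import urlencode
    from urllib.request import Request

url = 'http://www.example.com/example.cgi'
# A list of pairs keeps the parameter order predictable.
values = [('name', 'Somebody Here'), ('language', 'Python')]
encoded = urlencode(values)

# Data appended to the URL after '?' is sent as a GET request.
get_req = Request(url + '?' + encoded)

# Data passed as the second argument turns the request into a POST.
# (Python 3 expects bytes here, hence the encode call.)
post_req = Request(url, encoded.encode('ascii'))

print(encoded)
print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent until the Request is passed to urlopen, so this kind of check is a cheap way to confirm what a request will look like before going to the network.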
6、
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
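
One detail that is easy to trip over when reading headers back from a Request: the header name is stored in capitalized form, so 'User-Agent' is kept as 'User-agent'. The sketch below is an offline illustration only (nothing is sent), and the import fallback is an addition so that it also runs on Python 3.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = Request('http://www.someserver.com/cgi-bin/register.cgi',
              headers={'User-Agent': user_agent})

# Header names are normalised with str.capitalize() when they are stored.
print(req.has_header('User-agent'))   # stored spelling
print(req.has_header('User-Agent'))   # original spelling is not found
print(req.get_header('User-agent'))
```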

7、Handling Exceptions
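1）URLError
import urllib2

req = urllib2.Request('http://www.jianshu.com/p/5c7a1af4aa531')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.reason
    # code exists only when e is an HTTPError (the server answered with an error status)
    if hasattr(e, 'code'):
        print e, e.code  # the error itself, then its status code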

2）HTTPError
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request; urlopen then raises an HTTPError whose code attribute holds that status code.
3）Error Codes
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
      'Switching to new protocol; obey Upgrade header'),

200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
      'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),

300: ('Multiple Choices',
      'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
      'Document has not changed since given time'),
305: ('Use Proxy',
      'You must use proxy specified in Location to access this '
      'resource.'),
307: ('Temporary Redirect',
      'Object moved temporarily -- see URI list'),

400: ('Bad Request',
      'Bad request syntax or unsupported method'),
401: ('Unauthorized',
      'No permission -- see authorization schemes'),
402: ('Payment Required',
      'No payment -- see charging schemes'),
403: ('Forbidden',
      'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
      'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
      'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
      'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
      'Cannot satisfy request range.'),
417: ('Expectation Failed',
      'Expect condition could not be satisfied.'),

500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
      'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
      'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
      'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
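4）
Example 1:
from urllib2 import Request, urlopen, URLError, HTTPError

# someurl must be set to the URL you want to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass

Note: HTTPError is a subclass of URLError, so its except clause must come first; otherwise the URLError clause would catch HTTP errors as well.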
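
To see why the order of the except clauses matters without touching the network, the sketch below constructs the two exception types by hand and checks which clause catches each. It is an illustration only: the classify helper, the URL and the messages are made up for this example, and the import fallback is an addition so that it also runs on Python 3.

```python
try:  # Python 2
    from urllib2 import HTTPError, URLError
except ImportError:  # Python 3
    from urllib.error import HTTPError, URLError


def classify(exc):
    """Raise exc and report which except clause catches it."""
    try:
        raise exc
    except HTTPError as e:   # the subclass has to be tested first
        return 'HTTPError %d' % e.code
    except URLError as e:
        return 'URLError %s' % e.reason


# Hand-built errors: HTTPError(url, code, msg, hdrs, fp)
http_error = HTTPError('http://www.example.com/missing', 404,
                       'Not Found', None, None)
url_error = URLError('connection refused')

print(issubclass(HTTPError, URLError))
print(classify(http_error))
print(classify(url_error))
```

Because every HTTPError is also a URLError, a URLError clause placed first would match both cases and the HTTPError clause would never run.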
Example 2:
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    # Test for code first: an HTTPError always has a code, and newer
    # Python versions also give it a reason attribute.
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    # everything is fine
    pass

Example 3:
from urllib2 import Request, urlopen

req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    # Test for code first: an HTTPError always has a code, and newer
    # Python versions also give it a reason attribute.
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Note: URLError is a subclass of IOError. In rare cases a socket.error may be raised instead.

8、info and geturl
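geturl:
This returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may not be the same as the URL requested.
info:
This returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Example:
from urllib2 import Request, urlopen

url = 'https://passport.baidu.com/center?_t=1510744860'
req = Request(url)
response = urlopen(req)
print response.info()
print response.geturl()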
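
The same two methods can be tried without a network connection by opening a local file through a file: URL. This is only a sketch under that assumption: a local file never redirects, so geturl simply reports the file URL, and the headers are synthesised by the file handler rather than sent by a server. The temporary file and the import fallback (for Python 3) are additions for this example.

```python
import os
import tempfile

try:  # Python 2
    from urllib import pathname2url
    from urllib2 import urlopen
except ImportError:  # Python 3
    from urllib.request import pathname2url, urlopen

# Create a small throwaway file to fetch.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello')
os.close(fd)

response = urlopen('file:' + pathname2url(path))
try:
    final_url = response.geturl()   # URL that was actually opened
    headers = response.info()       # dictionary-like header object
    content_length = headers['Content-Length']
    body = response.read()
finally:
    response.close()
    os.remove(path)

print(final_url)
print(content_length)
```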

9、Openers and Handlers
Openers:

When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener, via urlopen, but you can create custom openers with build_opener. You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

The following shows how the handlers and the opener are set up when simulating a login through a proxy IP (which requires cookie handling).

import cookielib
import urllib2

# Inside a method of the login class; self.proxy_url holds the proxy address
self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
self.cookie = cookielib.LWPCookieJar()
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
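Handlers:

Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme, or how to handle one aspect of URL opening, for example HTTP redirections or HTTP cookies.

More information on openers and handlers: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers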
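
As a rough illustration of the two use cases mentioned for custom openers (cookie handling and not following redirections), the sketch below builds such an opener and lists the handlers it ends up with. NoRedirectHandler is a hypothetical class written for this example, not something urllib2 provides, and nothing is fetched. The import fallback is an addition so that it also runs on Python 3.

```python
try:  # Python 2
    from cookielib import CookieJar
    from urllib2 import (HTTPCookieProcessor, HTTPRedirectHandler,
                         build_opener)
except ImportError:  # Python 3
    from http.cookiejar import CookieJar
    from urllib.request import (HTTPCookieProcessor, HTTPRedirectHandler,
                                build_opener)


class NoRedirectHandler(HTTPRedirectHandler):
    """Hypothetical handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means the redirect is not followed; the 3xx
        # response is then reported as an HTTPError.
        return None


cookie_jar = CookieJar()
# A subclass of a default handler replaces that default in the opener.
opener = build_opener(HTTPCookieProcessor(cookie_jar), NoRedirectHandler)

handler_names = [h.__class__.__name__ for h in opener.handlers]
print(handler_names)
```

The opener is then used through opener.open(url) in place of urlopen(url).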

10、Proxies
Creating an opener that goes through a proxy IP

Note (from the referenced article): Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
(http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies)

Example:
import urllib2

# make sure the proxy IP is actually usable
proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})
opener = urllib2.build_opener(proxy_handler)
# url, post_data and login_headers must be defined beforehand;
# this example also submits POST data and header information
request = urllib2.Request(url, post_data, login_headers)
response = opener.open(request)
print response.read()  # read() already returns a byte string
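
Since the example above needs a live proxy, here is a small offline sketch that only checks how the opener is wired up. The proxy address is a hypothetical placeholder, and the import fallback is an addition so that the snippet also runs on Python 3.

```python
try:  # Python 2
    from urllib2 import ProxyHandler, build_opener
except ImportError:  # Python 3
    from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy address; substitute one that is actually reachable.
proxy_handler = ProxyHandler({'http': 'http://proxy.example.com:3128'})
opener = build_opener(proxy_handler)

# The explicit handler replaces the default ProxyHandler, which would
# otherwise read proxy settings from the environment (e.g. http_proxy).
proxy_handlers = [h for h in opener.handlers if isinstance(h, ProxyHandler)]
print(len(proxy_handlers))
print(proxy_handlers[0].proxies)
```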

11、Sockets and Layers
Example:
import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
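
The effect of socket.setdefaulttimeout can be observed without making a request, because every socket created afterwards inherits the value. The sketch below checks that on a throwaway socket and then restores the previous setting, since the default is global to the whole process. It is an illustration only and uses nothing beyond the socket module.

```python
import socket

previous = socket.getdefaulttimeout()   # None unless changed earlier
socket.setdefaulttimeout(10)
try:
    # Sockets created from now on inherit the default timeout, which is
    # how the setting reaches the connections opened by urllib2.
    sock = socket.socket()
    inherited = sock.gettimeout()
    sock.close()
finally:
    socket.setdefaulttimeout(previous)  # restore the global setting

print(inherited)
```

From Python 2.6 on, urlopen also accepts a per-call timeout, for example urllib2.urlopen(req, timeout=10), which avoids changing the global default.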

12、Cookie
urllib2 can also handle cookies automatically once an HTTPCookieProcessor is installed in the opener. To get the value of a particular cookie, you can do the following:
Example:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

Running this prints the cookies set when visiting Baidu:
Name = BAIDUID
Value = C664216C4F7BD6B98DB0B300292E0A23:FG=1
Name = BIDUPSID
Value = C664216C4F7BD6B98DB0B300292E0A23
Name = H_PS_PSSID
Value = 1464_21099_17001_24879_22159
Name = PSTM
Value = 1510747061
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
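
To pick out one cookie by name rather than printing all of them, the jar can be turned into a name-to-value dictionary. The sketch below does this offline with a hand-made cookie; the name session_id, its value and the domain are hypothetical, and the import fallback is an addition so that it also runs on Python 3. With a real response the jar would be filled by HTTPCookieProcessor as in the example above.

```python
try:  # Python 2
    from cookielib import Cookie, CookieJar
except ImportError:  # Python 3
    from http.cookiejar import Cookie, CookieJar

jar = CookieJar()
# Hand-made cookie standing in for one set by a server.
jar.set_cookie(Cookie(
    version=0, name='session_id', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

# Iterating over the jar yields Cookie objects, as in the loop above.
values = dict((c.name, c.value) for c in jar)
print(values.get('session_id'))
```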
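13、Dealing with "anti-hotlinking"
Some sites use so-called anti-hotlinking protection. It is actually quite simple:
the site checks whether the Referer in the headers of your request is the site itself.
So we only need to set the Referer in the headers to that site. Taking baidu as an example:
#...
headers = {
    'Referer': 'http://www.baidu.com/'
}
#...
headers is a dict, so you can put in any header you like to disguise the request.
For example, some sites read X-Forwarded-For from the headers to find the client's real IP; you can simply set X-Forwarded-For yourself.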
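
Headers can also be attached to an existing Request with add_header instead of passing a dict to the constructor. The following offline sketch sets both headers discussed here and prints what would be sent. The IP address is a placeholder from a documentation-only range, and the import fallback is an addition so that the snippet also runs on Python 3.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

req = Request('http://www.baidu.com/')
req.add_header('Referer', 'http://www.baidu.com/')
# Placeholder address from a documentation-only range.
req.add_header('X-Forwarded-For', '203.0.113.7')

# header_items() lists the (name, value) pairs attached to the request.
sent_headers = dict(req.header_items())
print(sent_headers)
```

Whether a site trusts these values is up to the site; an intermediate proxy may also overwrite or append to X-Forwarded-For.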
