1. The urllib module

1.urllib.urlopen(url[,data[,proxies]])

Opens a URL and returns a file-like object, which then supports the usual file-style operations. This example tries to open Google.

import urllib

f = urllib.urlopen('http://www.google.com.hk/')

firstLine = f.readline()   # read the first line of the HTML page

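The object returned by urlopen provides the following methods:

- read([bytes]): reads all bytes, or at most bytes bytes

- readline(): reads one line

- readlines(): reads all lines

- fileno(): returns the file descriptor

- close(): closes the URL connection

- info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server

- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the address was not found

- geturl(): returns the URL that was actually retrieved, which can differ from the requested one after a redirect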
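A minimal sketch of the file-like interface listed above. To stay runnable without network access it opens a temporary local file through a file: URL instead of a web page; the temporary file and its contents are assumptions made purely for illustration. The import fallback is also an assumption: it lets the same lines run on Python 3, where urlopen lives in urllib.request.

```python
import os
import tempfile

try:                                    # Python 2, as used in this article
    from urllib import urlopen, pathname2url
except ImportError:                     # Python 3 moved both names
    from urllib.request import urlopen, pathname2url

# Create a small local "page" so the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'w') as fh:
    fh.write('<html>\n<body>hello</body>\n</html>\n')

f = urlopen('file:' + pathname2url(path))
first_line = f.readline()               # first line only
rest = f.readlines()                    # every remaining line, as a list
final_url = f.geturl()                  # URL of the resource actually opened
headers = f.info()                      # header object (Content-type, ...)
f.close()
os.remove(path)

print(first_line)
print(len(rest))
print(final_url)
```

Against a real HTTP URL the same calls apply, and getcode() additionally reports the status code.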

2.urllib.urlretrieve(url[,filename[,reporthook[,data]]])

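urlretrieve downloads the HTML file that the URL points to onto your local disk. If filename is not given, the data is stored in a temporary file.

urlretrieve() returns a 2-tuple (filename, mime_hdrs)

Storing to a temporary file: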

filename = urllib.urlretrieve('http://www.google.com.hk/')

type(filename)

<type 'tuple'>

print filename[0]

print filename[1]

Output:

'/tmp/tmp8eVLjq'

<httplib.HTTPMessage instance at 0xb6a363ec>

Saving to a local file:

filename = urllib.urlretrieve('http://www.baidu.com/',filename='/home/dzhwen/python文件/Homework/urllib/google.html')

print type(filename)

print filename[0]

print filename[1]

Output:

<type 'tuple'>

'/home/dzhwen/python\xe6\x96\x87\xe4\xbb\xb6/Homework/urllib/google.html'

<httplib.HTTPMessage instance at 0xb6e2c38c>

The reporthook parameter is used as follows:

def process(blk,blk_size,total_size):

	print('%d/%d - %.02f%%' % (blk * blk_size, total_size, float(blk * blk_size) / total_size * 100))

def download():

	filename,fileinfo = urllib.urlretrieve('http://cnblogs.com','index.html',reporthook=process)

Output:

0/46164 - 0.00%

8192/46164 - 17.75%

16384/46164 - 35.49%

24576/46164 - 53.24%

32768/46164 - 70.98%

40960/46164 - 88.73%

49152/46164 - 106.47%

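blk * blk_size can exceed total_size, because the last block is usually only partly filled. The function above can therefore be rewritten as: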

def process(blk,blk_size,total_size):

	if total_size == -1:

		print "can't determine the file size, now retrieved", blk * blk_size

	else:

		percentage = int((blk * blk_size * 100.0) / total_size)

		if percentage >= 100:

			print('%d/%d - %d%%' % (total_size, total_size, 100))

		else:

			print('%d/%d - %d%%' % (blk * blk_size, total_size, percentage))

Output after running:

0/46238 - 0%

8192/46238 - 17%

16384/46238 - 35%

24576/46238 - 53%

32768/46238 - 70%

40960/46238 - 88%

46238/46238 - 100%
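The clamping logic can also be separated from the printing, which makes it easy to check without downloading anything. The sketch below is one possible variant, not the author's code: progress and report are hypothetical names, and the loop simply replays the calls urlretrieve would make for the 46238-byte page shown above.

```python
def progress(blk, blk_size, total_size):
    """Return (bytes_done, percent) for one reporthook call.

    percent is None when the server did not report a size.
    """
    if total_size <= 0:
        return blk * blk_size, None
    done = min(blk * blk_size, total_size)   # the last block is usually partial
    return done, done * 100 // total_size


def report(blk, blk_size, total_size):
    """reporthook-compatible wrapper that prints one line per block."""
    done, percent = progress(blk, blk_size, total_size)
    if percent is None:
        print('%d bytes retrieved, total size unknown' % done)
    else:
        print('%d/%d - %d%%' % (done, total_size, percent))


# Replay the seven calls made for a 46238-byte download in 8192-byte blocks.
for blk in range(7):
    report(blk, 8192, 46238)
```

report can be passed directly as reporthook=report to urllib.urlretrieve.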

3.urllib.urlcleanup()

Clears the cache left behind by urllib.urlretrieve()

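4.urllib.quote(url) and urllib.quote_plus(url)

Both functions percent-encode a string so that it can be used inside a URL, that is, so that it is printable and accepted by web servers. quote leaves / unescaped by default, while quote_plus escapes it as well and turns spaces into +.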

urllib.quote('http://www.baidu.com')

Result:

'http%3A//www.baidu.com'

urllib.quote_plus('http://www.baidu.com')

Result:

'http%3A%2F%2Fwww.baidu.com'
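The difference between the two functions is easier to see on a string that contains a space and a slash. A small sketch; the sample string is made up, and the import fallback is an assumption that lets the snippet run on Python 3 as well, where these functions live in urllib.parse:

```python
try:                                    # Python 2
    from urllib import quote, quote_plus
except ImportError:                     # Python 3
    from urllib.parse import quote, quote_plus

text = 'a b/c?x=1&y=2'                  # hypothetical sample string

plain = quote(text)                     # '/' is kept, space becomes %20
plus = quote_plus(text)                 # '/' is escaped, space becomes '+'
strict = quote(text, safe='')           # no character is treated as safe

print(plain)    # a%20b/c%3Fx%3D1%26y%3D2
print(plus)     # a+b%2Fc%3Fx%3D1%26y%3D2
print(strict)   # a%20b%2Fc%3Fx%3D1%26y%3D2
```

quote is the usual choice for path segments, while quote_plus matches the way HTML forms encode query values.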

5.urllib.unquote(url) and urllib.unquote_plus(url)

The inverse of the functions in item 4.
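A short sketch of the decoding direction, reusing the encoded strings from item 4. The third sample string is made up, and the import fallback for Python 3 is an assumption:

```python
try:                                    # Python 2
    from urllib import unquote, unquote_plus
except ImportError:                     # Python 3
    from urllib.parse import unquote, unquote_plus

a = unquote('http%3A//www.baidu.com')
b = unquote_plus('http%3A%2F%2Fwww.baidu.com')

# Only unquote_plus turns '+' back into a space.
c = unquote('a+b%20c')
d = unquote_plus('a+b%20c')

print(a)   # http://www.baidu.com
print(b)   # http://www.baidu.com
print(c)   # a+b c
print(d)   # a b c
```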

6.urllib.urlencode(query)

Encodes key/value pairs as a URL query string, joining the pairs with &.

It can be combined with urlopen to issue GET and POST requests:

GET:

import urllib

params=urllib.urlencode({'spam':1,'eggs':2,'bacon':0})

f=urllib.urlopen("http://python.org/query?%s" % params)

print f.read()

POST:

import urllib

params = urllib.urlencode({'spam':1,'eggs':2,'bacon':0})

f=urllib.urlopen("http://python.org/query",params)

f.read()
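urlencode itself can be tried without sending a request. The sketch below is illustrative only: the tag and q keys are made up, and the import fallback for Python 3 is an assumption. Passing a list of pairs instead of a dict keeps the parameter order predictable.

```python
try:                                    # Python 2
    from urllib import urlencode
except ImportError:                     # Python 3
    from urllib.parse import urlencode

# A list of pairs keeps the order; values are converted with str().
query = urlencode([('spam', 1), ('eggs', 2), ('bacon', 0)])

# Special characters in values are escaped the quote_plus way.
escaped = urlencode({'q': 'a b&c'})

# doseq=True expands a sequence value into repeated keys.
multi = urlencode([('tag', ['x', 'y'])], doseq=True)

get_url = 'http://python.org/query?' + query   # query string for a GET
post_body = query.encode('ascii')              # request body for a POST

print(query)     # spam=1&eggs=2&bacon=0
print(escaped)   # q=a+b%26c
print(multi)     # tag=x&tag=y
print(get_url)
```

On Python 3 the POST body has to be bytes, which is why the last step encodes the string; on Python 2 the encode call is harmless.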

2. The urlparse module

1.urlparse

Purpose: splits a URL into its components

import urlparse

def parse_html():

	url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'

	result = urlparse.urlparse(url)

	# params = urlparse.parse_qs(result.query)

	print result

	# print params

Output:

ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980', fragment='')

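As shown above, the return value is a ParseResult object that holds the scheme, the host (netloc), the path, the params, the query and the fragment.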
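The fields of the result can be read by name or by position. A brief sketch on a shorter URL; the port, the ;type=a params part and the #top fragment are made up so that every field is non-empty, and the import fallback for Python 3 is an assumption:

```python
try:                                    # Python 2
    import urlparse
except ImportError:                     # Python 3
    import urllib.parse as urlparse

url = 'https://www.baidu.com:443/s;type=a?wd=python&ie=utf-8#top'
result = urlparse.urlparse(url)

print(result.scheme)     # https
print(result.netloc)     # www.baidu.com:443
print(result.path)       # /s
print(result.params)     # type=a
print(result.query)      # wd=python&ie=utf-8
print(result.fragment)   # top

# hostname and port are derived from netloc.
print(result.hostname)   # www.baidu.com
print(result.port)       # 443

# urlunparse puts the six parts back together.
rebuilt = urlparse.urlunparse(result)
print(rebuilt == url)    # True
```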

2.parse_qs

import urllib

import urlparse

def parse_html():

	url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'

	result = urlparse.urlparse(url)

	params = urlparse.parse_qs(result.query)

	# print result

	print params

if __name__ == '__main__':

	# demo()

	# demo2()

	parse_html()

Output:

{'wd': ['python'], 'rsv_spt': ['1'], 'rsv_iqid': ['0xad2dc5550032146a'], 'inputT': ['22'], 'f': ['8'], 'rsv_enter': ['1'], 'rsv_bp': ['0'], 'rsv_idx': ['2'], 'tn': ['baiduhome_pg'], 'rsv_sug4': ['4980'], 'rsv_sug7': ['100'], 'rsv_sug1': ['5'], 'issp': ['1'], 'rsv_sug3': ['7'], 'rsv_sug2': ['0'], 'ie': ['utf-8']}
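Every value in the dictionary above is a list because a key may occur more than once in a query string. The following sketch illustrates that on a made-up query and also shows parse_qsl, which keeps the original order; the import fallback for Python 3 is an assumption:

```python
try:                                    # Python 2
    import urlparse
except ImportError:                     # Python 3
    import urllib.parse as urlparse

query = 'wd=python&ie=utf-8&wd=urllib&empty='   # hypothetical query string

as_dict = urlparse.parse_qs(query)       # repeated keys are collected in a list
as_pairs = urlparse.parse_qsl(query)     # ordered list of (key, value) pairs
with_blank = urlparse.parse_qs(query, keep_blank_values=True)

# Keep only the first value of each key when duplicates do not matter.
first = dict((key, values[0]) for key, values in as_dict.items())

print(as_dict['wd'])          # ['python', 'urllib']
print(as_pairs)
print('empty' in as_dict)     # False: blank values are dropped by default
print(with_blank['empty'])    # ['']
print(first['wd'])            # python
```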

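3. The urllib2 module

urllib2 offers more powerful features, such as cookie management, but it cannot fully replace urllib, because urllib.urlencode has no counterpart in urllib2.

3.1 urllib2.urlopen()

Purpose: opens a URL

Parameters: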

url
data = None
timeout = <object>

import urllib

import urllib2

def demo():

	url = 'http://www.cnblogs.com/hester/sllsl'

	try:

		s = urllib2.urlopen(url,timeout = 3)

	except urllib2.HTTPError as e:

		print e

	else:

		print s.read(100)

if __name__ == '__main__':

	demo()

Output:

<!DOCTYPE html>

<html lang="zh-cn">

<head>

<meta charset="utf-8"/>

<title>”温故而知新“

If the url is changed to an address that does not exist:

url = 'http://www.cnblogs.com/hester/asdfas'

Output:

HTTP Error 404: Not Found
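HTTPError only covers responses that carry an HTTP status code. Failures such as an unreachable host or a missing local file raise its parent class URLError instead, so the HTTPError clause has to come first. The sketch below is a hedged illustration, not the author's code: fetch is a hypothetical helper, the file: path is deliberately nonexistent so that the example runs offline, and the import fallback for Python 3 is an assumption.

```python
try:                                    # Python 2
    import urllib2 as request
except ImportError:                     # Python 3
    import urllib.request as request


def fetch(url, timeout=3):
    """Return (kind, detail); kind is 'ok', 'http-error' or 'url-error'."""
    try:
        s = request.urlopen(url, timeout=timeout)
    except request.HTTPError as e:      # must come first: subclass of URLError
        return 'http-error', e.code
    except request.URLError as e:       # no usable response at all
        return 'url-error', e.reason
    try:
        return 'ok', s.read(100)
    finally:
        s.close()


# A path that does not exist triggers the URLError branch without any network.
kind, detail = fetch('file:/no/such/file/for/this/demo')
print(kind)

print(issubclass(request.HTTPError, request.URLError))   # True
```

With the 404 address from the example above, the same helper would take the 'http-error' branch.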

3.2 urllib2.Request()

Purpose: adds or modifies HTTP headers

Parameters:

url
data
headers

import urllib

import urllib2

def demo():

	url = 'http://www.cnblogs.com/hester'

	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}

	req = urllib2.Request(url,headers=headers)

	s = urllib2.urlopen(req)

	print s.read(100)

	print req.headers

	s.close()

if __name__ == '__main__':

	demo()

Output:

<!DOCTYPE html>

<html lang="zh-cn">

<head>

<meta charset="utf-8"/>

<title>”温故而知新“

{'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
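The printed dictionary shows that Request stores header names in capitalized form ('User-agent', 'X-my-hester'), which matters when reading them back. A Request object can be inspected without sending it, as this sketch shows; the a=1 body is a made-up placeholder and the import fallback for Python 3 is an assumption:

```python
try:                                    # Python 2
    import urllib2 as request
except ImportError:                     # Python 3
    import urllib.request as request

url = 'http://www.cnblogs.com/hester'
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
req.add_header('x-my-hester', 'my value')   # headers can also be added later

# Names are stored via str.capitalize(), so lookups need that spelling.
names = sorted(name for name, value in req.header_items())
ua = req.get_header('User-agent')
miss = req.get_header('User-Agent')

# Supplying data switches the request method from GET to POST.
method_get = req.get_method()
method_post = request.Request(url, data=b'a=1').get_method()

print(names)         # ['User-agent', 'X-my-hester']
print(ua)            # Mozilla/5.0
print(miss)          # None
print(method_get)    # GET
print(method_post)   # POST
```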

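3.3 urllib2.build_opener()

Purpose: creates an opener

Parameters:

A list of handlers. build_opener chains the following default handlers unless a handler passed in replaces one of them:

ProxyHandler
UnknownHandler
HTTPHandler
HTTPDefaultErrorHandler
HTTPRedirectHandler
FTPHandler
FileHandler
HTTPErrorProcessor
HTTPSHandler

Return value: an OpenerDirector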

import urllib

import urllib2

def request_post_debug():

	data = {'username':'hester_ge','password':'xxxxxxx'}

	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}

	req = urllib2.Request('http://www.cnblogs.com/hester',data = urllib.urlencode(data),headers=headers)

	opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))

	s = opener.open(req)

	print s.read(100)

	s.close()

if __name__ == '__main__':

	request_post_debug()

Output:

send: 'POST /hester HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 35\r\nHost: www.cnblogs.com\r\nX-My-Hester: my value\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\nusername=hester_ge&password=xxxxxxx'

reply: 'HTTP/1.1 200 OK\r\n'

header: Date: Sun, 03 Jul 2016 08:28:37 GMT

header: Content-Type: text/html; charset=utf-8

header: Content-Length: 14096

header: Connection: close

header: Vary: Accept-Encoding

header: Cache-Control: private, max-age=10

header: Expires: Sun, 03 Jul 2016 08:28:45 GMT

header: Last-Modified: Sun, 03 Jul 2016 08:28:35 GMT

header: X-UA-Compatible: IE=10

<!DOCTYPE html>

<html lang="zh-cn">

<head>

<meta charset="utf-8"/>

<title>”温故而知新“
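Passing a handler to build_opener does not discard the defaults; it only replaces the default of the same class. The sketch below inspects the resulting chain without opening any URL. It relies on the handlers and addheaders attributes of OpenerDirector, and the import fallback for Python 3 is an assumption:

```python
try:                                    # Python 2
    import urllib2 as request
except ImportError:                     # Python 3
    import urllib.request as request

opener = request.build_opener(request.HTTPHandler(debuglevel=1))

# The custom HTTPHandler replaces the default one; the other defaults stay.
names = [h.__class__.__name__ for h in opener.handlers]
print(names)
print(names.count('HTTPHandler'))       # 1

# Headers sent with every request made through this opener.
print(opener.addheaders)
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
```

Setting addheaders is an alternative to passing headers to every Request when all requests through the opener should carry the same values.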

3.4 urllib2.install_opener

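Purpose: installs the created opener as the global default, so that later calls to urllib2.urlopen() go through it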

import urllib

import urllib2

def demo():

	url = 'http://www.cnblogs.com/hester'

	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}

	req = urllib2.Request(url,headers=headers)

	s = urllib2.urlopen(req)

	print s.read(100)

	print req.headers

	s.close()

# def request_post_debug():

# 	data = {'username':'hester_ge','password':'xxxxxxx'}

# 	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}

# 	req = urllib2.Request('http://www.cnblogs.com/hester',data = urllib.urlencode(data),headers=headers)

# 	opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))

# 	s = opener.open(req)

# 	print s.read(100)

# 	s.close()

def install_opener():

	opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1),

								  urllib2.HTTPSHandler(debuglevel=1))

	urllib2.install_opener(opener)

if __name__ == '__main__':

	# request_post_debug()

	demo()

Output:

<!DOCTYPE html>

<html lang="zh-cn">

<head>

<meta charset="utf-8"/>

<title>”温故而知新“

{'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}

Now change the code above to:

if __name__ == '__main__':

	# request_post_debug()

	install_opener()

	demo()

Output:

send: 'GET /hester HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.cnblogs.com\r\nConnection: close\r\nX-My-Hester: my value\r\nUser-Agent: Mozilla/5.0\r\n\r\n'

reply: 'HTTP/1.1 200 OK\r\n'

header: Date: Sun, 03 Jul 2016 08:39:31 GMT

header: Content-Type: text/html; charset=utf-8

header: Content-Length: 14096

header: Connection: close

header: Vary: Accept-Encoding

header: Cache-Control: private, max-age=10

header: Expires: Sun, 03 Jul 2016 08:39:41 GMT

header: Last-Modified: Sun, 03 Jul 2016 08:39:31 GMT

header: X-UA-Compatible: IE=10

<!DOCTYPE html>

<html lang="zh-cn">

<head>

<meta charset="utf-8"/>

<title>”温故而知新“

{'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
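The effect of install_opener can also be observed offline. The sketch below is only an illustration under stated assumptions: DemoHandler and the demo: scheme are invented for this example and are not part of urllib2. It relies on the handler naming convention that a method called <scheme>_open handles that scheme, and the import fallback for Python 3 is likewise an assumption.

```python
import io

try:                                    # Python 2
    import urllib2 as request
    from urllib import addinfourl
except ImportError:                     # Python 3
    import urllib.request as request
    from urllib.response import addinfourl


class DemoHandler(request.BaseHandler):
    """Hypothetical handler for a made-up 'demo:' scheme (illustration only)."""

    def demo_open(self, req):
        body = io.BytesIO(b'handled by DemoHandler')
        return addinfourl(body, {}, req.get_full_url())


# Before install_opener() the default opener does not know the scheme.
try:
    request.urlopen('demo:hello')
    handled_before = True
except request.URLError:
    handled_before = False

# After install_opener() plain urlopen() calls use the custom opener.
request.install_opener(request.build_opener(DemoHandler()))
reply = request.urlopen('demo:hello').read()

print(handled_before)   # False
print(reply)

# Put a default opener back so later code is unaffected.
request.install_opener(request.build_opener())
```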

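4. The cookies module (cookielib)

Because HTTP is a stateless protocol, the server cannot tell whether two requests come from the same computer, so cookies are used to identify the client.

The client browser first sends a request to the server. The server parses the request and sends a response back to the client. The Set-Cookie header is part of that response, and the browser stores the cookie it carries.

Two classes are used here:

cookielib.CookieJar provides the interface for parsing and storing cookies

urllib2.HTTPCookieProcessor handles cookies automatically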

#encoding=utf8

import urllib2

import cookielib

def handler_cookie():

	cookiejar = cookielib.CookieJar()

	handler = urllib2.HTTPCookieProcessor(cookiejar=cookiejar)

	opener = urllib2.build_opener(handler,urllib2.HTTPHandler(debuglevel=1))

	s = opener.open('http://www.douban.com/')

	print s.read(100)

	s.close()

	print '=' * 20

	print cookiejar._cookies

	print '=' * 20

	# the second request automatically carries the cookies

	s2 = opener.open('http://www.douban.com/')

	print s2.read(100)

	s2.close()

if __name__ == '__main__':

	handler_cookie()

Output:

/usr/bin/python2.7 /home/hester/PycharmProjects/untitled/demo4.py

send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.douban.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'

reply: 'HTTP/1.1 301 Moved Permanently\r\n'

header: Date: Sun, 03 Jul 2016 10:01:41 GMT

header: Content-Type: text/html

header: Content-Length: 178

header: Connection: close

header: Location: https://www.douban.com/

header: Server: dae

<!DOCTYPE HTML>

<html lang="zh-cms-Hans" class="">

<head>

<meta charset="UTF-8">

<meta name="descrip

====================

{'.douban.com': {'/': {'ll': Cookie(version=0, name='ll', value='"118163"', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1499076101, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), 'bid': Cookie(version=0, name='bid', value='dDz4rCqWvcQ', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1499076101, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}

====================

send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.douban.com\r\nCookie: ll="118163"; bid=dDz4rCqWvcQ\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'

reply: 'HTTP/1.1 301 Moved Permanently\r\n'

header: Date: Sun, 03 Jul 2016 10:01:42 GMT

header: Content-Type: text/html

header: Content-Length: 178

header: Connection: close

header: Location: https://www.douban.com/

header: Server: dae

<!DOCTYPE HTML>

<html lang="zh-cms-Hans" class="">

<head>

<meta charset="UTF-8">

<meta name="descrip

Process finished with exit code 0
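What HTTPCookieProcessor does on the second request can be reproduced by hand and without network access. In the sketch below the cookie is built manually with a made-up value (demo-value) rather than received from douban.com, and the import fallback for Python 3 is an assumption. It also iterates over the jar, which is the public alternative to reading the private _cookies attribute.

```python
try:                                    # Python 2
    import cookielib
    import urllib2 as request
except ImportError:                     # Python 3
    import http.cookiejar as cookielib
    import urllib.request as request

jar = cookielib.CookieJar()

# Hand-made session cookie with the same shape as the 'bid' cookie above.
cookie = cookielib.Cookie(
    version=0, name='bid', value='demo-value',
    port=None, port_specified=False,
    domain='.douban.com', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={})
jar.set_cookie(cookie)

# Iterating over the jar yields Cookie objects.
stored = [(c.domain, c.name, c.value) for c in jar]
print(stored)

# The jar adds a Cookie header to any request whose host and path match.
req = request.Request('http://www.douban.com/')
jar.add_cookie_header(req)
sent = req.get_header('Cookie')
print(sent)   # bid=demo-value
```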
