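The urllib library in Python 3

urllib is Python's built-in HTTP request library.

It consists of the following modules:

urllib.request: sends requests;
urllib.error: exception handling;
urllib.parse: URL parsing;
urllib.robotparser: robots.txt parsing.

The usual way to send a request with urllib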

import urllib.parse

import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

response = urllib.request.urlopen('https://httpbin.org/post', data=data)

print(response.read())

Output:

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "47.74.11.227", \n  "url": "https://httpbin.org/post"\n}\n'

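The POST data has to be URL-encoded and then converted to bytes in a given encoding. Passing the data argument to urlopen sends a POST request; without it, urlopen sends a GET request.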
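The rule above can be checked without sending anything. The following is a minimal sketch: `Request.get_method()` reports which HTTP method urllib would use, and the `form` dict and URL simply reuse the values from the example above.

```python
import urllib.parse
import urllib.request

url = 'https://httpbin.org/post'
form = {'word': 'hello'}

# urlencode() returns a str such as 'word=hello'; the request body must be
# bytes, so the string is encoded before it is passed as data.
body = urllib.parse.urlencode(form).encode('utf-8')

get_req = urllib.request.Request(url)              # no data: GET
post_req = urllib.request.Request(url, data=body)  # data given: POST

print(body)                    # b'word=hello'
print(get_req.get_method())    # GET
print(post_req.get_method())   # POST
```

Nothing is sent here; the Request objects are only constructed and inspected.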

Setting a timeout:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)

print(response.read())

b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "origin": "36.57.169.42", \n  "url": "https://httpbin.org/get"\n}\n'

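A timeout mainly protects the crawler from pages that respond extremely slowly. Without one, a single slow request can leave the crawler hanging and drag down its overall performance.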
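When the timeout expires, urlopen raises an exception, so a crawler usually wraps the call. The sketch below shows one way to do this with urllib.error; the helper name `fetch` and its return convention (None on failure) are assumptions made for illustration, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=1.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # Connection-level failure; exc.reason may itself be a timeout.
        print('URL error:', exc.reason)
    except socket.timeout:
        # A timeout while waiting for or reading the response can be
        # raised directly instead of being wrapped in URLError.
        print('timed out after', timeout, 'seconds')
    return None
```

A call such as `fetch('http://httpbin.org/get', timeout=1)` then returns either the body or None instead of stopping the crawler.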

What kind of object does urlopen return?

import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(type(response))

<class 'http.client.HTTPResponse'>

This object can be read with its read() method. What else can we get from it?

import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(response.status)

print(response.getheaders())

print(response.getheader('Server'))

# Output

# Status code

200

# All response headers

[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Fastly-Debug-Digest', 'a63ab819df3b185a89db37a59e39f0dd85cf8ee71f54bbb42fae41670ae56fd2'), ('Content-Length', ''), ('Accept-Ranges', 'bytes'), ('Date', 'Wed, 31 Jan 2018 11:39:52 GMT'), ('Via', '1.1 varnish'), ('Age', ''), ('Connection', 'close'), ('X-Served-By', 'cache-iad2150-IAD, cache-hnd18723-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '6, 11'), ('X-Timer', 'S1517398792.207858,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

# The Server header

nginx

# Reading the whole page

import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(response.read().decode('utf-8'))
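Hard-coding 'utf-8' works for python.org, but other sites may use a different encoding. As a small sketch, the charset can be taken from the Content-Type header instead. The helper name `read_text` is hypothetical, and a data: URL stands in for a real page so the example runs offline.

```python
import urllib.request


def read_text(response, default='utf-8'):
    """Decode a response body using the charset from its Content-Type header."""
    # get_content_charset() returns None when the header names no charset.
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset, errors='replace')


# The data: URL below carries the UTF-8 bytes of a two-character greeting.
demo_url = 'data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD'
with urllib.request.urlopen(demo_url) as response:
    text = read_text(response)
print(text)   # 你好
```

The same helper can be applied to the response returned for an http or https URL.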

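In most cases the request needs to be customized, for example by adding request headers. To do that, build a Request object yourself and set the extra information on it.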

from urllib import request, parse

url = 'https://httpbin.org/post'

headers = {

    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',

    'Host': 'httpbin.org'

}

form = {

    'name': 'Germey'

}

data = bytes(parse.urlencode(form), encoding='utf8')

# Build a Request that carries the custom headers

req = request.Request(url=url, data=data, headers=headers, method='POST')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

{

  "args": {},

  "data": "",

  "files": {},

  "form": {

    "name": "Germey"

  },

  "headers": {

    "Accept-Encoding": "identity",

    "Connection": "close",

    "Content-Length": "",

    "Content-Type": "application/x-www-form-urlencoded",

    "Host": "httpbin.org",

    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"

  },

  "json": null,

  "origin": "36.57.169.42",

  "url": "https://httpbin.org/post"

}

Another way to add request headers

from urllib import request, parse

url = 'https://httpbin.org/post'

form = {

    'name': 'Germey'

}

data = bytes(parse.urlencode(form), encoding='utf8')

req = request.Request(url=url, data=data, method='POST')

req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

response = request.urlopen(req)

print(response.read().decode('utf-8'))
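The two ways of setting headers can also be compared offline, without contacting httpbin.org. The sketch below only constructs the Request objects and inspects them; the variable names are chosen for illustration.

```python
from urllib import request

url = 'https://httpbin.org/post'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Way 1: pass the headers to the constructor.
req_a = request.Request(url, data=b'name=Germey',
                        headers={'User-Agent': user_agent}, method='POST')

# Way 2: create the request first, then add the header.
req_b = request.Request(url, data=b'name=Germey', method='POST')
req_b.add_header('User-Agent', user_agent)

# Request stores header names capitalized, hence 'User-agent' here.
print(req_a.get_header('User-agent') == req_b.get_header('User-agent'))  # True
print(req_b.header_items())
```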

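The result is the same as with the previous request. That covers basic HTTP requests. For more advanced ones, such as going through a proxy or sending cookies, urllib provides handlers, and there are many kinds of them.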

import urllib.request

# Build an HTTPHandler, which handles HTTP requests.

# Passing debuglevel=1 turns on debug logging, so the program

# prints the data it sends and receives while it runs.

http_handler = urllib.request.HTTPHandler(debuglevel=1)

# build_opener() builds a custom opener from the handler objects passed to it

opener = urllib.request.build_opener(http_handler)

request = urllib.request.Request("http://www.baidu.com/")

response = opener.open(request)

print(response.read().decode('utf8'))

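Here we built a custom opener from an HTTPHandler. A proxy is set up the same way, with a ProxyHandler whose argument is a dict giving the proxy type (protocol) and the proxy server's IP and port.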

import urllib.request

# Proxy switch: whether to use the proxy

proxyswitch = True

# proxyswitch = False

# Build a ProxyHandler; its argument is a dict mapping the protocol to the proxy server's IP and port

httpproxy_handler = urllib.request.ProxyHandler({"http" : "121.41.175.199:80"})

# Build a handler that uses no proxy

nullproxy_handler = urllib.request.ProxyHandler({})

if proxyswitch:

    opener = urllib.request.build_opener(httpproxy_handler)

else:

    opener = urllib.request.build_opener(nullproxy_handler)

# install_opener() registers the opener globally, so every later urlopen() call goes through it and its handlers

urllib.request.install_opener(opener)

request = urllib.request.Request("http://www.baidu.com/")

response = urllib.request.urlopen(request)

print(response.read().decode('utf8'))
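install_opener() changes the behaviour of every later urlopen() call in the process. If only some requests should use the proxy, the opener can be kept local instead. The sketch below assumes a helper named `build_proxy_opener`, and the address 127.0.0.1:8080 is a placeholder, not a working proxy.

```python
import urllib.request


def build_proxy_opener(proxy=None):
    """Return an opener that uses `proxy` (a 'host:port' string) or no proxy."""
    # An empty dict disables proxies, including any taken from the environment.
    proxies = {'http': proxy, 'https': proxy} if proxy else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


opener = build_proxy_opener('127.0.0.1:8080')   # placeholder address
direct = build_proxy_opener()

# opener.open('http://www.baidu.com/') would go through the proxy, while
# direct.open(...) and plain urllib.request.urlopen() are unaffected.
```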

A handler that stores cookies

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

for item in cookie:

    print(item.name+"="+item.value)

# The cookies are printed one by one:

BAIDUID=185CCA6964D4660F71AC56C5FD40F293:FG=1

BIDUPSID=185CCA6964D4660F71AC56C5FD40F293

H_PS_PSSID=25641_1460_24565_21125_18559

PSTM=1517404650

BDSVRTM=0

BD_HOME=0

Saving cookies to a text file

import http.cookiejar, urllib.request

filename = "cookie.txt"

cookie = http.cookiejar.MozillaCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True, ignore_expires=True)

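To reuse the saved cookies, load them from the local file and visit the site with them. The file must be loaded with the same jar class that saved it (MozillaCookieJar here):

import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()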

cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

print(response.read().decode('utf-8'))
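The save and load steps can be tried offline. In the sketch below the cookie is built by hand purely for illustration (normally the server sets it); `make_cookie`, the cookie name and value, and the temporary file path are all assumptions. It also shows what happens when the file is loaded with a jar class of a different format.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

saved = http.cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie('token', 'abc123', 'example.com'))
# Session cookies are skipped unless ignore_discard=True.
saved.save(ignore_discard=True, ignore_expires=True)

# Load with the same class that wrote the file.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
print([(c.name, c.value) for c in loaded])   # [('token', 'abc123')]

# A jar class for a different file format rejects the file.
try:
    http.cookiejar.LWPCookieJar().load(path)
except http.cookiejar.LoadError as exc:
    print('format mismatch:', exc)
```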

urlparse splits a URL into its components (in other words, it parses the URL).

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(type(result), result)

Parsed result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

# Another case: the URL has no scheme, so a default is passed in

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')

print(result)

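Here the URL has no scheme, so the scheme argument supplies the default and the result has scheme='https'. Because the URL does not start with //, the host is not recognized: netloc is empty and www.baidu.com/index.html ends up in path. If the URL does contain a scheme, that scheme takes precedence over the argument.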

from urllib.parse import urlparse

# The URL already has a scheme and we pass another one in; what is the result?

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')

print(result)

# The scheme in the URL wins:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
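The two scheme cases can be compared side by side. This short sketch reuses the URLs from the examples above and also shows that the result, being a named tuple, can be read by field name or by position.

```python
from urllib.parse import urlparse

with_scheme = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
                       scheme='https')
without_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment',
                          scheme='https')

# Fields can be read by name or by index.
print(with_scheme.scheme, with_scheme.netloc)   # http www.baidu.com
print(with_scheme[2])                           # /index.html

# No '//' in the URL, so the host lands in path and netloc is empty.
print(without_scheme.scheme)         # https
print(repr(without_scheme.netloc))   # ''
print(without_scheme.path)           # www.baidu.com/index.html
```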

# Turning off fragment parsing with allow_fragments=False

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)

print(result)

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)

print(result)

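# In effect, with allow_fragments=False the fragment is no longer split off; it stays attached to the preceding component (the query, or the path when there is no params or query part)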

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

# The reverse operation: urlunparse

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']

print(urlunparse(data))

# The components are reassembled into a URL:

http://www.baidu.com/index.html;user?a=6#comment
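urlparse and urlunparse are inverses of each other for a URL like this one, which makes it easy to change a single component. A minimal sketch (the replacement values are arbitrary):

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'
parts = urlparse(url)

# Round trip: unparsing the parsed result gives back the same URL here.
print(urlunparse(parts) == url)   # True

# _replace() (inherited from namedtuple) returns a copy with fields changed.
https_parts = parts._replace(scheme='https', query='a=7')
print(urlunparse(https_parts))    # https://www.baidu.com/index.html;user?a=7#comment
```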

Joining URLs with urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))

print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))

print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))

print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))

print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))

print(urljoin('http://www.baidu.com', '?category=2#comment'))

print(urljoin('www.baidu.com', '?category=2#comment'))

print(urljoin('www.baidu.com#comment', '?category=2'))
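The calls above print no results in this article, so here is a small sketch of the main rules, using a base URL with a directory in its path (the `/docs/` path is made up for illustration): components present in the second argument replace those of the base, and anything missing is taken from the base.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/docs/about.html'

# Relative path: replaces the last path segment of the base.
relative = urljoin(base, 'FAQ.html')
print(relative)    # http://www.baidu.com/docs/FAQ.html

# Absolute path: keeps only the scheme and host of the base.
absolute = urljoin(base, '/FAQ.html')
print(absolute)    # http://www.baidu.com/FAQ.html

# Full URL: the base is ignored entirely.
full = urljoin(base, 'https://cuiqingcai.com/FAQ.html')
print(full)        # https://cuiqingcai.com/FAQ.html

# Query only: keeps the base path, replaces query and fragment.
query_only = urljoin(base, '?category=2#comment')
print(query_only)  # http://www.baidu.com/docs/about.html?category=2#comment
```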

Building a query string with urlencode

from urllib.parse import urlencode

params = {

    'name': 'germey',

    'age': 22

}

base_url = 'http://www.baidu.com?'

url = base_url + urlencode(params)

print(url)

http://www.baidu.com?name=germey&age=22
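urlencode also percent-encodes non-ASCII text. The sketch below uses quote, unquote and parse_qs from urllib.parse; the parameter names `wd` and `page` are arbitrary examples.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

keyword = '你好'

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
encoded = quote(keyword)
print(encoded)              # %E4%BD%A0%E5%A5%BD
print(unquote(encoded))     # 你好

# urlencode() applies the same encoding to every value in a dict.
query = urlencode({'wd': keyword, 'page': 1})
print(query)                # wd=%E4%BD%A0%E5%A5%BD&page=1
print(parse_qs(query))      # {'wd': ['你好'], 'page': ['1']}
```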

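URL encoding is mostly needed when Chinese or other non-ASCII characters are sent in a URL. These are the commonly used operations of the urllib library.

As you go deeper into web scraping, you will meet the more powerful Requests library. Its documentation describes it, tongue in cheek, as the only non-GMO HTTP library for Python, safe for human consumption.

In the same spirit: Requests lets you send organic, grass-fed HTTP/1.1 requests without manual labor. You don't need to add query strings to URLs by hand or form-encode POST data. Keep-alive and HTTP connection pooling are 100% automatic, all powered by urllib3, which Requests builds on.

Official documentation: Requests: HTTP for Humans.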
