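Web Crawling: The urllib Library in Detail

1. What is urllib?

urllib is Python's built-in HTTP request library. It consists of four modules: urllib.request (opening URLs), urllib.error (exceptions), urllib.parse (URL parsing) and urllib.robotparser (robots.txt parsing).

2. Changes from Python 2

Python 2 split this functionality between urllib and urllib2; Python 3 merges them into the single urllib package. For example, urllib2.urlopen(url) in Python 2 becomes urllib.request.urlopen(url) in Python 3.

3. Usage

(1) urlopen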

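urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

# The first argument is the URL, the second is optional data to send in the request body, and the third sets the timeout; the remaining arguments are not needed for now.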

######### GET request #############
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

Output:

<!DOCTYPE html>
<!--STATUS OK-->
·······················
·······················
·······················
<script>
if(navigator.cookieEnabled){
document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>
</body>
</html>

######### POST request #############
import urllib.request
import urllib.parse

# urlencode() returns a str; urlopen() needs bytes for its data argument
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# http://httpbin.org/post is an HTTP testing service that echoes the request back
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())

Output:

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.5"\n  }, \n  "json": null, \n  "origin": "221.208.253.76", \n  "url": "http://httpbin.org/post"\n}\n'
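To see what the POST example actually sends, the request can be built and inspected without touching the network. This is a minimal sketch; the `form_fields` dictionary and its `lang` field are made up for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields, used only for illustration.
form_fields = {"word": "hello", "lang": "zh CN"}

# urlencode() produces a str; the data argument must be bytes.
encoded = urllib.parse.urlencode(form_fields)
body = encoded.encode("utf-8")

# A Request that carries data defaults to POST; without data it is a GET.
post_req = urllib.request.Request("http://httpbin.org/post", data=body)
get_req = urllib.request.Request("http://httpbin.org/get")

print(encoded)                 # word=hello&lang=zh+CN
print(post_req.get_method())   # POST
print(get_req.get_method())    # GET
```

Nothing is sent until the request object is passed to `urlopen()`, so this is a cheap way to check an encoded body before making the real call.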

############### Setting a timeout ###############
import urllib.request

# Set a timeout in seconds; if no response arrives within that time, an exception is raised
response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(response.read())

Output:

b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.5"\n }, \n "origin": "221.208.253.76", \n "url": "http://httpbin.org/get"\n}\n'

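############### Setting a timeout: response time exceeded ###############
import socket
import urllib.error
import urllib.request

try:
    # 0.1 s is too short for the server to answer, so the request times out
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # the underlying cause is stored in e.reason
    if isinstance(e.reason, socket.timeout):
        print("Time out")

Output:

Time out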
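The timeout handling above can be wrapped in a small helper. The sketch below is one possible shape, not a standard API: `fetch_or_none` and its `opener` parameter are hypothetical names, and `always_times_out` is a stand-in that simulates an unresponsive server so the example runs without network access.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request timed out.

    `opener` is a hypothetical injection point so the function can be
    exercised offline; it defaults to urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts after the connection is established (for example
        # while reading the body) can surface as a bare socket.timeout.
        return None


def always_times_out(url, timeout):
    """Stand-in opener that simulates a server that never answers."""
    raise urllib.error.URLError(socket.timeout("timed out"))


print(fetch_or_none("http://httpbin.org/get", opener=always_times_out))  # None
```

Any other `URLError` is re-raised, so a timeout is never confused with a refused connection or a DNS failure.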

(2) Response

Response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output:

<class 'http.client.HTTPResponse'>

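Status code and response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)               # status code
print(response.getheaders())         # all response headers; note the call parentheses
print(response.getheader('Server'))  # a single header

Output:

200
(a list of (name, value) tuples, one per response header; the contents vary by server)
nginx

Without the parentheses, print(response.getheaders) prints only the bound method object, for example <bound method HTTPResponse.getheaders of <http.client.HTTPResponse object at 0x0000000002D04EB8>>, not the headers themselves.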
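The response attributes can be explored without depending on an external site. The sketch below starts a throwaway HTTP server on localhost and queries it; `HelloHandler`, the response body, and the use of an empty `ProxyHandler({})` (which disables any proxy picked up from the environment) are choices made for this illustration only.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed body."""

    def do_GET(self):
        payload = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler bypasses proxies configured in the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    headers = dict(response.getheaders())   # list of tuples -> dict
    content_type = response.getheader("Content-Type")
    body = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status)         # 200
print(content_type)   # text/plain; charset=utf-8
print(body)           # hello
```

Using the response as a context manager closes the connection as soon as the block ends.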

(3) Request

import urllib.request

# Build a Request object first, then pass it to urlopen()
request = urllib.request.Request("https://python.org")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))

Output:

<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <!--<![endif]-->
<head>
·················
····················
</body>
</html>

############ POST request with custom headers ###############
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)",
    "Host": 'httpbin.org'
}
# named form_data so that the built-in dict is not shadowed
form_data = {
    'name': "Germey"
}
data = bytes(parse.urlencode(form_data), encoding="utf-8")
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
  },
  "json": null,
  "origin": "221.208.253.76",
  "url": "http://httpbin.org/post"
}

############ Adding a header with add_header() ###############
from urllib import request, parse

url = "http://httpbin.org/post"
# named form_data so that the built-in dict is not shadowed
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, method="POST")
# headers can also be added one at a time after the Request is created
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0(compatible;MSIE5.5;Windows NT)"
  },
  "json": null,
  "origin": "221.208.253.76",
  "url": "http://httpbin.org/post"
}
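A Request object can be inspected before it is sent, which helps when checking that headers and the body were set as intended. The following is a small offline sketch; nothing here contacts httpbin.org.

```python
from urllib import parse, request

form_data = {"name": "Germey"}
data = parse.urlencode(form_data).encode("utf-8")

req = request.Request("http://httpbin.org/post", data=data, method="POST")
req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")

# Request stores header names capitalized, so "User-Agent" is kept
# internally under the key "User-agent".
print(req.get_method())               # POST
print(req.has_header("User-agent"))   # True
print(req.get_header("User-agent"))   # Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)
print(len(req.data))                  # 11
```

The length of `req.data` is what ends up in the Content-Length header that httpbin echoes back.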

(4) Handler

Proxy

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',    # proxy for HTTP requests
    'https': 'https://127.0.0.1:9743'   # proxy for HTTPS requests
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
print(response.read())

Because no proxy is running on my machine, the request fails with:

urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>

Cookie

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()                   # container that will hold the cookies
handler = urllib.request.HTTPCookieProcessor(cookie)  # handler that stores received cookies in the jar
opener = urllib.request.build_opener(handler)         # build an opener that uses the handler
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

Output:

BAIDUID=DDCB4C216AE8EE90C7D95E7AF8FA577F:FG=1
BIDUPSID=DDCB4C216AE8EE90C7D95E7AF8FA577F
H_PS_PSSID=1452_21078_26350_27111
PSTM=1536830732
BDSVRTM=0
BD_HOME=0
delPer=0

########### Saving cookies to a file ##########
import http.cookiejar, urllib.request

filename = "cookie.txt"
# MozillaCookieJar writes the Netscape/Mozilla cookies.txt format
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
# also save session cookies and cookies that have already expired
cookie.save(ignore_discard=True, ignore_expires=True)

A cookie.txt file now appears in the project directory. Its contents are:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com TRUE   /  FALSE  3684314677 BAIDUID    CB67C520D33E28D7204C570EB7DFA28F:FG=1
.baidu.com TRUE   /  FALSE  3684314677 BIDUPSID   CB67C520D33E28D7204C570EB7DFA28F
.baidu.com TRUE   /  FALSE     H_PS_PSSID 1434_21113_26350_20930
.baidu.com TRUE   /  FALSE  3684314677 PSTM   1536831034
www.baidu.com  FALSE  /  FALSE     BDSVRTM    0
www.baidu.com  FALSE  /  FALSE     BD_HOME    0
www.baidu.com  FALSE  /  FALSE  2482910974 delPer 0

########### Another way to save cookies ##########
import http.cookiejar, urllib.request

filename = "cookies.txt"
# LWPCookieJar writes the libwww-perl Set-Cookie3 format
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

As before, running this code writes the cookies to a file; only the file format differs.
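A saved cookie file is only useful if it can be read back. The sketch below shows a save and load round trip with LWPCookieJar; to stay offline it builds one cookie by hand, so the `make_cookie` helper, the `session` cookie, and the `example.com` domain are all invented for illustration (normally the server sets the cookies).

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a simple session cookie by hand, for demonstration only."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Save: ignore_discard=True keeps session cookies, which are
# otherwise skipped when writing the file.
saved = http.cookiejar.LWPCookieJar(path)
saved.set_cookie(make_cookie("session", "abc123", "example.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# Load into a fresh jar with the same flags.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)

restored = {c.name: c.value for c in loaded}
print(restored)  # {'session': 'abc123'}
```

Passing the loaded jar to `urllib.request.HTTPCookieProcessor` and building an opener from it makes later requests send the restored cookies.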

(5) Exception handling

from urllib import request, error

try:
    # this page does not exist, so the server answers 404
    response = request.urlopen("http://cuiqingcai.com/index.htm")
except error.URLError as e:
    print(e.reason)

Output:

Not Found

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")

Output:

Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 13 Sep 2018 11:08:18 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

import socket
import urllib.request
import urllib.error

try:
    # an absurdly small timeout guarantees that the request times out
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.000000001)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print("TimeOut")

Output:

<class 'socket.timeout'>
TimeOut
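The three cases in this section (HTTP status errors, timeouts, other URL errors) can be told apart in one place. The sketch below constructs the exceptions by hand so it runs offline; the `describe` function and the example URL are hypothetical.

```python
import io
import socket
from email.message import Message
from urllib import error


def describe(exc):
    """Turn a urllib exception into a short human-readable string."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return "URL error: %s" % exc.reason
    return "unexpected: %r" % exc


# Exceptions built by hand, standing in for what urlopen() would raise.
not_found = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO()
)
timed_out = error.URLError(socket.timeout("timed out"))
refused = error.URLError("connection refused")

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(not_found))   # HTTP 404: Not Found
print(describe(timed_out))   # timed out
print(describe(refused))     # URL error: connection refused
```

The subclass relationship is why the `except error.HTTPError` clause has to come before `except error.URLError`; in the other order the HTTPError branch would never run.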

(6) URL parsing

urlparse

urllib.parse.urlparse(urlstring, scheme="", allow_fragments=True)

from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment")
print(type(result), result)

Output:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i', fragment='comment')

######## URL without a scheme ###########
from urllib.parse import urlparse

# the scheme argument supplies a default when the URL itself has none
result = urlparse("www.baidu.com/index.html;user?id=5i#comment", scheme="https")
print(result)

Output:

ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5i', fragment='comment')

######## The scheme argument is only a default ###########
from urllib.parse import urlparse

# the URL already carries a scheme, so the scheme argument is ignored
result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment", scheme="https")
print(result)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i', fragment='comment')

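######## allow_fragments=False ###########
from urllib.parse import urlparse

# with allow_fragments=False the fragment is left attached to the preceding component (here the query)
result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment", allow_fragments=False)
print(result)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i#comment', fragment='')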

from urllib.parse import urlparse

# with no params or query, the fragment stays attached to the path
result = urlparse("http://www.baidu.com/index.html#comment", allow_fragments=False)
print(result)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
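Since ParseResult is a named tuple, its pieces can be read by attribute or by index, and the query string can be broken down further with `parse_qs`. A short sketch follows; the `tag` parameters in the URL are added here purely for illustration.

```python
from urllib.parse import parse_qs, urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5&tag=a&tag=b#comment")

# Fields are available by name or by position.
print(result.scheme)      # http
print(result[1])          # www.baidu.com
print(result.hostname)    # www.baidu.com
print(result.fragment)    # comment

# parse_qs() splits the query into a dict that maps each key to a list,
# because a key may appear more than once.
query = parse_qs(result.query)
print(query)              # {'id': ['5'], 'tag': ['a', 'b']}
```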

urlunparse

from urllib.parse import urlunparse

# six components: scheme, netloc, path, params, query, fragment
data = ["http", "www.baidu.com", "index.html", "user", 'a=6', 'comment']
print(urlunparse(data))

Output:

http://www.baidu.com/index.html;user?a=6#comment
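urlparse and urlunparse are inverses, which makes it easy to change one part of a URL and rebuild it. The sketch below is one way to do that; the replacement values (`https`, `a=7`, `b`) are arbitrary examples.

```python
from urllib.parse import urlencode, urlparse, urlunparse

original = "http://www.baidu.com/index.html;user?a=6#comment"
parts = urlparse(original)

# Parsing and unparsing gives back the same URL.
round_trip = urlunparse(parts)
print(round_trip == original)   # True

# _replace() returns a copy of the named tuple with selected fields changed.
updated = parts._replace(scheme="https", query=urlencode({"a": 7, "b": "x y"}))
new_url = urlunparse(updated)
print(new_url)   # https://www.baidu.com/index.html;user?a=7&b=x+y#comment
```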

urljoin (joins URLs: the first argument is the base and only fills in what the second lacks; any component present in the second argument takes precedence)

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/infex.php'))
print(urljoin('http://www.baidu.com', '?category=2#commen:'))
print(urljoin('www.baidu.com', '?category=2t#comment'))
print(urljoin('www.baidu.comi#comment', '?category=2'))

Output:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/infex.php
http://www.baidu.com?category=2#commen:
www.baidu.com?category=2t#comment
www.baidu.comi?category=2
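In a crawler, urljoin is most often used to turn the links found on a page into absolute URLs. A minimal sketch, assuming a made-up page address and a hand-written list of `href` values:

```python
from urllib.parse import urljoin

# Hypothetical page address and links, for illustration only.
page_url = "http://www.baidu.com/docs/guide/index.html"
hrefs = [
    "faq.html",                          # relative to the current directory
    "../about.html",                     # one directory up
    "/static/logo.png",                  # relative to the site root
    "https://cuiqingcai.com/FAQ.html",   # already absolute
]

absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
# http://www.baidu.com/docs/guide/faq.html
# http://www.baidu.com/docs/about.html
# http://www.baidu.com/static/logo.png
# https://cuiqingcai.com/FAQ.html
```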

urlencode (converts a dict into GET request parameters)

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:

http://www.baidu.com?name=germey&age=22
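urlencode also takes care of escaping, which matters as soon as a value contains non-ASCII text or spaces. The sketch below shows that, together with list values and the reverse operation; the parameter names are arbitrary examples.

```python
from urllib.parse import parse_qs, urlencode

# Non-ASCII values are percent-encoded as UTF-8 by default.
params = {"wd": "中国", "page": 2}
query = urlencode(params)
print(query)    # wd=%E4%B8%AD%E5%9B%BD&page=2

# doseq=True expands a list value into repeated keys.
tags = urlencode({"tag": ["a", "b"]}, doseq=True)
print(tags)     # tag=a&tag=b

# parse_qs() reverses the encoding; every value comes back as a list of str.
decoded = parse_qs(query)
print(decoded)  # {'wd': ['中国'], 'page': ['2']}
```

Note that the integer `2` comes back as the string `'2'`: query strings carry no type information.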
