urllib库使用方法

这周打算把学过的内容重新总结一下，便于以后翻阅查找资料。

urllib库是python的内置库，不需要单独下载。其主要分为四个模块：

1.urllib.request——请求模块

2.urllib.error——异常处理模块

3.urllib.parse——url解析模块

4.urllib.robotparser——用来识别网站的robot.txt文件（看看哪些内容是可以爬的，不常用）

1.urlopen

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.read().decode('utf-8'))

超时读取

import socket

import urllib.request

import urllib.error

try:

    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)

except urllib.error.URLError as e:

    if isinstance(e.reason, socket.timeout):

        print('TIME OUT')

响应内容分析

import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(type(response))

print(response.status)

print(response.getheaders())

print(response.getheader('Server'))

<class 'http.client.HTTPResponse'>

200

[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '50069'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 26 Nov 2018 02:44:49 GMT'), ('Via', '1.1 varnish'), ('Age', '3121'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2150-IAD, cache-sjc3143-SJC'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 254'), ('X-Timer', 'S1543200290.644687,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

nginx

2. request

用来传递更多的请求参数，url,headers,data, method

from urllib import request, parse

url = 'http://httpbin.org/post'

headers = {

    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',

    'Host': 'httpbin.org'

}

dict = {

    'name': 'Germey'

}

data = bytes(parse.urlencode(dict), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

{

  "args": {},

  "data": "",

  "files": {},

  "form": {

    "name": "Germey"

  },

  "headers": {

    "Accept-Encoding": "identity",

    "Connection": "close",

    "Content-Length": "11",

    "Content-Type": "application/x-www-form-urlencoded",

    "Host": "httpbin.org",

    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"

  },

  "json": null,

  "origin": "117.136.66.101",

  "url": "http://httpbin.org/post"

}

另一种方法添加headers

from urllib import request, parse

url = 'http://httpbin.org/post'

dict = {

    'name': 'asd'

}

data = bytes(parse.urlencode(dict), encoding='utf8')

req = request.Request(url=url, data=data, method='POST')

req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  #如果有好多headers,可以循环添加

response = request.urlopen(req)

print(response.read().decode('utf-8'))

3. Handler

代理

import urllib.request

proxy_handler = urllib.request.ProxyHandler({

    'http': 'http://127.0.0.1:9743',

    'https': 'https://127.0.0.1:9743'

})

opener = urllib.request.build_opener(proxy_handler)

response = opener.open('https://www.httpbin,org/get')

print(response.read())

4. Cookie

获取cookie
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

for item in cookie:

    print(item.name+"="+item.value)

存储Cookie

import http.cookiejar, urllib.request

filename = "cookie.txt"

cookie = http.cookiejar.MozillaCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True, ignore_expires=True)

获取Cookie

import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()

cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

print(response.read().decode('utf-8'))

5. UrlError 和 HttpError

from urllib import request, error

try:

    response = request.urlopen('http://cuiqingcai.com/index.htm')

except error.HTTPError as e:

    print(e.reason, e.code, e.headers, sep='\n')

except error.URLError as e:

    print(e.reason)

else:

    print('Request Successfully')

对于urllib.parse，因为用的不多，就先不写了。上传一个文档链接吧，万一哪天用了还可以看看。

https://files.cnblogs.com/files/zhangguoxv/urllib%E8%AE%B2%E8%A7%A3.zip

urllib库使用方法的更多相关文章

urllib库使用方法 4 create headers
import urllib.requestimport urllib.parse url = "https://www.baidu.com/"#普通请求方法response = u ...
urllib库使用方法 3 get html
import urllib.requestimport urllib.parse #https://www.baidu.com/s?ie=UTF-8&wd=中国#将上面的中国部分内容,可以动态 ...
urllib库使用方法 2 parse
import urllib.parse #url.parse用法包含三个方法:quote url, unquote rul, urlencode#quote url 编码函数,url规范只识别字母.数 ...
urllib库使用方法1 request
urllib是可以模仿浏览器发送请求的库,Python自带 Python3中urllib分为:urllib.request和urllib.parse import urllib.request url ...
Python爬虫学习==>第七章：urllib库的基本使用方法
学习目的: urllib提供了url解析函数,所以需要学习正式步骤 Step1:什么是urllib urllib库是Python自带模块,是Python内置的HTTP请求库包含4个模块: >& ...
python--爬虫入门（七）urllib库初体验以及中文编码问题的探讨
python系列均基于python3.4环境 ---------@_@? --------------------------------------------------------------- ...
urllib库初体验以及中文编码问题的探讨
提出问题:如何简单抓取一个网页的源码解决方法:利用urllib库,抓取一个网页的源代码 ------------------------------------------------------- ...
Python爬虫入门 Urllib库的基本使用
1.分分钟扒一个网页下来怎样扒网页呢?其实就是根据URL来获取它的网页信息,虽然我们在浏览器中看到的是一幅幅优美的画面,但是其实是由浏览器解释才呈现出来的,实质它是一段HTML代码,加 JS.CSS ...
Python爬虫入门：Urllib库的基本使用
1.分分钟扒一个网页下来怎样扒网页呢?其实就是根据URL来获取它的网页信息,虽然我们在浏览器中看到的是一幅幅优美的画面,但是其实是由浏览器解释才呈现出来的,实质它是一段HTML代码,加 JS.CS ...

随机推荐

Rocket - diplomacy - DUEB参数模型分析
https://mp.weixin.qq.com/s/533bJxcPRgO4W2gf_OEhEw 分析DUEB参数模型中各种参数类型的可能性. 1. 节点类型根据参数的传播方向,可 ...
Nginx 笔记（一）nginx简介与安装
个人博客网:https://wushaopei.github.io/ (你想要这里多有) Nginx 简介: 1.介绍 nginx 的应用场景和具体可以做什么事情 2.介绍什么是反向代理 3.介 ...
Java实现 LeetCode 768 最多能完成排序的块 II（左右便利）
768. 最多能完成排序的块 II 这个问题和"最多能完成排序的块"相似,但给定数组中的元素可以重复,输入数组最大长度为2000,其中的元素最大为10**8. arr是一个可能包含 ...
Java实现 LeetCode 151 翻转字符串里的单词
151. 翻转字符串里的单词给定一个字符串,逐个翻转字符串中的每个单词. 示例 1: 输入: "the sky is blue" 输出: "blue is sky th ...
Redis企业级数据备份与恢复方案
一.持久化配置 RBD和AOF建议同时打开(Redis4.0之后支持) RDB做冷备,AOF做数据恢复(数据更可靠) RDB采取默认配置即可,AOF推荐采取everysec每秒策略 AOF和RDB还不 ...
iOS－pthread && NSThread && iOS9网络适配
几个概念: 进程:"正在运行"应用程序(app)就是一个进程,它至少包含一个线程: 进程的作用:为应用程序开辟内存空间: 线程:CPU调度的最小单元: ...
hibernate中的映射
hibernate中的映射是指Java类和数据库表中的属性来进行关联,然后通过类来操作数据库中,这就是简单的映射.
数据误操作，删库跑路？教你使用ApexSQLLog工具从 SQLServer日志恢复数据！
前几天同事不小心误操作,将SQLServer库的一张表的一个状态字段给刷成了一个统一状态,由于是update执行所以原来的相关状态无法确定.发生这种事情的时候我的小伙伴背后一凉,估计心里里面想这怕是 ...
JS基础知识笔记
2020-04-15 JS基础知识笔记 // new Boolean()传入的值与if判断一样 var test=new Boolean(); console.log(test); // false ...
EIGRP-15-其他和高级的EIGRP特性-1-路由器ID
与很多协议一样, EIGRP也使用了路由器ID (RTD)的概念,用一个4字节的编号来标识某个路由器实例.每个地址家族实例拥有自已独立的RID.工程师可以在一台路由器上,为多个EIGRP进程和地址家族 ...

urllib库使用方法

urllib库使用方法的更多相关文章

随机推荐

热门专题