0. Web scraping: the urllib library, urlopen() and Request()

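# Pay attention to whether you write `import urllib.request` or `from urllib import request`: the first form is called as urllib.request.urlopen(), the second as request.urlopen().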
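A minimal sketch of the difference between the two import styles: both bind the same module, only the name you type changes. No request is sent here.

```python
import urllib.request
from urllib import request

# Both names are bound to the same module object.
same_module = urllib.request is request

# So urlopen is reachable under either spelling.
same_function = urllib.request.urlopen is request.urlopen

print(same_module, same_function)
```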

0. urlopen()

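Syntax: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

(This is the Python 3.6 signature. cafile, capath, and cadefault have been deprecated since 3.6 in favor of context.)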

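Example 0: (in practice this function is usually called with only three parameters: url, data, and timeout)

* The data argument must be a bytes object, not a str. Use bytes() (or str.encode()) to convert the URL-encoded string into a sequence of bytes in a given encoding.

* response.read() returns bytes, so it has to be decoded to get a str.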
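The two notes above can be sketched without touching the network. The form field below is the one used in Example 0; the variable names are illustrative only.

```python
from urllib import parse

form = {'word': 'hello'}
query = parse.urlencode(form)            # a str: 'word=hello'

# Two equivalent ways to turn the str into the bytes that urlopen() expects.
body_a = bytes(query, encoding='utf8')
body_b = query.encode('utf-8')

print(type(query).__name__, type(body_a).__name__)   # str bytes
print(body_a == body_b)                               # True
print(len(body_a))                                    # 10
print(body_a.decode('utf-8') == query)                # True
```

The length of 10 bytes is what shows up as "Content-Length": "10" in the output of Example 0.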

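import socket
import urllib.error
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
# urlencode() returns a str; urlopen() needs bytes for the request body
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

try:
    response = urllib.request.urlopen(url, data=data, timeout=1)
except urllib.error.URLError as e:
    # e.reason is a socket.timeout instance when the connection timed out
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
    else:
        print('Request failed:', e.reason)
else:
    print(response.read().decode('utf-8'))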

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.6"
  },
  "json": null,
  "origin": "101.206.170.234, 101.206.170.234",
  "url": "https://httpbin.org/post"
}
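The timeout check in Example 0 can be tried without a network connection by building the exceptions by hand. This is only a sketch: describe_error is a hypothetical helper, not part of urllib.

```python
import socket
import urllib.error


def describe_error(exc):
    """Return a short label for an exception raised around urlopen()."""
    if isinstance(exc, urllib.error.URLError):
        # URLError wraps the underlying cause in its .reason attribute.
        if isinstance(exc.reason, socket.timeout):
            return 'TIME OUT'
        return 'URL error: %s' % exc.reason
    if isinstance(exc, socket.timeout):
        # A timeout while reading the body can surface without the wrapper.
        return 'TIME OUT'
    return 'unexpected: %r' % exc


# Exceptions are constructed directly, so nothing is sent over the network.
print(describe_error(urllib.error.URLError(socket.timeout('timed out'))))
print(describe_error(urllib.error.URLError('connection refused')))
```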

Example 1: inspect the status code, the response headers, and the Server field of the response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)               # HTTP status code
print(response.getheaders())         # all headers as a list of (name, value) tuples
print(response.getheader('Server'))  # a single header value

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48410'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 09 Apr 2019 02:32:34 GMT'), ('Via', '1.1 varnish'), ('Age', '722'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18751-HND'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 1223'), ('X-Timer', 'S1554777154.210361,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx
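getheaders() returns a list of (name, value) tuples, and a name such as Via can appear more than once, as it does in the output above. A small sketch of what that means if you convert the list yourself, using a shortened copy of those pairs:

```python
from collections import defaultdict

# A shortened copy of the (name, value) pairs from the output above.
headers = [
    ('Server', 'nginx'),
    ('Via', '1.1 vegur'),
    ('Via', '1.1 varnish'),
    ('Content-Length', '48410'),
]

# dict() keeps only the last value for a repeated name.
flat = dict(headers)

# Grouping into lists preserves every value.
grouped = defaultdict(list)
for name, value in headers:
    grouped[name].append(value)

print(flat['Via'])
print(grouped['Via'])
```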

Calling urlopen() with a bare URL is quite limited: for example, you cannot set request headers. That is why the Request class is needed.

1. Request()

Example 0: (the two approaches below have the same effect)

import urllib.request

# Pass the URL string directly
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

######################################

import urllib.request

# Wrap the URL in a Request object first, then open it
req = urllib.request.Request('https://www.python.org')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

The rest of this section shows how to use Request() to send GET and POST requests and to set their parameters.
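Before the full examples, here is a small sketch of how a Request object picks its HTTP method. The Request objects are only constructed, never sent, and the URL is a placeholder.

```python
from urllib import request

url = 'http://httpbin.org/anything'   # placeholder URL; nothing is sent

get_req = request.Request(url)                                      # no data
post_req = request.Request(url, data=b'name=Germey')                # data given
put_req = request.Request(url, data=b'name=Germey', method='PUT')   # explicit method

print(get_req.get_method())    # GET
print(post_req.get_method())   # POST
print(put_req.get_method())    # PUT
```

Without a method argument, the default is GET when data is None and POST otherwise, so method='POST' in the example below is explicit rather than required.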

Example 1: (POST request)

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

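You can also add headers with add_header() to imitate a browser, and the data argument can be built as shown below:

Note: headers can also be changed with build_opener(). That is not covered here, since the approach above is enough for most cases.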

from urllib import request, parse

url = 'http://httpbin.org/post'
# Named form_data rather than dict to avoid shadowing the built-in
form_data = {
    'name': 'Germey'
}
# str.encode() is equivalent to bytes(..., encoding='utf8')
data = parse.urlencode(form_data).encode('utf-8')

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
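To check which headers a Request will carry before sending it, you can inspect the object directly. A sketch that sends nothing; note that Request stores header names in capitalize() form, so 'User-Agent' is stored as 'User-agent'.

```python
from urllib import request

req = request.Request('http://httpbin.org/post', data=b'name=Germey',
                      headers={'Host': 'httpbin.org'}, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

# Header names are normalised with str.capitalize() when stored.
print(req.has_header('User-agent'))    # True
print(req.has_header('User-Agent'))    # False: lookup uses the stored spelling
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```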

Example 2: (GET request) keyword search on Baidu

from urllib import request, parse

url = 'http://www.baidu.com/s?wd='
key = '路飞'
key_code = parse.quote(key)  # percent-encode the non-ASCII keyword
url_all = url + key_code

"""
# Alternative: build the query string with urlencode()
url = 'http://www.baidu.com/s'
key = '路飞'
wd = parse.urlencode({'wd': key})
url_all = url + '?' + wd
"""

req = request.Request(url_all)
response = request.urlopen(req)
print(response.read().decode('utf-8'))
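The two ways of building the search URL in Example 2 produce the same string. A quick offline check:

```python
from urllib import parse

key = '路飞'

# First approach: percent-encode the value and append it.
url_quote = 'http://www.baidu.com/s?wd=' + parse.quote(key)

# Second approach: let urlencode() build the whole "wd=..." pair.
url_encode = 'http://www.baidu.com/s' + '?' + parse.urlencode({'wd': key})

print(url_quote)                 # http://www.baidu.com/s?wd=%E8%B7%AF%E9%A3%9E
print(url_quote == url_encode)   # True
```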

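At this point you may have questions about decode, and about quote() and urlencode() in the parse module. Some notes:

parse.quote: percent-encodes a str (for example, non-ASCII characters become %XX sequences)
parse.urlencode: turns the k: v pairs of a dict into a k=encoded_v query string, with pairs joined by &
parse.unquote: turns percent-encoded data back into the original text
decode: converts bytes to str; decode("utf-8") pairs naturally with read()
encode: converts str to bytes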
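A short sketch of the functions listed above, run on a sample string chosen for illustration. One detail worth knowing: quote() encodes a space as %20, while urlencode() uses quote_plus() by default and encodes it as +.

```python
from urllib import parse

text = 'one piece 路飞'                      # sample value, chosen for illustration

quoted = parse.quote(text)                   # space -> %20
encoded = parse.urlencode({'wd': text})      # space -> +

print(quoted)                                # one%20piece%20%E8%B7%AF%E9%A3%9E
print(encoded)                               # wd=one+piece+%E8%B7%AF%E9%A3%9E
print(parse.unquote(quoted) == text)         # True
print(parse.parse_qs(encoded))               # {'wd': ['one piece 路飞']}
```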

>>> str0 = '我爱你'
>>> str1 = str0.encode('gb2312')
>>> str1
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str2 = str0.encode('gbk')
>>> str2
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str3 = str0.encode('utf-8')
>>> str3
b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
>>> str00 = str1.decode('gb2312')
>>> str00
'我爱你'
>>> str11 = str1.decode('utf-8') # fails, because str1 holds GB2312-encoded bytes
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str11 = str1.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte

* The encoding argument specifies which codec to use. Decoding must use the same codec that was used for encoding.
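When the codec of some bytes is uncertain, decode() also accepts an errors argument. A sketch using the same bytes as the session above; errors='replace' avoids the exception but substitutes U+FFFD for undecodable bytes, so the text is not recovered.

```python
raw = '我爱你'.encode('gb2312')

# Decoding with the matching codec restores the original text.
right = raw.decode('gb2312')

# Decoding with the wrong codec raises in the default strict mode.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
lossy = raw.decode('utf-8', errors='replace')

print(right, strict_failed, '\ufffd' in lossy)
```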

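Another question that comes up here: what is the difference between read(), readline(), and readlines()?

read(): everything, as one object (bytes for a urlopen() response, str for a file opened in text mode)
readline(): a single line
readlines(): everything, as a list with one item per line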
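The three methods can be compared offline. In this sketch io.BytesIO stands in for the response object (an assumption made to avoid a network call); it offers the same three methods and also yields bytes.

```python
import io


def fake_response():
    """Return a fresh in-memory stream standing in for a urlopen() response."""
    return io.BytesIO(b'first line\nsecond line\n')


everything = fake_response().read()        # whole body as one bytes object
one_line = fake_response().readline()      # only the first line
all_lines = fake_response().readlines()    # list of lines, newlines kept

print(everything)
print(one_line)
print(all_lines)
```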
