Web scraping notes: basic usage of the urllib library (Day 1)

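I. The urllib library in detail

1. What is urllib?

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: the request module, which opens a URL you pass in and returns the response.

urllib.error: the exception module. Catching its exceptions lets a program retry or take other action instead of terminating unexpectedly.

urllib.parse: the URL parsing module, a utility module with functions for splitting, joining, and encoding URLs.

urllib.robotparser: the robots.txt parsing module, which reads a site's robots.txt file to determine which paths may be crawled and which may not.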
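As a quick orientation, here is a minimal sketch that touches each of the four modules without sending anything over the network. The robots.txt rules are made up for illustration and are not Baidu's real rules.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.request: build a request object (nothing is sent here)
req = urllib.request.Request('http://www.baidu.com/index.html')
print(req.full_url, req.get_method())

# urllib.error: HTTPError is a subclass of URLError
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))

# urllib.parse: split the URL into its components
parts = urllib.parse.urlparse(req.full_url)
print(parts.netloc, parts.path)

# urllib.robotparser: feed it robots.txt lines (hypothetical rules) and ask what is allowed
robots = urllib.robotparser.RobotFileParser()
robots.parse(['User-agent: *', 'Disallow: /private/'])
allowed = robots.can_fetch('*', 'http://www.baidu.com/index.html')
blocked = robots.can_fetch('*', 'http://www.baidu.com/private/data.html')
print(allowed, blocked)
```

In a real crawler you would normally call set_url() and read() on the parser to download the site's actual robots.txt instead of passing lines by hand.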

2. Changes from Python 2

Python 2

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

Python 3

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
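The urlopen rename is one of several. The sketch below lists a few common Python 2 names next to their Python 3 homes and checks that each Python 3 name can be imported. The resolve helper and the choice of entries are my own, not an exhaustive migration table.

```python
import importlib

# Python 2 name -> Python 3 name (a small, non-exhaustive selection)
PY2_TO_PY3 = {
    'urllib2.urlopen': 'urllib.request.urlopen',
    'urllib2.Request': 'urllib.request.Request',
    'urllib2.URLError': 'urllib.error.URLError',
    'urllib.urlencode': 'urllib.parse.urlencode',
    'urllib.quote': 'urllib.parse.quote',
    'urlparse.urlparse': 'urllib.parse.urlparse',
}


def resolve(dotted_name):
    """Import the module part of a dotted name and return the attribute."""
    module_name, _, attr = dotted_name.rpartition('.')
    return getattr(importlib.import_module(module_name), attr)


for old, new in PY2_TO_PY3.items():
    status = 'ok' if callable(resolve(new)) else 'missing'
    print(f'{old:20} -> {new} ({status})')
```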

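3. Basic usage

urllib

urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Note: cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; pass an ssl.SSLContext through context instead.

Method 1: a plain GET request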

 import urllib.request

 response = urllib.request.urlopen('http://www.baidu.com')

 print(response.read().decode('utf-8'))  # read the response body and decode it as UTF-8

Method 2: a POST request with form data

import urllib.request

import urllib.parse

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')

response = urllib.request.urlopen('http://httpbin.org/post',data=data) # passing data sends a POST request; without it the request is a GET

print(response.read())
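To see what the data argument actually contains, and how it switches the method from GET to POST, the sketch below builds the requests without sending them. The variable names are my own.

```python
import urllib.parse
import urllib.request

payload = {'word': 'hello'}

# urlencode turns the dict into a form-encoded string; urlopen needs bytes
encoded = urllib.parse.urlencode(payload)
data = encoded.encode('utf-8')  # same result as bytes(encoded, encoding='utf-8')

# Build the requests only; nothing goes over the network here
get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

print(encoded, data)
print(get_req.get_method(), post_req.get_method())
```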

Method 3: setting a timeout (in seconds)

 import urllib.request

 response = urllib.request.urlopen('http://httpbin.org/get',timeout=1)

 print(response.read())

Method 4: handling a timeout

 import socket

 import urllib.request

 import urllib.error

 try:

     response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)

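 except urllib.error.URLError as e:

     if isinstance(e.reason,socket.timeout):  # a timeout while connecting is wrapped in URLError

         print('TIME OUT')

 except socket.timeout:  # a timeout while waiting for the response is raised directly

     print('TIME OUT')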
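The timeout check can be wrapped in a small helper and tried out without depending on a remote site. This is a sketch under my own assumptions: fetch_with_timeout is a hypothetical name, and the "slow server" is a local socket that accepts connections at the OS level but never replies.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout):
    """Return the response body, or None if the request timed out."""
    # ProxyHandler({}) bypasses any proxy configured in the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # A timeout while connecting arrives wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # A timeout while waiting for the response is raised directly
        return None


# Hypothetical stand-in for a slow site: it listens but never answers
silent_server = socket.socket()
silent_server.bind(('127.0.0.1', 0))
silent_server.listen(1)
port = silent_server.getsockname()[1]

result = fetch_with_timeout(f'http://127.0.0.1:{port}/', timeout=0.5)
silent_server.close()
print('TIME OUT' if result is None else result)
```

Returning None on a timeout keeps the caller simple; a real crawler might retry or log the URL instead.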

Response

Response type

 import urllib.request

 response = urllib.request.urlopen('http://www.baidu.com')

 print(type(response))  # <class 'http.client.HTTPResponse'>

Status code and response headers

 import urllib.request

 response = urllib.request.urlopen('http://www.python.org')

 print(response.status) # status code

 print(response.getheaders())  # all response headers

 print(response.getheader('Server')) # one specific header; Server is used as the example
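The same attributes can be explored offline against a throwaway local server, which makes the output predictable. In this sketch HelloHandler and the header values are my own, and the with statement closes the response when the block ends.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny demo handler that answers every GET with the text 'hello'."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

# ProxyHandler({}) makes sure the local request does not go through a proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    content_type = response.getheader('Content-Type')
    text = response.read().decode('utf-8')

server.shutdown()
server.server_close()
print(status, content_type, text)
```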

Request

Passing the URL to urlopen as a Request object

 import urllib.request

 request = urllib.request.Request('https://python.org') # wrap the URL in a Request object

 response = urllib.request.urlopen(request)  # urlopen accepts the Request object just like a URL string

 print(response.read().decode('utf-8'))

Adding headers and form data to a Request

 from urllib import request,parse

 url = 'http://httpbin.org/post'

 headers={

     'User-Agent':'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',

     'Host':'httpbin.org'

 }

 form_data = {  # named form_data so the built-in dict is not shadowed

     'name':'Germey'

 }

 data = bytes(parse.urlencode(form_data),encoding='utf-8')

 req = request.Request(url=url,data=data,headers=headers,method='POST')

 response = request.urlopen(req)

 print(response.read().decode('utf-8'))

The add_header() method

 from urllib import request,parse

 url = 'http://httpbin.org/post'

 form_data = {  # named form_data so the built-in dict is not shadowed

     'name':'Germey'

 }

 data = bytes(parse.urlencode(form_data),encoding='utf-8')

 req = request.Request(url=url,data=data,method='POST')

 req.add_header('User-Agent','Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

 response = request.urlopen(req)

 print(response.read().decode('utf-8'))
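Both ways of setting headers end up in the same place on the Request object, which can be inspected before anything is sent. One detail worth knowing: Request stores header names capitalized, so 'User-Agent' is looked up as 'User-agent'. The variable names below are my own.

```python
from urllib import parse, request

form_data = {'name': 'Germey'}

# Headers passed to the constructor and headers added later are stored together
req = request.Request(
    url='http://httpbin.org/post',
    data=parse.urlencode(form_data).encode('utf-8'),
    headers={'Host': 'httpbin.org'},
    method='POST',
)
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

print(req.get_method())
print(req.data)
print(req.header_items())
print(req.get_header('User-agent'))  # stored name is capitalized
print(req.get_header('User-Agent'))  # the original spelling is not found
```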

Handler

Proxies

 import urllib.request

 # Build two proxy handlers: one with a proxy address, one with no proxies

 httpproxy_handler = urllib.request.ProxyHandler({"http" : "127.0.0.1:9743"})

 nullproxy_handler = urllib.request.ProxyHandler({})

 # A switch that turns the proxy on or off

 proxySwitch = True

 # Pass a handler to urllib.request.build_opener() to create a custom opener

 # Which handler is used depends on the proxy switch

 if proxySwitch:

     opener = urllib.request.build_opener(httpproxy_handler)

 else:

     opener = urllib.request.build_opener(nullproxy_handler)

 request = urllib.request.Request("http://www.baidu.com/")

 # Only requests sent with opener.open() use the custom proxy; a plain urlopen() does not

 response = opener.open(request)

 # install_opener() makes the opener global, so later urlopen() calls also go through the custom proxy

 urllib.request.install_opener(opener)

 # response = urllib.request.urlopen(request)

 print(response.read())

Building a proxy handler with the chosen HTTP and HTTPS proxies

 import urllib.request

 # Build a proxy handler object from the chosen proxies

 proxy_handler = urllib.request.ProxyHandler({

     'http':'http://127.0.0.1:9743',

     'https':'https://127.0.0.1:9743'

 })

 opener = urllib.request.build_opener(proxy_handler)

 request = urllib.request.Request("http://www.baidu.com")

 response = opener.open(request)

 print(response.read())
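The proxy switch can be folded into a small factory function, and the resulting opener can be inspected without sending a request. This is a sketch: make_opener and configured_proxies are hypothetical helpers, and 127.0.0.1:9743 is only the placeholder address used in these notes.

```python
import urllib.request


def make_opener(proxy_url=None):
    """Return an opener that uses proxy_url, or no proxy at all when it is None."""
    # An empty dict disables proxies, including ones set in the environment
    proxies = {'http': proxy_url, 'https': proxy_url} if proxy_url else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def configured_proxies(opener):
    """Collect the proxy mapping from the opener's ProxyHandler, if it has one."""
    found = {}
    for handler in opener.handlers:
        if isinstance(handler, urllib.request.ProxyHandler):
            found.update(handler.proxies)
    return found


proxied = make_opener('http://127.0.0.1:9743')
direct = make_opener()

print(configured_proxies(proxied))
print(configured_proxies(direct))
```

Keeping the opener local and calling opener.open() avoids the global side effect of install_opener(), which is easier to reason about when only some requests should use the proxy.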

Cookies: a mechanism for maintaining login state

Retrieving cookies

 import http.cookiejar,urllib.request

 cookie = http.cookiejar.CookieJar()

 handler = urllib.request.HTTPCookieProcessor(cookie)

 opener = urllib.request.build_opener(handler)

 response = opener.open('http://www.baidu.com')

 for item in cookie:

     print(item.name+"="+item.value)

Saving cookies to a text file

 import http.cookiejar,urllib.request

 filename = "cookie.txt"

 cookie = http.cookiejar.MozillaCookieJar(filename) # MozillaCookieJar is a CookieJar subclass that can write to a file

 handler = urllib.request.HTTPCookieProcessor(cookie)

 opener = urllib.request.build_opener(handler)

 response = opener.open('http://www.baidu.com')

 cookie.save(ignore_discard=True,ignore_expires=True) # save() writes the cookies to the text file in Mozilla cookies.txt format

Method 2: saving cookies in another format (LWP)

 import http.cookiejar,urllib.request

 filename = "cookie.txt"

 cookie = http.cookiejar.LWPCookieJar(filename) # LWPCookieJar is another CookieJar subclass that can write to a file

 handler = urllib.request.HTTPCookieProcessor(cookie)

 opener = urllib.request.build_opener(handler)

 response = opener.open('http://www.baidu.com')

 cookie.save(ignore_discard=True,ignore_expires=True) # save() writes the cookies to the text file in LWP format

Loading the cookies saved with method 2 (LWPCookieJar)

import http.cookiejar,urllib.request

cookie = http.cookiejar.LWPCookieJar()

cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

# The cookies saved in the text file are loaded and sent with the request, so the response is the page as seen when logged in

print(response.read().decode('utf-8'))
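The save and load round trip can be tried without a network connection by putting a hand-built cookie into the jar. The cookie values below are hypothetical, and make_session_cookie and round_trip are my own helper names. The sketch also shows why ignore_discard=True matters: session cookies are marked as discardable and are skipped on save without it.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical values, no network needed)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(ignore_discard):
    """Save one session cookie, load the file into a fresh jar, return name -> value."""
    with tempfile.TemporaryDirectory() as tmp:
        filename = os.path.join(tmp, 'cookie.txt')
        jar = http.cookiejar.LWPCookieJar(filename)
        jar.set_cookie(make_session_cookie('session', 'abc123', 'example.com'))
        jar.save(ignore_discard=ignore_discard, ignore_expires=True)

        loaded = http.cookiejar.LWPCookieJar()
        loaded.load(filename, ignore_discard=True, ignore_expires=True)
        return {cookie.name: cookie.value for cookie in loaded}


kept = round_trip(ignore_discard=True)
dropped = round_trip(ignore_discard=False)
print(kept, dropped)
```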

URL parsing

 # urlparse  urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

 # Split a URL into its components

 from urllib.parse import urlparse,urlunparse

 result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

 print(type(result),result) # Output: <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 # Supply a default scheme for a URL that has none (without a leading '//', the host is parsed as part of the path)

 result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')

 print(result) # Output: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

 # If the URL already includes a scheme, that scheme wins and the scheme argument is ignored

 result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')

 print(result) # Output: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 # Fragments (anchors): the allow_fragments parameter

 result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)

 print(result) # '#comment' stays in the query: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

 # With no query in the URL, '#comment' stays in the path instead

 result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)

 print(result) # Output: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

 #-----------------------------------------------------------------------------------------------------------------------

 # urlunparse: assemble the six URL components into a complete URL

 data = ['http','www.baidu.com','index.html','user','a=6','comment']

 print(urlunparse(data)) # Output: http://www.baidu.com/index.html;user?a=6#comment

 #-----------------------------------------------------------------------------------------------------------------------

 # urljoin: components present in the second URL override those of the first (base) URL

 from urllib.parse import urljoin

 print(urljoin('http://www.baidu.com/about.html','https://cuiqincai.com/FAQ.html'))

 # Output: https://cuiqincai.com/FAQ.html

 #-----------------------------------------------------------------------------------------------------------------------

 from urllib.parse import urlencode

 params = {

     'name':'germey',

     'age':22

 }

 base_url = 'http://www.baidu.com?'

 url = base_url + urlencode(params) # convert the dict into a query string

 print(url) # Output: http://www.baidu.com?name=germey&age=22
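The urljoin example above only covers the case where the second URL is absolute. The sketch below adds the relative cases, which are the ones a crawler meets most often when following links, and shows parse_qs as the reverse of urlencode. The base URL path is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin

base = 'http://www.baidu.com/docs/about.html'

# A relative path replaces the last path segment of the base URL
sibling = urljoin(base, 'FAQ.html')
# A path starting with '/' replaces the whole path but keeps scheme and host
rooted = urljoin(base, '/index.html')
# An absolute URL replaces the base entirely
absolute = urljoin(base, 'https://cuiqincai.com/FAQ.html')
print(sibling, rooted, absolute, sep='\n')

# parse_qs reverses urlencode, but every value comes back as a list of strings
query = urlencode({'name': 'germey', 'age': 22})
decoded = parse_qs(query)
print(query, decoded)
```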

Exception handling

 from urllib import request,error # used by examples 1 and 2

 # Example 1: print the reason for a failure

 try:

     response = request.urlopen('http://wyh.com/index.html')

 except error.URLError as e:

     print(e.reason) # print the reason; the program keeps running instead of crashing

 # Example 2: catch specific exceptions

 try:

     response = request.urlopen('http://wyh.com/index.html')

 except error.HTTPError as e: # HTTPError is the subclass, so it must be caught first

     print(e.reason,e.code,e.headers,sep='\n') # e.headers holds the response headers

 except error.URLError as e:  # URLError is the parent class

     print(e.reason)

 else:

     print('Request successful!')

 # Example 3: check the reason for the failure

 import socket

 import urllib.request

 import urllib.error

 try:

     response = urllib.request.urlopen('http://www.baidu.com',timeout=0.01)

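 except urllib.error.URLError as e:

     print(type(e.reason)) # e.reason is an exception object; print its class

     if isinstance(e.reason,socket.timeout): # isinstance() checks whether the reason is a timeout

         print('TIME OUT!')

 except socket.timeout: # a timeout while waiting for the response is raised directly

     print('TIME OUT!')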
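The order of the except clauses can be checked against a throwaway local server that returns a 404 for one path. In this sketch DemoHandler, classify, and the /missing path are my own names, not part of urllib.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Answers 404 for /missing and 200 for everything else."""

    def do_GET(self):
        if self.path == '/missing':
            self.send_error(404)
            return
        body = b'ok'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def classify(url, timeout=5):
    """Return a (kind, detail) tuple describing how the request ended."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return ('ok', response.status)
    except urllib.error.HTTPError as e:  # subclass first
        return ('http_error', e.code)
    except urllib.error.URLError as e:   # then the parent class
        if isinstance(e.reason, socket.timeout):
            return ('timeout', None)
        return ('url_error', type(e.reason).__name__)
    except socket.timeout:               # timeouts raised outside URLError
        return ('timeout', None)


server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

found = classify(base + '/')
missing = classify(base + '/missing')

server.shutdown()
server.server_close()
print(found, missing)
```

If the URLError clause came first, it would also swallow every HTTPError, and the status code branch would never run.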
