urllib模块

urllib模块简介：

urllib提供了一系列用于操作URL的功能。包含urllib.request,urllib.error,urllib.parse,urllib.robotparser四个子模块

urllib.request打开和浏览url中内容
urllib.error包含从 urllib.request发生的错误或异常
urllib.parse解析url
urllib.robotparser解析 robots.txt文件

urllib.request.urlopen()格式：
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

urllib模块介绍：

urlopen函数参数：

url: 需要打开的网址
data：Post提交的数据
timeout：设置网站的访问超时时间

urlopen返回对象提供方法：

read() , readline() ,readlines() , fileno() , close() ：对HTTPResponse类型数据进行操作

info()：返回HTTPMessage对象，表示远程服务器返回的头信息

getcode()：返回Http状态码。如果是http请求，200请求成功完成 ; 404网址未找到

geturl()：返回请求的url

urlopen返回对象提供的属性：

- status：返回Http状态码。如果是http请求，200请求成功完成 ; 404网址未找到

- reason：返回数字，比如200

url参数的使用

get请求

例如对百度的一个URL https://www.baidu.com/进行抓取：

#ｄｅｍｏｅ５.py

from urllib import request

url = 'https://www.baidu.com/'

f = request.urlopen(url)

data = f.read()

print('Status:', f.status, f.reason)

print('Data:', data)

运行程序可以得到如下：

C:\Pycham\venv\Scripts\python.exe C:/Pycham/ｄｅｍｏｅ５.py

Status: 200 OK

Data: b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

Process finished with exit code 0

Data的数据格式为bytes类型，需要decode（）解码，转换成str类型

我们将最后一句代码改为

print('Data:', data.decode('utf-8'))

#ｄｅｍｏｅ５.py

from urllib import request

url = 'https://www.baidu.com/'

f = request.urlopen(url)

data = f.read()

print('Status:', f.status, f.reason)

print('Data:', data.decode('utf-8'))

运行程序可以得到如下：

C:\Pycham\venv\Scripts\python.exe C:/Pycham/ｄｅｍｏｅ５.py

Status: 200 OK

Data: <html>

<head>

	<script>

		location.replace(location.href.replace("https://","http://"));

	</script>

</head>

<body>

	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>

</body>

</html>

Process finished with exit code 0

这样得到的内容就可以与网页编码内容一样了

data参数的使用

post请求

urlopen（）的data参数默认为None，当data参数不为空的时候，urlopen（）提交方式为Post

Post的数据必须是bytes或者iterable of bytes，不能是str，如果是str需要进行encode（）编码

然后作为data参数传递给Request对象。编码是使用一个urllib.parse库中的函数完成的。

import urllib.parse

import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'

values = {'name' : 'Michael Foord',

          'location' : 'Northampton',

          'language' : 'Python' }

data = urllib.parse.urlencode(values)

data = data.encode('ascii') # data should be bytes

req = urllib.request.Request(url, data)

with urllib.request.urlopen(req) as response:

   the_page = response.read()

如果不传递data参数，那urllib就会使用GET请求方式。GET方式和POST方式的其中一个区别在于POST请求经常有副作用：它们会以某种方式改变系统的状态(比如，在网上下订单，会有一英担的午餐肉罐头送到你家门口)。尽管HTTP标准明确说POST方式总是会造成副作用，而GET方式从来不会，但是并没有保证措施。数据也可以用GET方式传递，只要把它编码在url中。

timeout参数的使用
在某些网络情况不好或者服务器端异常的情况会出现请求慢的情况，或者请求异常，所以这个时候我们需要给
请求设置一个超时时间，而不是让程序一直在等待结果。例子如下：

#ｄｅｍｏｅ５.py

from urllib import request

url = 'https://www.baidu.com/'

f = request.urlopen(url,timeout=0.1)

data = f.read()

print('Status:', f.status, f.reason)

print('Data:', data)

使用Request包装请求

有很多网站为了防止程序爬虫爬网站造成网站瘫痪或者会给不同的浏览器发送不同的版本，会需要携带一些headers头部信息才能访问，最长见的有user-agent参数

格式：

`urllib.request.Request`(url, data=None, headers={}, method=None)

使用request（）来包装请求，再通过urlopen（）获取页面

用来包装头部的数据：

User-Agent ：这个头部可以携带如下几条信息：浏览器名和版本号、操作系统名和版本号、默认语言
Referer：可以用来防止盗链，有一些网站图片显示来源http://***.com，就是检查Referer来鉴定的
Connection：表示连接状态，记录Session的状态。

第一种方法：

import urllib.request

url = "https://www.baidu.com/"

#创建Request对象

request = urllib.request.Request(url)

#添加http的header

request.add_header('User-Agent','Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)')

#发送请求获取结果

response2 = urllib.request.urlopen(request)

data =response2.read()

print(response2.status,response2.reason)       #打印请求的状态码

print(len(data))                               #输出网页字符串的长度

print(data.decode())                           #输出网页内容

运行结果如下：

C:\Pycham\venv\Scripts\python.exe C:/Pycham/ｄｅｍｏｅ５.py

200 OK

227

<html>

<head>

	<script>

		location.replace(location.href.replace("https://","http://"));

	</script>

</head>

<body>

	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>

</body>

</html>

Process finished with exit code 0

第二种方法：

import urllib.request

url= 'https://www.baidu.com/'

headers = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '

                  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',

    'Referer': 'https://www.baidu.com/',

    'Connection': 'keep-alive'

}

request = urllib.request.Request(url, headers=headers)

response = urllib.request.urlopen(request).read()

data = response.decode('utf-8')

print(data)

运行结果如下：

</script>

<script type="text/javascript">var Cookie={set:function(e,t,o,i,s,n){document.cookie=e+"="+(n?t:escape(t))+(s?"; expires="+s.toGMTString():"")+(i?"; path="+i:"; path=/")+(o?"; domain="+o:"")},get:function(e,t){var o=document.cookie.match(new RegExp("(^| )"+e+"=([^;]*)(;|$)"));return null!=o?unescape(o[2]):t},clear:function(e,t,o){this.get(e)&&(document.cookie=e+"="+(t?"; path="+t:"; path=/")+(o?"; domain="+o:"")+";expires=Fri, 02-Jan-1970 00:00:00 GMT")}};!function(){function save(e){var t=[];for(tmpName in options)options.hasOwnProperty(tmpName)&&"duRobotState"!==tmpName&&t.push('"'+tmpName+'":"'+options[tmpName]+'"');

var o="{"+t.join(",")+"}";bds.comm.personalData?$.ajax({url:"//www.baidu.com/ups/submit/addtips/?product=ps&tips="+encodeURIComponent(o)+"&_r="+(new Date).getTime(),success:function(){writeCookie(),"function"==typeof e&&e()}}):(writeCookie(),"function"==typeof e&&setTimeout(e,0))}function set(e,t){options[e]=t}function get(e){return options[e]}function writeCookie(){if(options.hasOwnProperty("sugSet")){var e="0"==options.sugSet?"0":"3";clearCookie("sug"),Cookie.set("sug",e,document.domain,"/",expire30y)

}if(options.hasOwnProperty("sugStoreSet")){var e=0==options.sugStoreSet?"0":"1";clearCookie("sugstore"),Cookie.set("sugstore",e,document.domain,"/",expire30y)}if(options.hasOwnProperty("isSwitch")){var t={0:"2",1:"0",2:"1"},e=t[options.isSwitch];clearCookie("ORIGIN"),Cookie.set("ORIGIN",e,document.domain,"/",expire30y)}if(options.hasOwnProperty("imeSwitch")){var e=options.imeSwitch;clearCookie("bdime"),Cookie.set("bdime",e,document.domain,"/",expire30y)}}function writeBAIDUID(){var e,t,o,i=Cookie.get("BAIDUID");

/FG=(\d+)/.test(i)&&(t=RegExp.$1),/SL=(\d+)/.test(i)&&(o=RegExp.$1),/NR=(\d+)/.test(i)&&(e=RegExp.$1),options.hasOwnProperty("resultNum")&&(e=options.resultNum),options.hasOwnProperty("resultLang")&&(o=options.resultLang),Cookie.set("BAIDUID",i.replace(/:.*$/,"")+("undefined"!=typeof o?":SL="+o:"")+("undefined"!=typeof e?":NR="+e:"")+("undefined"!=typeof t?":FG="+t:""),".baidu.com","/",expire30y,!0)}function clearCookie(e){Cookie.clear(e,"/"),Cookie.clear(e,"/",document.domain),Cookie.clear(e,"/","."+document.domain),Cookie.clear(e,"/",".baidu.com")

}function reset(e){options=defaultOptions,save(e)}var defaultOptions={sugSet:1,sugStoreSet:1,isSwitch:1,isJumpHttps:1,imeSwitch:0,resultNum:10,skinOpen:1,resultLang:0,duRobotState:"000"},options={},tmpName,expire30y=new Date;expire30y.setTime(expire30y.getTime()+94608e7);try{if(bds&&bds.comm&&bds.comm.personalData){if("string"==typeof bds.comm.personalData&&(bds.comm.personalData=eval("("+bds.comm.personalData+")")),!bds.comm.personalData)return;for(tmpName in bds.comm.personalData)defaultOptions.hasOwnProperty(tmpName)&&bds.comm.personalData.hasOwnProperty(tmpName)&&"SUCCESS"==bds.comm.personalData[tmpName].ErrMsg&&(options[tmpName]=bds.comm.personalData[tmpName].value)

}try{parseInt(options.resultNum)||delete options.resultNum,parseInt(options.resultLang)||"0"==options.resultLang||delete options.resultLang}catch(e){}writeCookie(),"sugSet"in options||(options.sugSet=3!=Cookie.get("sug",3)?0:1),"sugStoreSet"in options||(options.sugStoreSet=Cookie.get("sugstore",0));var BAIDUID=Cookie.get("BAIDUID");"resultNum"in options||(options.resultNum=/NR=(\d+)/.test(BAIDUID)&&RegExp.$1?parseInt(RegExp.$1):10),"resultLang"in options||(options.resultLang=/SL=(\d+)/.test(BAIDUID)&&RegExp.$1?parseInt(RegExp.$1):0),"isSwitch"in options||(options.isSwitch=2==Cookie.get("ORIGIN",0)?0:1==Cookie.get("ORIGIN",0)?2:1),"imeSwitch"in options||(options.imeSwitch=Cookie.get("bdime",0))

}catch(e){}window.UPS={writeBAIDUID:writeBAIDUID,reset:reset,get:get,set:set,save:save}}(),function(){var e="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/plugins/every_cookie_4644b13.js";("Mac68K"==navigator.platform||"MacPPC"==navigator.platform||"Macintosh"==navigator.platform||"MacIntel"==navigator.platform)&&(e="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/plugins/every_cookie_mac_82990d4.js"),setTimeout(function(){$.ajax({url:e,cache:!0,dataType:"script"})},0);var t=navigator&&navigator.userAgent?navigator.userAgent:"",o=document&&document.cookie?document.cookie:"",i=!!(t.match(/(msie [2-8])/i)||t.match(/windows.*safari/i)&&!t.match(/chrome/i)||t.match(/(linux.*firefox)/i)||t.match(/Chrome\/29/i)||t.match(/mac os x.*firefox/i)||o.match(/\bISSW=1/)||0==UPS.get("isSwitch"));

bds&&bds.comm&&(bds.comm.supportis=!i,bds.comm.isui=!0),window.__restart_confirm_timeout=!0,window.__confirm_timeout=8e3,window.__disable_is_guide=!0,window.__disable_swap_to_empty=!0,window.__switch_add_mask=!0;var s="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/global/js/all_async_search_7edb824.js",n="/script";document.write("<script src='"+s+"'><"+n+">"),bds.comm.newindex&&$(window).on("index_off",function(){$('<div class="c-tips-container" id="c-tips-container"></div>').insertAfter("#wrapper"),window.__sample_dynamic_tab&&$("#s_tab").remove()

}),bds.comm&&bds.comm.ishome&&Cookie.get("H_PS_PSSID")&&(bds.comm.indexSid=Cookie.get("H_PS_PSSID"));var a=$(document).find("#s_tab").find("a");a&&a.length>0&&a.each(function(e,t){t.innerHTML&&t.innerHTML.match(/新闻/)&&(t.innerHTML="资讯",t.href="//www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=",t.setAttribute("sync",!0))})}();</script>

<script>

if(bds.comm.supportis){

    window.__restart_confirm_timeout=true;

    window.__confirm_timeout=8000;

    window.__disable_is_guide=true;

    window.__disable_swap_to_empty=true;

}

initPreload({

    'isui':true,

    'index_form':"#form",

    'index_kw':"#kw",

    'result_form':"#form",

    'result_kw':"#kw"

});

</script>

<script>

if(navigator.cookieEnabled){

	document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";

}

</script>

</body>

</html>

Process finished with exit code 0

爬虫之urllib.request基础使用（一）的更多相关文章

Python做简单爬虫（urllib.request怎么抓取https以及伪装浏览器访问的方法）
一:抓取简单的页面: 用Python来做爬虫抓取网站这个功能很强大,今天试着抓取了一下百度的首页,很成功,来看一下步骤吧首先需要准备工具: 1.python:自己比较喜欢用新的东西,所以用的是Pyt ...
python爬虫之urllib库
请求库 urllib urllib主要分为几个部分 urllib.request 发送请求urllib.error 处理请求过程中出现的异常urllib.parse 处理urlurllib.robot ...
python3爬虫初探（一）之urllib.request
---恢复内容开始--- #小白一个,在此写下自己的python爬虫初步的知识.如有错误,希望谅解并指出. #欢迎和大家交流python爬虫相关的问题 #2016/6/18 #----第一把武器--- ...
python3.6 urllib.request库实现简单的网络爬虫、下载图片
#更新日志:#0418 爬取页面商品URL#0421 更新添加爬取下载页面图片功能#0423 更新添加发送邮件功能# 优化爬虫异常处理.错误页面及空页面处理# 优化爬虫关键字黑名单.白名单,提 ...
在python3中使用urllib.request编写简单的网络爬虫
转自:http://www.cnblogs.com/ArsenalfanInECNU/p/4780883.html Python官方提供了用于编写网络爬虫的包 urllib.request, 我们主要 ...
爬虫之urllib包以及request模块和parse模块
urllib简介简介 Python3中将python2.7的urllib和urllib2两个包合并成了一个urllib库 Python3中,urllib库包含有四个模块: urllib.reques ...
爬虫——urllib.request库的基本使用
所谓网页抓取,就是把URL地址中指定的网络资源从网络流中读取出来,保存到本地.在Python中有很多库可以用来抓取网页,我们先学习urllib.request.(在python2.x中为urllib2 ...
爬虫初探(1)之urllib.request
-----------我是小白------------ urllib.request是python3自带的库(python3.x版本特有),我们用它来请求网页,并获取网页源码. # 导入使用库 imp ...
爬虫小探-Python3 urllib.request获取页面数据
使用Python3 urllib.request中的Requests()和urlopen()方法获取页面源码,并用re正则进行正则匹配查找需要的数据. #forex.py#coding:utf-8 ' ...

随机推荐

php 获取压缩包名
参考链接: https://segmentfault.com/q/1010000000721799 通过curl方式获取压缩包名: function getFile($url, $save_dir = ...
变量,id()
>>> a = 1 >>> print id(a) 2870961640 >>> b = a >>> print id(b) 2 ...
SpringBoot2.x个性化启动banner设置和debug日志
3.SpringBoot2.x个性化启动banner设置和debug日志简介:自定义应用启动的趣味性日志图标和查看调试日志 1.启动获取更多信息 java -jar xxx.jar --debug ...
2017-2018-2 20155303『网络对抗技术』Exp9：Web安全基础
2017-2018-2 『网络对抗技术』Exp9:Web安全基础 --------CONTENTS-------- 一.基础问题回答 1.SQL注入攻击原理,如何防御? 2.XSS攻击的原理,如何防御 ...
电脑kail linux 连接手机Nethunter，手机和电脑互传文件
1.开启nethunter的ssh 修改/etc/ssh/sshd_config 参考:解决kali linux 开启ssh服务后连接不上的问题 2.如果在手机终端修改不了(我的就是怎么也改不了),可 ...
aircrack-ng笔记
开启监听: airmon-ng start wlan0 抓包: airodump-ng wlan0mon 查看wifi ^C结束 airodump-ng -c 6 --bssid C8:3A:35:3 ...
C++学习1-（C语言基础、VS快捷键）
C语言基础复习 1.三码正数: 3码合1 ,正数的反码/补码就是其本身负数: 原码就是符号位加上真值的绝对值,即用第一位表示符号,其余位表示值原码:11010101 负数的反码是在其原码的基础上 ...
如何交叉编译 linux kernel 内核
Compilation We first need to move the config file by running cp arch/arm/configs/bcmrpi_cutdown_defc ...
Jenkins与网站代码上线解决方案【转】
转自 Jenkins与网站代码上线解决方案 - 惨绿少年 https://www.nmtui.com/clsn/lx524.html 1.1 前言 Jenkins是一个用Java编写的开源的持续集成工 ...
xpath路径前可用方法测试
$x("string-length(//h3[@class='t'])") 8 $x("string(//h3[@class='t'])") "XPa ...

爬虫之urllib.request基础使用（一）