The urllib module



 Fetching a web page,

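     import urllib.request

     # urlopen() sends the request and returns an HTTPResponse object
     html = urllib.request.urlopen("http://www.zzyzz.top/")

     # Request() only builds a request object; nothing is sent yet
     html2 = urllib.request.Request("http://www.zzyzz.top/")

     print("html",html)

     print("html2",html2)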

     output,

         html <http.client.HTTPResponse object at 0x0395DFF0>

         html2 <urllib.request.Request object at 0x03613930>

         Methods of HTTPResponse object,

             geturl() - return the URL of the resource retrieved,

                     commonly used to determine if a redirect was followed.

                     This is the URL of the page finally served, which is not necessarily

                     the URL passed in, because a redirect may have been followed.

             info() - return the meta-information of the page, such as headers, in the

                     form of an email.message_from_string() instance (see Quick Reference

                     to HTTP Headers)

             getcode() - return the HTTP status code of the response.
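             The three methods above can be tried without touching a real site. The sketch
             below is only an illustration: DemoHandler, inspect_response and the /old and
             /new paths are hypothetical names, and the server is a throwaway local one
             from http.server. An opener with an empty ProxyHandler is used so that proxy
             settings in the environment do not interfere with the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /old redirects to /new, which serves a tiny page."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            body = b"<html>hello</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def inspect_response():
    # Port 0 lets the operating system pick a free port.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/old" % server.server_port
        with opener.open(url) as resp:
            # geturl() reports the URL after the redirect, not the one requested.
            return resp.geturl(), resp.getcode(), resp.info()["Content-Type"]
    finally:
        server.shutdown()
        server.server_close()


final_url, status, content_type = inspect_response()
print(final_url, status, content_type)
```

             The request goes to /old, but geturl() ends in /new because the redirect was
             followed, while getcode() and info() describe the final response.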

         Attributes and methods of Request object,

             Request.full_url

                 The original URL passed to the constructor.

                 Request.full_url is a property with setter, getter and a deleter.

                 Getting full_url returns the original request URL with the fragment,

                 if it was present.

                 This is the 'URL' argument (unlike the geturl() method of the HTTPResponse object).

             Request.type

                 The URI scheme.

                 A string such as 'http' or 'https'.

             Request.host

                 The URI authority, typically a host, but may also contain a port

                 separated by a colon.

                 This is the host name or IP address (possibly followed by the port number).

             Request.origin_req_host

                 The original host for the request, without port.

                 This is the host name or IP address, without the port.

             Request.selector

                 The URI path. If the Request uses a proxy, then selector will be the

                 full URL that is passed to the proxy.

                 This is the path requested on the server (relative to the server root),

                 for example '/' means the server root directory.

             Request.data

                 The entity body for the request, or None if not specified.

                 For example, the form data of a POST. The body must be bytes, not a dict:

                     # data = urllib.parse.urlencode({"Hi":"Hello"}).encode("ascii")

                     # urllib.request.Request("http://www.zzyzz.top/",data)

             Request.unverifiable

                 boolean, indicates whether the request is unverifiable as defined by RFC 2965.

             Request.method

                 The HTTP request method to use. By default its value is None, which means

                 that get_method() will do its normal computation of the method to be used.

                 Its value can be set (thus overriding the default computation in get_method())

                 either by providing a default value by setting it at the class level in a

                 Request subclass, or by passing a value in to the Request constructor

                 via the method argument.

             Request.get_method()

                 Return a string indicating the HTTP request method. If Request.method

                 is not None, return its value, otherwise return 'GET' if Request.data

                 is None, or 'POST' if it’s not. This is only meaningful for HTTP requests.

                 Typically 'GET' or 'POST'.

             Request.add_header(key, val)

                 Add another header to the request. Headers are currently ignored by

                 all handlers except HTTP handlers, where they are added to the list

                 of headers sent to the server. Note that there cannot be more than

                 one header with the same name, and later calls will overwrite previous

                 calls in case the key collides. Currently, this is no loss of HTTP

                 functionality, since all headers which have meaning when used more

                 than once have a (header-specific) way of gaining the same

                 functionality using only one header.

             Request.add_unredirected_header(key, header)

                 Add a header that will not be added to a redirected request.

             Request.has_header(header)

                 Return whether the instance has the named header (checks both

                 regular and unredirected).

             Request.remove_header(header)

                 Remove named header from the request instance (both from regular

                 and unredirected headers).

             Request.get_full_url()

                 Return the URL given in the constructor.

                 This returns the same value as Request.full_url.

             Request.set_proxy(host, type)

                 Prepare the request by connecting to a proxy server. The host and

                 type will replace those of the instance, and the instance’s selector

                 will be the original URL given in the constructor.

             Request.get_header(header_name, default=None)

                 Return the value of the given header. If the header is not present,

                 return the default value.

             Request.header_items()

                 Return a list of tuples (header_name, header_value) of the Request headers.
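             Most of the attributes and methods listed above can be inspected without
             sending anything, since a Request object is only a description of a request.
             The following is a small sketch; describe_request, the port 8080, the path,
             the query and the "demo-client" header value are made up for illustration.

```python
import urllib.parse
import urllib.request


def describe_request():
    # Hypothetical URL with an explicit port, a query string and a fragment.
    url = "http://www.zzyzz.top:8080/path/page?x=1#top"
    req = urllib.request.Request(url)
    info = {
        "full_url": req.full_url,                # original URL, fragment included
        "type": req.type,                        # URI scheme
        "host": req.host,                        # authority, port included
        "origin_req_host": req.origin_req_host,  # host without the port
        "selector": req.selector,                # path plus query
        "method_without_data": req.get_method(),
    }

    # Attaching a bytes body switches the computed method to POST.
    req.data = urllib.parse.urlencode({"Hi": "Hello"}).encode("ascii")
    info["method_with_data"] = req.get_method()

    # add_header() stores the key via str.capitalize(), so it is looked up
    # as "User-agent" rather than "User-Agent".
    req.add_header("User-Agent", "demo-client")
    info["has_header"] = req.has_header("User-agent")
    info["header_value"] = req.get_header("User-agent")
    req.remove_header("User-agent")
    info["after_remove"] = req.has_header("User-agent")
    return info


for name, value in describe_request().items():
    print(name, "=", value)
```

             Note the header lookup: because the key is stored capitalized, has_header()
             and get_header() need the same capitalized spelling.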

     Example, fetching the HTML source,

         urlobj = urllib.request.Request("http://www.zzyzz.top/")

         with urllib.request.urlopen(urlobj) as FH:           # file-like object

             print(FH.read().decode('utf8'))
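     The example above assumes the page is UTF-8. A possible variation, sketched below,
     reads the charset from the response headers and only falls back to a default.
     fetch_text and the default_charset parameter are hypothetical names, and the
     data: URL is used only so that the sketch runs without a network connection.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Read a URL and decode it with the charset declared in its headers."""
    with urllib.request.urlopen(url) as resp:
        # get_content_charset() returns None when no charset is declared.
        charset = resp.headers.get_content_charset() or default_charset
        return resp.read().decode(charset)


# The byte 0xE9 is "e with acute accent" in ISO-8859-1; ascii() keeps the
# output printable on any terminal.
print(ascii(fetch_text("data:text/plain;charset=iso-8859-1,caf%E9")))
```

     A page that declares its encoding only inside the HTML (a meta tag) is not covered
     by this sketch; in that case the fallback charset is used.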

 Authentication,

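     Accessing a URL that requires authentication returns an HTTP 401 error, which

     means the URL needs authentication.

     Authentication usually takes one of two forms,

         1, The browser shows a pop-up dialog that asks the user for a user name and

             password. This is Basic HTTP Authentication, which is based on HTTP headers

             (the WWW-Authenticate challenge and the Authorization request header).

         2, Form-based authentication, where the web page asks the user for a user name

             and password and sends them to the server with a POST request.

         Header-based authentication  -  Basic HTTP Authentication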

             import urllib.request

             # Create an OpenerDirector with support for Basic HTTP Authentication...

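             # realm=None only matches every realm with a default-realm password manager

             password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

             password_mgr.add_password(realm= None,

                                       uri="http://www.zzyzz.top/",

                                       user='userid',

                                       passwd='password')

             auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)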

             opener = urllib.request.build_opener(auth_handler)

             # ...and install it globally so it can be used with urlopen.

             urllib.request.install_opener(opener)

             html = urllib.request.urlopen("http://www.zzyzz.top/")

             print(html.read().decode('utf8'))
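             To see what the handler ends up sending, the credentials can also be attached
             by hand. This is a minimal sketch of the Basic scheme, not what
             HTTPBasicAuthHandler does internally step by step; basic_auth_request is a
             hypothetical helper, and no request is actually sent.

```python
import base64
import urllib.request


def basic_auth_request(url, user, password):
    """Build a Request that carries a Basic Authorization header."""
    # Basic authentication is "user:password" encoded with base64.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url)
    req.add_header("Authorization", "Basic " + token)
    return req


req = basic_auth_request("http://www.zzyzz.top/", "userid", "password")
print(req.get_header("Authorization"))
```

             Base64 is an encoding, not encryption, so Basic authentication only protects
             the password when the connection itself is encrypted (https).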

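         Form-based authentication,

             The server usually handles this by validating the form data that the user

             submits (POST). If validation succeeds it redirects to the protected page,

             otherwise it redirects back to the login page and asks the user to POST the

             credentials again.

             So this kind of authentication can be treated as an ordinary form POST.

             import urllib.parse, urllib.request

             # the POST body must be bytes, so urlencode the form and encode it

             data = urllib.parse.urlencode({"id":"userid","pw":"password"}).encode("ascii")

             urlobj = urllib.request.Request("http://www.zzyzz.top/",data)

             with urllib.request.urlopen(urlobj) as FH:           # file-like object

                 print(FH.read().decode('utf8'))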
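             The validate-then-redirect flow described in this section can be sketched end
             to end with a throwaway local server. Everything about the server is an
             assumption made for illustration: LoginHandler, the /login, /welcome paths and
             the accepted credentials are hypothetical, and a real site will use its own
             field names and usually a session cookie as well.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical server: POST /login redirects according to the credentials."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("ascii"))
        ok = form.get("id") == ["userid"] and form.get("pw") == ["password"]
        self.send_response(302)
        self.send_header("Location", "/welcome" if ok else "/login")
        self.end_headers()

    def do_GET(self):
        # Every GET simply echoes the requested path.
        body = self.path.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def login(user, password):
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        data = urllib.parse.urlencode({"id": user, "pw": password}).encode("ascii")
        url = "http://127.0.0.1:%d/login" % server.server_port
        with opener.open(urllib.request.Request(url, data)) as resp:
            # urllib follows the 302 with a GET; geturl() shows where it ended up.
            return resp.geturl()
    finally:
        server.shutdown()
        server.server_close()


print(login("userid", "password"))
print(login("userid", "wrong"))
```

             Checking geturl() after the POST is one simple way to tell whether the login
             was accepted, since a failed attempt lands back on the login page.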

 Error handling

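     urllib errors fall mainly into two categories, connection errors and data errors.

         Connection errors (a malformed URL, a URL that uses an unsupported protocol,

         a host name that does not exist, an HTTP error status such as 404 Not Found, etc.),

             Exceptions raised while connecting are instances of urllib.request.URLError

             or of one of its subclasses (both classes are defined in urllib.error).

             One such subclass is urllib.request.HTTPError, which is also a file-like

             object, so the error page can be read from it.

             Example,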

                 import sys, urllib.request

                 urlobj = urllib.request.Request("http://10.240.26.249/HELLO")

                 try:

                     with urllib.request.urlopen(urlobj) as FH:

                         print(FH.read().decode('utf8'))

                 except urllib.request.HTTPError as e:

                     print("HTTPError has been detected : ", e )

                     print("Error document :\n")

                     print(e.read().decode('utf8'))

                     sys.exit(1)

                 except urllib.request.URLError as e:

                     print("URLError has been detected : ", e)

                     sys.exit(2)

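         Data errors,

             For example, a communication failure can make read() raise socket.error

             (an alias of OSError in Python 3), or the connection may be interrupted

             while data is being transferred.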
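             Both categories can be handled in one place by catching the exceptions from
             the most specific to the most general. The sketch below is one possible
             arrangement, not the only one; classify, demo, NotFoundHandler and the
             "nosuchscheme" URL are hypothetical, and a local server stands in for a real
             host so the example runs offline.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical server that answers every GET with a 404 error page."""

    def do_GET(self):
        body = b"no such page"
        self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def classify(url, timeout=5):
    # An empty ProxyHandler disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return ("ok", resp.read())
    except urllib.error.HTTPError as e:   # the server answered with an error status
        return ("http-error", e.code, e.read())
    except urllib.error.URLError as e:    # the request could not be made at all
        return ("url-error", str(e.reason))
    except OSError as e:                  # socket-level failure, e.g. during read()
        return ("io-error", str(e))


def demo():
    server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        missing = classify("http://127.0.0.1:%d/HELLO" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()
    unsupported = classify("nosuchscheme://example/")
    return missing, unsupported


for result in demo():
    print(result)
```

             The order of the except clauses matters: HTTPError is a subclass of URLError,
             which in turn is a subclass of OSError, so the general cases must come last.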

 Reference,

     https://docs.python.org/3/library/urllib.request.html#module-urllib.request
