Basic urllib usage: urlopen() and Request

Commonly used modules in urllib:
import urllib.request      # open and read URLs
import urllib.error        # exceptions raised by urllib.request
import urllib.parse        # URL parsing
import urllib.robotparser  # robots.txt parsing

"""
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, 
    cadefault=False, context=None)

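url: the URL to open (a string or a Request object)
data: defaults to None; when data is not None the request is sent as POST. It must be a bytes object.
timeout: how long to wait, in seconds, before the request times out
(cafile, capath and cadefault were removed in Python 3.13; pass an ssl.SSLContext as context instead.)

Calling urlopen() from urllib.request fetches the page directly. read() returns bytes, so call decode() to convert the result to str.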

# 1. GET request
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com/')
print(response.read().decode('utf-8'))

# 2. POST request
import urllib.request
import urllib.parse

# urlencode() returns a str; urlopen() needs bytes, so encode it
data = urllib.parse.urlencode({'world': 'hello'}).encode('utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

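# 3. Timeout
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
except urllib.error.URLError as e:
    # a timeout while connecting or sending is wrapped in URLError
    if isinstance(e.reason, socket.timeout):
        print('time out')
except socket.timeout:
    # a timeout while waiting for the response is raised directly
    print('time out')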

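The object returned by urlopen() provides:

- read(), readline(), readlines(), fileno(), close(): file-like operations on the HTTPResponse
- info(): returns an HTTPMessage object holding the headers sent by the remote server
- getcode(): returns the HTTP status code, e.g. 200 when the request succeeded. An error status such as 404 (not found) is not returned this way; urlopen() raises urllib.error.HTTPError instead
- geturl(): returns the URL of the resource actually retrieved (which may differ from the requested URL after a redirect)
- response.status: the status code of the response
- response.getheader(name): returns the value of the named response header; response.getheaders() returns all headers as (name, value) pairs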
"""
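The attributes listed above can be tried without touching the network. The sketch below is only an illustration: `HelloHandler` and `inspect_response` are hypothetical names, and the throwaway server on 127.0.0.1 is an assumption made so the example runs offline. It uses an opener with an empty `ProxyHandler` rather than a bare `urlopen()` call so that proxy environment variables cannot reroute the loopback request; `opener.open()` returns the same kind of response object as `urlopen()`.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short UTF-8 body."""

    def do_GET(self):
        body = "hello urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def inspect_response():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    # Empty ProxyHandler: ignore any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return {
                "status": response.status,
                "code": response.getcode(),
                "url": response.geturl(),
                "content_type": response.getheader("Content-Type"),
                "text": response.read().decode("utf-8"),  # bytes -> str
            }
    finally:
        server.shutdown()
        server.server_close()


info = inspect_response()
print(info)
```

Against a real site the same attributes are available on the object returned by `urllib.request.urlopen(url)`.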

"""
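Using Request
urllib.request.Request(url, data=None, headers={}, method=None)
Wrap the request in a Request object, then pass it to urlopen() to fetch the page.
1. data (default None): the data submitted along with the URL (for example a POST body). When it is set, the HTTP method changes from GET to POST.
2. headers (default empty): a dict of the HTTP header name/value pairs to send.

Adding specific headers to the request lets you build a complete HTTP request message.
Call Request.add_header() to add or change a header, and Request.get_header() to read a header that is already set.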

"""
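A Request object can be inspected before anything is sent, which makes the rules above easy to check. This is a minimal sketch: the shortened User-Agent string and the variable names are placeholders chosen for illustration, and no network access takes place.

```python
import urllib.parse
import urllib.request

# urlencode() builds the form string; encode() turns it into the bytes Request expects.
form = {"name": "Andy"}
data = urllib.parse.urlencode(form).encode("utf-8")

req = urllib.request.Request(
    "http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0 (example)"},  # placeholder value
)
req.add_header("Connection", "keep-alive")

# Supplying data switches the method from GET to POST.
method_with_data = req.get_method()

# An explicit method= argument overrides that default.
put_req = urllib.request.Request("http://httpbin.org/put", data=data, method="PUT")
method_explicit = put_req.get_method()

# Header names are stored capitalized: first letter upper case, the rest lower case.
stored_agent = req.get_header("User-agent")
missing_agent = req.get_header("User-Agent")  # None: the lookup is case-sensitive

print(method_with_data, method_explicit)
print(req.data)
print(stored_agent, missing_agent)
```

Because get_header() matches the stored spelling exactly, looking a header up under its usual mixed-case name returns None rather than the value.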
url = 'http://httpbin.org/post'
# Pose as a browser through the User-Agent header
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Connection': 'keep-alive'
}

# Named form_data rather than dict, to avoid shadowing the built-in
form_data = {
    'name': 'Andy'
}

data = urllib.parse.urlencode(form_data).encode('utf-8')

# Build a Request object from the url together with data and headers
request = urllib.request.Request(url=url, data=data, headers=headers)

# Request.add_header() adds or changes a single header
#request.add_header("Connection", "keep-alive")
# Request.get_header() reads a header back
#print(request.get_header(header_name='Connection'))

# Header names are stored with the first letter upper case and the rest lower case
#print(request.get_header("User-agent"))

# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')
print(html)
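If httpbin.org is unreachable, the same round trip can be checked locally. The sketch below is an assumption-laden stand-in, not httpbin itself: `EchoHandler` and `post_to_local_echo` are hypothetical names, the server simply echoes the form fields and the User-Agent header as JSON, and an opener with an empty `ProxyHandler` is used so that proxy settings do not intercept the loopback request.

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for httpbin: echoes the form and User-Agent as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        payload = json.dumps({
            "form": form,
            "user_agent": self.headers.get("User-Agent"),
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def post_to_local_echo(fields, user_agent):
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/post" % server.server_address[1]
        data = urllib.parse.urlencode(fields).encode("utf-8")
        request = urllib.request.Request(
            url, data=data, headers={"User-Agent": user_agent}
        )
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()


echoed = post_to_local_echo({"name": "Andy"}, "Mozilla/5.0 (example)")
print(echoed)
```

The echoed User-Agent is the one set on the Request, which suggests that headers passed to Request take precedence over the opener's default `Python-urllib` agent.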
