Python 3.X 要使用urllib.request 来抓取网络资源。转
Python 3.X 要使用urllib.request 来抓取网络资源。
最简单的方式:
#coding=utf-8
import urllib.request
response = urllib.request.urlopen('http://python.org/')
buff = response.read()
#显示
html = buff.decode("utf8")
response.close()
print(html)
使用Request的方式:
#coding=utf-8
import urllib.request
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
buff = response.read()
#显示
the_page = buff.decode("utf8")
response.close()
print(the_page)
这种方式同样可以用来处理其他URL,例如FTP:
#coding=utf-8
import urllib.request
req = urllib.request.Request('ftp://ftp.pku.edu.cn/')
response = urllib.request.urlopen(req)
buff = response.read()
#显示
the_page = buff.decode("utf8")
response.close()
print(the_page)
使用POST请求:
import urllib.parseimport
urllib.requesturl = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
the_page = response.read()
使用GET请求:
import urllib.request
import urllib.parse
data = {}
data['name'] = 'Somebody Here'
data['location'] = 'Northampton'
data['language'] = 'Python'
url_values = urllib.parse.urlencode(data)
print(url_values)
name=Somebody+Here&language=Python&location=Northampton
url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
data = urllib.request.open(full_url)
添加header:
import urllib.parse
import urllib.request url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent } data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
错误处理:
req = urllib.request.Request('http://www.pretend_server.org')
try: urllib.request.urlopen(req)
except urllib.error.URLError as e:
print(e.reason)
返回的错误代码:
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'), 200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'), 300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'), 400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'), 500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
Python 3.X 要使用urllib.request 来抓取网络资源。转的更多相关文章
- Python做简单爬虫(urllib.request怎么抓取https以及伪装浏览器访问的方法)
一:抓取简单的页面: 用Python来做爬虫抓取网站这个功能很强大,今天试着抓取了一下百度的首页,很成功,来看一下步骤吧 首先需要准备工具: 1.python:自己比较喜欢用新的东西,所以用的是Pyt ...
- 使用python/casperjs编写终极爬虫-客户端App的抓取-ZOL技术频道
使用python/casperjs编写终极爬虫-客户端App的抓取-ZOL技术频道 使用python/casperjs编写终极爬虫-客户端App的抓取
- [Python爬虫] 之十四:Selenium +phantomjs抓取媒介360数据
具体代码如下: # coding=utf-8import osimport refrom selenium import webdriverimport selenium.webdriver.supp ...
- 使用Request+正则抓取猫眼电影(常见问题)
目前使用Request+正则表达式,爬取猫眼电影top100的例子很多,就不再具体阐述过程! 完整代码github:https://github.com/connordb/Top-100 总结一下,容 ...
- python网络爬虫 - 设定重试次数内反复抓取
import urllib.request def download(url, num_retries=2): print('Downloading:', url) try: html = urlli ...
- python爬虫(一)_爬虫原理和数据抓取
本篇将开始介绍Python原理,更多内容请参考:Python学习指南 为什么要做爬虫 著名的革命家.思想家.政治家.战略家.社会改革的主要领导人物马云曾经在2015年提到由IT转到DT,何谓DT,DT ...
- Python爬虫入门教程 29-100 手机APP数据抓取 pyspider
1. 手机APP数据----写在前面 继续练习pyspider的使用,最近搜索了一些这个框架的一些使用技巧,发现文档竟然挺难理解的,不过使用起来暂时没有障碍,估摸着,要在写个5篇左右关于这个框架的教程 ...
- Python爬虫【四】Scrapy+Cookies池抓取新浪微博
1.设置ROBOTSTXT_OBEY,由true变为false 2.设置DEFAULT_REQUEST_HEADERS,将其改为request headers 3.根据请求链接,发出第一个请求,设置一 ...
- Python大数据:外部数据获取(网页抓取)
import urllib2 as url import cookielib,StringIO,gzip,json import pandas as pd import numpy as np #定义 ...
随机推荐
- [RabbitMQ学习笔记] - 初识RabbitMQ
RabbitMQ是一个由erlang开发的AMQP的开源实现. 核心概念 Message 消息,消息是不具名的,它由消息头和消息体组成,消息体是不透明的,而消息头则由 一系列的可选属性组成,这些属性包 ...
- entity framework浅谈
1. 什么是EF 微软提供的ORM工具. ORM让开发人员节省数据库访问代码的时间. 将更多的时间放在业务逻辑层面上. 开发人员使用linq语言, 对数据库进行操作. 2. EF的使用场景 EF有三种 ...
- FZU 2150 Fire Game(点火游戏)
FZU 2150 Fire Game(点火游戏) Time Limit: 1000 mSec Memory Limit : 32768 KB Problem Description - 题目描述 ...
- 用flvplayer.swf在网页中播放视频(网页中flash视频播放的实现)
原:http://blog.csdn.net/ricciozhang/article/details/46868201 由于公司项目的需求,需要在展示一些信息的时候能够播放视频,拿到这个要求,我就从最 ...
- 关于jQ的Ajax操作
jQ的Ajax操作 什么是AJAX AJAX = 异步的javascript和XML(Asynchronous Javascript and XML) 它不是一门编程语言,而是利用JavaScript ...
- 【译】第42节---EF6-DbSet.AddRange & DbSet.RemoveRange
原文:http://www.entityframeworktutorial.net/entityframework6/addrange-removerange.aspx EF 6中的DbSet引入了新 ...
- [luogu]P1852跳跳棋
题目重点是每次不能跳过两个棋子 即对于每一个棋子的状态(a,b,c) (a<b<c) 最多有两种移动的方式 1.中间往两边跳 (a,b,c)-->(2b-a,a,c)或(a,c,2b ...
- 单源最短路——Dijkstra模板
算法思想: 类似最小生成树的贪心算法,从起点 v0 每次新拓展一个距离最小的点,再以这个点为中间点,更新起点到其他点的距离. 算法实现: 需要定义两个一维数组:①vis[ i ] 表示是否从源点到顶点 ...
- Qt532.QString_填充字符
1.代码: void MainWindow::on_pushButton_clicked() { QString str = "; QString str01 = str.leftJusti ...
- IPC 之 Socket 的使用
一.概述 我们知道在开发中,即时通讯.设备间的通信都是使用 Socket 实现,那当然用它来实现进程间通信更是不成问题.Socket 即套接字,是一个对 TCP / IP协议进行封装 的编程调用接口( ...