Python 3.x: Use urllib.request to Fetch Network Resources (Repost)

In Python 3.x, network resources are fetched with urllib.request.

The simplest way:

# coding=utf-8
import urllib.request

response = urllib.request.urlopen('http://python.org/')
buff = response.read()  # the body is returned as bytes
# Decode the bytes for display
html = buff.decode("utf8")
response.close()
print(html)
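
The snippet above closes the response by hand and assumes the page is UTF-8. As a minimal sketch, the helper below uses a with block so the response is closed even if decoding fails, and reads the charset from the Content-Type header when one is declared. The name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration, not part of urllib.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
        # errors="replace" avoids an exception on stray undecodable bytes.
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text('http://python.org/') would then return the page as a str.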

Using a Request object:

# coding=utf-8
import urllib.request

req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
buff = response.read()
# Decode the bytes for display
the_page = buff.decode("utf8")
response.close()
print(the_page)
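
A Request object only describes the request; nothing is sent until it is passed to urlopen. The sketch below inspects a Request without touching the network. The variable post_req and its body b'key=value' are made up for illustration.

```python
import urllib.request

# Building a Request does not open a connection, so it is safe to inspect.
req = urllib.request.Request('http://www.voidspace.org.uk')

print(req.full_url)      # the URL passed to the constructor
print(req.type)          # the URL scheme
print(req.host)          # the host part of the URL
print(req.get_method())  # GET, because no data was supplied

# Supplying a bytes body switches the default method to POST.
post_req = urllib.request.Request('http://www.voidspace.org.uk', data=b'key=value')
print(post_req.get_method())
```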

The same approach works for other URL schemes, such as FTP:

# coding=utf-8
import urllib.request

req = urllib.request.Request('ftp://ftp.pku.edu.cn/')
response = urllib.request.urlopen(req)
buff = response.read()
# Decode the bytes for display
the_page = buff.decode("utf8")
response.close()
print(the_page)

Sending a POST request:

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

# In Python 3 the request body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
the_page = response.read()
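
In Python 3 the data argument must be bytes; passing the str returned by urlencode raises a TypeError when the request is opened. As a rough sketch, the helper below wraps the encoding step and builds the Request without sending it. The name build_post_request is hypothetical, and the explicit method='POST' is optional since a non-empty body already implies POST.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Return a Request whose body is the form-encoded fields as bytes."""
    # urlencode percent-encodes everything, so the result is pure ASCII.
    body = urllib.parse.urlencode(fields).encode('ascii')
    return urllib.request.Request(url, data=body, method='POST')


req = build_post_request(
    'http://www.someserver.com/cgi-bin/register.cgi',
    {'name': 'Michael Foord', 'location': 'Northampton', 'language': 'Python'},
)
print(req.get_method())
print(req.data)
```

The resulting req can be passed to urllib.request.urlopen in the usual way.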

Sending a GET request:

import urllib.request
import urllib.parse

data = {}
data['name'] = 'Somebody Here'
data['location'] = 'Northampton'
data['language'] = 'Python'
url_values = urllib.parse.urlencode(data)
print(url_values)
# On Python 3.7+ the pairs are printed in insertion order:
# name=Somebody+Here&location=Northampton&language=Python

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
data = urllib.request.urlopen(full_url)
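
urlencode also takes care of characters that are not safe in a query string, and it can expand list values. The sketch below shows both, then decodes the result with parse_qs. The parameter names q and tag are invented for illustration.

```python
import urllib.parse

params = {'q': 'a b&c', 'tag': ['x', 'y']}

# doseq=True expands the list into one key=value pair per element.
query = urllib.parse.urlencode(params, doseq=True)
print(query)  # q=a+b%26c&tag=x&tag=y

# parse_qs reverses the encoding; every value comes back as a list.
decoded = urllib.parse.parse_qs(query)
print(decoded)  # {'q': ['a b&c'], 'tag': ['x', 'y']}

full_url = 'http://www.example.com/example.cgi' + '?' + query
print(full_url)
```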

Adding headers:

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

# The request body must be bytes in Python 3
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
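
Headers can also be attached after construction with add_header. One detail that is easy to trip over: Request stores header names in str.capitalize() form, so lookups on the Request object need that spelling. A small sketch follows; the Accept-Language header is added purely for illustration.

```python
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

req = urllib.request.Request(url, headers={'User-Agent': user_agent})
# add_header() is an alternative to the headers argument.
req.add_header('Accept-Language', 'en')

# Names are stored as 'User-agent' and 'Accept-language'.
print(req.has_header('User-agent'))       # True
print(req.has_header('User-Agent'))       # False
print(req.get_header('Accept-language'))  # en
print(req.header_items())
```

This only affects lookups on the Request object; HTTP header names themselves are case-insensitive.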

Error handling:

import urllib.error
import urllib.request

req = urllib.request.Request('http://www.pretend_server.org')
try:
    urllib.request.urlopen(req)
except urllib.error.URLError as e:
    # e.reason explains why the request failed
    print(e.reason)
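
URLError covers failures to reach a server at all. When the server does answer but with an error status, urlopen raises HTTPError, which is a subclass of URLError and therefore has to be caught first. A minimal sketch, assuming a hypothetical helper named try_fetch and a 10-second timeout:

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return (body, None) on success or (None, message) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, 'HTTP error {}: {}'.format(e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable response: DNS failure, refused connection, unknown scheme.
        return None, 'Failed to reach the server: {}'.format(e.reason)
```

For example, try_fetch('http://www.pretend_server.org') would be expected to return None together with a message describing the lookup failure.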

Response codes returned by the server:

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),
    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),
    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),
    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),
    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }
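
There is no need to copy this table into your own code: the standard library exposes the same information. The sketch below looks up one status code in three ways; the choice of 404 is arbitrary.

```python
import http.client
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler

code = 404

# Short reason phrase only.
print(http.client.responses[code])  # Not Found

# HTTPStatus carries both the short phrase and a longer description.
status = HTTPStatus(code)
print(status.phrase)
print(status.description)

# The handler class keeps a {code: (shortmessage, longmessage)} mapping.
short, long_msg = BaseHTTPRequestHandler.responses[code]
print(short, '-', long_msg)
```

The code attribute of a caught urllib.error.HTTPError can be used as the lookup key in the same way.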
