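Python Web Crawler Study Notes (Part 1)

1. Introduction to urllib2

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of different protocols.
It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies.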
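As a rough illustration of that more complex interface, the sketch below builds an opener that keeps cookies between requests. The helper name build_cookie_opener is hypothetical, and the try/except import is an assumption added so the same snippet also runs on Python 3, where urllib2 and cookielib live under urllib.request and http.cookiejar.

```python
try:
    import urllib2                       # Python 2
    import cookielib
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)
    import http.cookiejar as cookielib


def build_cookie_opener():
    """Return (opener, jar); the opener stores any cookies it receives in jar."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar
```

Calling opener.open(url) then takes the place of urllib2.urlopen(url), and the jar fills up as servers set cookies.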

2. Fetching URLs

The simplest way to use urllib2 is as follows:

import urllib2

response = urllib2.urlopen('http://python.org/')
html = response.read()
print html

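The output is the content of the fetched web page.

urllib2 can fetch URLs of other schemes in the same way: 'http:' can be replaced with 'ftp:', 'file:', and so on. HTTP is based on requests and responses, and urllib2 uses a Request object to represent the HTTP request. In its simplest form, you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can call .read() on it: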

import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

urllib2 handles all URL schemes through the same Request interface. For example, you can make an FTP request like this:

req = urllib2.Request('ftp://example.com/')
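The file: scheme is convenient for trying this out without a network connection. The following is a minimal sketch, not part of the original notes: read_local_file is a hypothetical helper, and the import fallback is an assumption so that it also runs on Python 3.

```python
import os

try:
    import urllib2                       # Python 2
    from urllib import pathname2url
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)
    from urllib.request import pathname2url


def read_local_file(path):
    """Fetch a local file through urlopen using the file: scheme."""
    file_url = 'file:' + pathname2url(os.path.abspath(path))
    response = urllib2.urlopen(file_url)
    try:
        # The response is file-like, exactly as for an HTTP response.
        return response.read()
    finally:
        response.close()
```

The returned object supports the same .read() call as an HTTP response, which is the point of the common interface.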

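3. Data

Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or another web application).

With HTTP, this is usually done with a POST request. This is what your browser typically does when you submit an HTML form that you have filled in.

Not all POSTs come from forms: you can use a POST to send arbitrary data to your own application.

In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with a function from the urllib library, not from urllib2.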

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
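Whether a Request ends up as a GET or a POST can be checked without sending anything, using the Request object's get_method(). The sketch below assumes a hypothetical helper named build_form_request. The .encode('ascii') step is only needed on Python 3, where the request body must be bytes, and the import fallback is likewise an assumption.

```python
try:
    import urllib2                       # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)
    from urllib.parse import urlencode


def build_form_request(url, values=None):
    """Build a Request; passing form values turns it into a POST."""
    data = None
    if values is not None:
        # urlencode percent-escapes non-ASCII text, so ASCII encoding is safe.
        data = urlencode(values).encode('ascii')
    return urllib2.Request(url, data)


post_req = build_form_request('http://www.someserver.com/cgi-bin/register.cgi',
                              {'name': 'Michael Foord'})
print(post_req.get_method())
```

Nothing is sent over the network here; the request is only constructed and inspected.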

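If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way.

Although the HTTP standard makes it clear that POST requests are intended to cause side effects and GET requests are not, nothing prevents a GET request from having side effects or a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself.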

>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values # The order may differ.
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)

The full URL is created by adding a ? to the URL, followed by the encoded values.
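The same step can be wrapped in a small function. This is a sketch under assumptions: build_get_url is a hypothetical name, and the parameters are passed as a list of pairs so the order of the output is predictable. It also uses & instead of ? when the URL already carries a query string.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3 fallback (assumption)


def build_get_url(url, params):
    """Append URL-encoded params to url as a query string."""
    separator = '&' if '?' in url else '?'
    return url + separator + urlencode(params)


full_url = build_get_url('http://www.example.com/example.cgi',
                         [('name', 'Somebody Here'), ('language', 'Python')])
print(full_url)
# http://www.example.com/example.cgi?name=Somebody+Here&language=Python
```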

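4. Headers

We will discuss one particular HTTP header here to illustrate how to add headers to your HTTP request. Some websites dislike being browsed by programs, or send different versions of their content to different browsers.

By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, for example Python-urllib/2.5), which may confuse the site or simply not work.

A browser identifies itself through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers.

The following example makes the same request as above, but identifies itself with a browser-style User-Agent string.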

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
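Both the default identification and a custom one can be inspected without making a request. In the sketch below, default_user_agent and request_with_user_agent are hypothetical helper names and the import fallback is an assumption. Note that Request stores header names capitalized as 'User-agent', so that is the spelling to use when reading the header back.

```python
try:
    import urllib2                       # Python 2
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)


def default_user_agent():
    """Return the User-Agent string a default opener would send."""
    opener = urllib2.build_opener()
    return dict(opener.addheaders).get('User-agent')


def request_with_user_agent(url, user_agent):
    """Build a Request that carries a custom User-Agent header."""
    return urllib2.Request(url, headers={'User-Agent': user_agent})


req = request_with_user_agent('http://www.someserver.com/cgi-bin/register.cgi',
                              'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')
print(default_user_agent())
print(req.get_header('User-agent'))
```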

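5. URLError

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a "reason" attribute that holds the underlying error, typically an error code together with a text error message.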

import urllib2

req = urllib2.Request('http://www.pretend_server.org')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason

The output is:

[Errno -2] Name or service not known
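A URLError can also be provoked without any network access, for instance with a URL scheme that no handler knows. The sketch below assumes a hypothetical helper describe_failure and the made-up scheme nosuchscheme. The import fallback for Python 3 is also an assumption.

```python
try:
    import urllib2                       # Python 2
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)


def describe_failure(url):
    """Return None if url opens, else a short description of the URLError."""
    try:
        urllib2.urlopen(url).close()
    except urllib2.URLError as e:
        # reason may be a string or a lower-level exception object.
        return 'failed: %s' % (e.reason,)
    return None


print(describe_failure('nosuchscheme://example/'))
# HTTPError is a subclass, so the same except clause would catch it too.
print(issubclass(urllib2.HTTPError, urllib2.URLError))
```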

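6. HTTPError

Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 handles it).

For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

The error codes are listed below: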

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
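In Python 2 this table is available as BaseHTTPServer.BaseHTTPRequestHandler.responses, so it does not need to be copied by hand. The lookup below is a minimal sketch: describe_status is a hypothetical helper, and the fallback import from http.server is an assumption for running it on Python 3.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3 fallback (assumption)


def describe_status(code):
    """Return 'code short-message', or 'code Unknown' for unlisted codes."""
    short_message, _long_message = BaseHTTPRequestHandler.responses.get(
        code, ('Unknown', ''))
    return '%d %s' % (code, short_message)


print(describe_status(404))
```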

When an error is raised, the server responds by returning an HTTP error code and an error page.
You can use the HTTPError instance as a response object for the page returned.
This means that, as well as the code attribute, it also has the read, geturl, and info methods.
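This response-like behaviour can be checked offline by constructing an HTTPError by hand. The sketch below is for illustration only: summarize_http_error is a hypothetical helper, the error page body is made up, and the import fallback is an assumption for Python 3. In real use the exception comes from urlopen, not from your own code.

```python
import io

try:
    import urllib2                       # Python 2
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)


def summarize_http_error(error):
    """Return (status code, URL, body) taken from an HTTPError instance."""
    return error.code, error.geturl(), error.read()


# A hand-built HTTPError standing in for a real server reply.
fake_error = urllib2.HTTPError(
    'http://www.python.org/fish.html',        # URL that was requested
    404,                                      # status code
    'Not Found',                              # short message
    {},                                       # headers (empty in this sketch)
    io.BytesIO(b'<html>error page</html>'),   # body of the error page
)
print(summarize_http_error(fake_error))
```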

import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print e.code
    print e.read()

Running it prints:

404
<!doctype html>



<html class="no-js" lang="en" dir="ltr">