Web Crawling, Part 1: Using the Basic Libraries

Python 2 provided two libraries for sending requests, urllib and urllib2. Python 3 merges them into a single package, urllib.

urllib contains the following four modules:

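 request: the most basic request module, used to send HTTP requests

 error: the exception module; catching its exceptions keeps the program from terminating unexpectedly

 parse: a utility module with functions for splitting, parsing, and joining URLs

 robotparser: parses a site's robots.txt file to determine which pages may be crawled and which may not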

I. Using the request module

1. urlopen()

The basic way to build and send an HTTP request.

# Send a request to the Python website, as a browser would
import urllib.request

response = urllib.request.urlopen("http://www.python.org")
print(response.read().decode("utf-8"))

The return value is an HTTPResponse object. Its main methods are read(), readinto(), getheader(name), getheaders(), fileno(), and getcode().
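To try these methods without depending on an external site, here is a minimal sketch that starts a throwaway local server and reads its response. HelloHandler and the "hello" body are made up for illustration. The sketch uses opener.open() with an empty ProxyHandler instead of urlopen() only so that a proxy configured in the environment cannot interfere; it returns the same kind of response object.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = "hello".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An opener with an empty ProxyHandler ignores proxy settings from the environment
opener = build_opener(ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.getcode()                         # HTTP status code
    content_type = response.getheader("Content-Type")   # one header, looked up by name
    all_headers = response.getheaders()                 # list of (name, value) tuples
    text = response.read().decode("utf-8")              # body decoded to text

server.shutdown()
server.server_close()
print(status, content_type, text)
```

Against a real site the same calls apply; only the URL changes.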

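Main parameters:

url: the address to request

data: the form data to send, as bytes; when it is given, the request is sent as a POST

timeout: the request timeout, in seconds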

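import socket
from urllib import error, parse, request

try:
    # POST data must be bytes, so URL-encode the form first and then encode it
    data = bytes(parse.urlencode({'hello': 'word'}), encoding="utf-8")
    response = request.urlopen("http://httpbin.org/post", data=data, timeout=1)
    print(response.read().decode("utf-8"))
except error.URLError as e:
    # On a timeout, e.reason is a socket.timeout instance rather than a string
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")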

With the timeout set to 1 second, the output is:

{"args":{},"data":"","files":{},"form":{"hello":"word"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.6"},"json":null,"origin":"183.203.223.38","url":"http://httpbin.org/post"}

With the timeout set to 0.01 seconds, the exception is caught and TIME OUT is printed.

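2. Request

When a simple HTTP request is not enough, use the Request class to build a more complex one.

Main parameters:

url: the URL to request

data: the form data to send; it must be URL-encoded with urlencode() and converted to bytes

headers: the request headers, as a dictionary

origin_req_host: the host name or IP address of the requester

unverifiable: whether the request is unverifiable; defaults to False. A request is unverifiable when the user had no option to approve it, for example an image fetched automatically while loading an HTML page

method: the HTTP method to use, such as GET, POST, or PUT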

from urllib import request, parse

# Form data must be URL-encoded and converted to bytes
data = bytes(parse.urlencode({"name": "sunqi"}), encoding='utf-8')
head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Host": "httpbin.org"}
req = request.Request('http://httpbin.org/post', data=data, headers=head, origin_req_host="", method="POST")
res = request.urlopen(req)
print(res.read().decode('utf-8'))

{"args":{},"data":"","files":{},"form":{"name":"sunqi"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"},"json":null,"origin":"183.203.223.38","url":"http://httpbin.org/post"}
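A Request object can also be inspected before it is sent, which is handy for checking that the parameters above ended up where expected. The sketch below only builds the object and sends nothing; the shortened User-Agent string and the extra Accept header are illustrative choices.

```python
from urllib import parse, request

data = parse.urlencode({"name": "sunqi"}).encode("utf-8")
req = request.Request(
    "http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},
    method="POST",
)
# Headers can also be added after construction
req.add_header("Accept", "application/json")

method = req.get_method()
full_url = req.full_url
# Header names are stored capitalized, so "User-Agent" becomes "User-agent"
user_agent = req.get_header("User-agent")
header_names = sorted(name for name, _ in req.header_items())
print(method, full_url, user_agent, header_names)
```

Nothing goes over the network until the object is passed to urlopen() or to an opener.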

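3. Advanced usage: handlers

HTTPDefaultErrorHandler: handles HTTP error responses; every error is raised as an HTTPError exception

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles cookies

ProxyHandler: sets proxies; when no proxy dictionary is given, the proxy settings are read from the environment

HTTPBasicAuthHandler: manages authentication; use it when opening a link requires authentication

HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords

OpenerDirector: provides the open() method, whose return value has the same type as that of urlopen(); an opener is built from handlers

The HTTPPasswordMgrWithDefaultRealm class creates a password manager object that stores the user names and passwords used for HTTP requests. It is mainly used in two scenarios:

1. Verifying the user name and password for proxy authorization (ProxyBasicAuthHandler())
2. Verifying the user name and password of a web client (HTTPBasicAuthHandler())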
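As a rough illustration of how handlers are combined into an opener, the sketch below builds one around a ProxyHandler. The proxy address 127.0.0.1:8080 is a placeholder, not a working proxy, and no request is sent.

```python
from urllib.request import OpenerDirector, ProxyHandler, build_opener

# Hypothetical proxy address; replace it with a real proxy before sending requests
proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
})
opener = build_opener(proxy_handler)

is_opener = isinstance(opener, OpenerDirector)
# build_opener() also installs the default handlers, such as the redirect handler
handler_names = [type(handler).__name__ for handler in opener.handlers]
print(is_opener, handler_names)
```

Calling opener.open(url) would then send the request through the configured proxy.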

(1) Crawling a site that requires authentication needs an authentication handler

from urllib.request import HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm, build_opener
from urllib.error import URLError

username = ""
password = ""
url = "http://localhost:5000/"

p = HTTPPasswordMgrWithDefaultRealm()  # password manager object
p.add_password(None, url, username, password)  # register the credentials for this URL
auth_handler = HTTPBasicAuthHandler(p)  # authentication handler backed by the password manager
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    print(result.status)  # status is an attribute, not a method
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
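To see what the password manager actually stores, its lookup method can be called directly, with no server involved. This is a small sketch; demo_user and demo_pass are placeholder credentials.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

# Hypothetical credentials, for illustration only
manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, "http://localhost:5000/", "demo_user", "demo_pass")

# Any realm matches, because the entry was stored under the default realm (None),
# and any URL below the registered one matches as well
found = manager.find_user_password("Some Realm", "http://localhost:5000/admin")
# A different host has no stored credentials
missing = manager.find_user_password("Some Realm", "http://example.com/")
print(found, missing)
```

Passing None as the realm is what lets one entry cover every realm the server might announce.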

(2) Handling cookies

Getting cookies

import http.cookiejar, urllib.request

# Print each cookie

cookie = http.cookiejar.CookieJar()

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open("http://www.baidu.com")

for item in cookie:

        print(item.name+"->"+item.value)

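Saving cookies to a file

ignore_discard: save cookies even if they are set to be discarded

ignore_expires: save cookies even if they have expired. If the file already exists, save() overwrites it.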

filename = "cookies_LWP.txt"

cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)

filename = "cookies_Mozilla.txt"

cookie = http.cookiejar.MozillaCookieJar(filename)  # file in Mozilla cookies.txt format

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open("http://www.baidu.com")

cookie.save(ignore_discard=True,ignore_expires=True)
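The effect of ignore_discard can be checked offline by building cookies by hand instead of fetching them. In this sketch the cookie names, the example.com domain, and the helper functions are all made up for illustration.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, expires):
    """Build a cookie by hand; expires=None makes it a session cookie."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


def saved_names(ignore_discard):
    """Save two cookies with the given flag and return the names found in the file."""
    jar = LWPCookieJar()
    jar.set_cookie(make_cookie("session_id", "abc", None))       # discarded when the session ends
    jar.set_cookie(make_cookie("remember_me", "1", 4102444800))  # expires far in the future
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "cookies_LWP.txt")
        jar.save(path, ignore_discard=ignore_discard)
        reloaded = LWPCookieJar()
        # Load everything that is in the file, including session cookies
        reloaded.load(path, ignore_discard=True)
        return sorted(cookie.name for cookie in reloaded)


default_names = saved_names(ignore_discard=False)
all_names = saved_names(ignore_discard=True)
print(default_names)
print(all_names)
```

Without the flag only the persistent cookie reaches the file; with it, the session cookie is written as well.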

Loading cookies from a file

cookie = http.cookiejar.LWPCookieJar()

cookie.load("cookies_LWP.txt",ignore_expires=True,ignore_discard=True)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open("http://www.baidu.com")

print(response.read().decode('utf-8'))

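II. Using the error module

1. URLError

URLError is defined in the error module and is the base class of the module's exceptions; any exception raised by the request module can be caught with it.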

# URLError
from urllib import request, error

try:
    request.urlopen("http://cuiqingcai.com/index.htm")
except error.URLError as e:
    print(e.reason)  # Not Found

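2. HTTPError

A subclass of URLError, used specifically for errors in HTTP requests. It has three attributes:

code: the HTTP status code returned

reason: the cause of the error

headers: the response headers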

# HTTPError, a subclass of URLError
from urllib import request, error

try:
    request.urlopen("http://cuiqingcai.com/index.htm")
except error.HTTPError as e:
    print(e.code)     # status code
    print(e.reason)   # cause of the error
    print(e.headers)  # response headers

404

Not Found

Server: nginx/1.10.3 (Ubuntu)

Date: Thu, 19 Jul 2018 01:18:42 GMT

Content-Type: text/html; charset=UTF-8

Transfer-Encoding: chunked

Connection: close

Vary: Cookie

Expires: Wed, 11 Jan 1984 05:00:00 GMT

Cache-Control: no-cache, must-revalidate, max-age=0

Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

The output of the run is shown above.

3. reason is not always a string; it may be an exception object

# A reason that is not a string
import socket
from urllib import request, error

try:
    response = request.urlopen("http://www.baidu.com", timeout=0.01)
except error.HTTPError as e:
    print(e.reason, e.headers, e.code)
except error.URLError as e:
    print(e.reason)
    print(type(e.reason))  # e.reason is not necessarily a string
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")
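The ordering used above (HTTPError first, URLError second) can be exercised without an external site. The sketch below is one way to do it, assuming a throwaway local server: NotFoundHandler and describe() are hypothetical names, and the opener with an empty ProxyHandler is there only to keep environment proxies out of the way.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error
from urllib.request import ProxyHandler, build_opener


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# An opener without proxies, so local requests are never routed through a proxy
opener = build_opener(ProxyHandler({}))


def describe(url):
    """Return a short description of what happened when opening url."""
    try:
        with opener.open(url, timeout=5) as response:
            return "OK %d" % response.status
    except error.HTTPError as e:   # subclass first: the server answered with an error status
        return "HTTPError %d" % e.code
    except error.URLError as e:    # base class second: no HTTP response was received
        return "URLError %s" % type(e.reason).__name__


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
http_result = describe("http://127.0.0.1:%d/missing" % port)
server.shutdown()
server.server_close()

# The server is gone now, so the connection itself fails
url_result = describe("http://127.0.0.1:%d/missing" % port)
print(http_result)
print(url_result)
```

If the two except clauses were swapped, the URLError branch would also catch the 404, because HTTPError is its subclass.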

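III. Using the parse module

The parse module mainly provides functions for parsing, splitting, and joining URLs:

urlparse(): identifies and segments a URL, splitting it into 6 parts: scheme, netloc (domain), path, params, query, and fragment (anchor)

urlunparse(): builds a URL; the argument must contain exactly 6 items (the six above)

urlsplit(): splits a URL without separating out params, which stay part of the path

urlunsplit(): builds a URL; the argument must contain exactly 5 items (the six above minus params)

urljoin(): joins URLs; the base URL is the first argument and the new URL is the second

urlencode(): serializes a dictionary of query parameters into a GET query string

parse_qs(): deserializes the query string of a GET request into a dictionary

parse_qsl(): converts the query string of a GET request into a list of tuples

quote(): converts content, such as Chinese text, into URL-encoded form

unquote(): decodes URL-encoded text back into the original content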

from urllib.parse import urlparse

# urlparse: parse a URL
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment", allow_fragments=True)
print(type(result))
print(result)

'''
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The parts are: scheme, netloc, path, params, query, fragment

Standard URL format:
scheme://netloc/path;params?query#fragment

urlparse(urlstring, scheme="", allow_fragments=True)

scheme is only a default: it is used when the URL has no scheme of its own; otherwise the URL's own scheme is used
With allow_fragments=False, the fragment is not split off and stays part of the query
If there is no query either, the fragment stays part of the path
'''
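The notes about scheme and allow_fragments are easy to confirm with a few calls. This is a small sketch; the variable names are arbitrary.

```python
from urllib.parse import urlparse

# No scheme in the URL, so the scheme argument is used as the default
no_scheme = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
# The URL has its own scheme, so the scheme argument is ignored
own_scheme = urlparse("http://www.baidu.com/index.html;user?id=5#comment", scheme="https")
# allow_fragments=False: the fragment stays in the query
in_query = urlparse("http://www.baidu.com/index.html;user?id=5#comment", allow_fragments=False)
# allow_fragments=False and no query: the fragment stays in the path
in_path = urlparse("http://www.baidu.com/index.html#comment", allow_fragments=False)

print(no_scheme)
print(own_scheme.scheme)
print(in_query.query)
print(in_path.path)
```

Note that without a leading // the host name in the first call is parsed as part of the path, so netloc is empty there.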

from urllib.parse import urlunparse

# urlunparse: build a URL
# The argument must be a sequence of exactly 6 items (a list or a tuple), in the order
# scheme, netloc, path, params, query, fragment.
# A dictionary is not suitable: iterating over it yields its keys, not its values.
data = ("http", "www.baidu.com", "index.html", "user", "", "")
url = urlunparse(data)
print(url)  # http://www.baidu.com/index.html;user

# urlsplit: split a URL
from urllib.parse import urlsplit

result = urlsplit("http://www.baidu.com/index.html;user?id=5#comment")
print(result)  # params is not separated out; it stays part of path

# urlunsplit: build a URL
from urllib.parse import urlunsplit

data = ("http", "www.baidu.com", "index.html", "user", "")  # must contain exactly 5 items
url = urlunsplit(data)
print(url)
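Because the result of urlsplit() is a named tuple, a common pattern is to split a URL, change one part, and put it back together. A minimal sketch, with id=6 as an arbitrary new query:

```python
from urllib.parse import urlsplit, urlunsplit

url = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlsplit(url)

# Fields are reachable by name or by index
path = parts.path
same_path = parts[2]
# Putting the five parts back together reproduces the original URL
rebuilt = urlunsplit(parts)
# _replace() returns a copy with one field changed
next_page = urlunsplit(parts._replace(query="id=6"))
print(path)
print(rebuilt)
print(next_page)
```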

from urllib.parse import urljoin

# urljoin: combine two URLs into a new one; the first is base_url, the second is the given URL
# base_url supplies the scheme, netloc, and path; whichever of these the given URL lacks is filled in from base_url
print(urljoin("http://www.baidu.com", "eg.html"))
# No scheme or netloc, so both are filled in: http://www.baidu.com/eg.html

print(urljoin("http://www.baidu.com", "//www.sun.com/eg.html"))
# No scheme, so the scheme is filled in: http://www.sun.com/eg.html

print(urljoin("http://www.baidu.com", 'http://www.sun.com/index.html'))
# Both are present, so nothing is filled in

print(urljoin("http://www.baidu.com/index.html", "http:"))
# http://www.baidu.com/index.html

print(urljoin("http://www.baidu.com/index.html?wd=123", "http://www.sun.com/index.html"))
# http://www.sun.com/index.html

print(urljoin("http://www.baidu.com/index.html?wd=123#comment", "http://www.sun.com/index.php?wd=456#com"))
# http://www.sun.com/index.php?wd=456#com  Here the params, query, and fragment of base_url have no effect
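In a crawler, urljoin() is typically used to turn the relative links found on a page into absolute URLs. The page URL and the link list below are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and links extracted from it
page_url = "http://www.baidu.com/a/b/index.html"
links = ["c.html", "../d.html", "/e.html", "?page=2"]

absolute = [urljoin(page_url, link) for link in links]
for link, resolved in zip(links, absolute):
    print(link, "->", resolved)
```

Each link is resolved the way a browser would resolve it relative to the page it appears on.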

# urlencode: build the query string of a GET request
from urllib.parse import urlencode

params = {"name": "sunqi", "password": "123456"}
url = "http://www.baidu.com?" + urlencode(params)
print(url)
# http://www.baidu.com?name=sunqi&password=123456  The dictionary becomes the GET parameters

from urllib.parse import urlparse, parse_qs, parse_qsl

# Deserialize the query string into a dictionary and into a list of tuples
result = urlparse("http://www.baidu.com?name=sunqi&password=123456")
print(result[4])

dic = parse_qs(result[4])  # dictionary: {'name': ['sunqi'], 'password': ['123456']}
tu = parse_qsl(result[4])  # list of tuples: [('name', 'sunqi'), ('password', '123456')]
print(dic)
print(tu)
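parse_qs() returns lists as values because a key may appear more than once in a query string. The round trip below sketches this with a repeated key; the parameter names are arbitrary.

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

params = {"wd": "壁纸", "page": 2, "tag": ["a", "b"]}
# doseq=True turns each list element into its own key=value pair
query = urlencode(params, doseq=True)

as_dict = parse_qs(query)    # values are lists of strings
as_pairs = parse_qsl(query)  # one tuple per key=value pair, in order
print(query)
print(as_dict)
print(as_pairs)
```

Numbers come back as strings, so any type conversion has to be done after parsing.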

from urllib.parse import quote,unquote

# Encode a string into URL encoding, and decode it back

keyword = "壁纸"

url = "http://www.baidu.com/s?wd="+quote(keyword)

print(url)#http://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

print(unquote(url))#http://www.baidu.com/s?wd=壁纸
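quote() leaves "/" untouched by default, which matters when the text is a query value rather than a path. The sketch below compares the variants on a sample string chosen for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "壁纸 4k/hd"
path_style = quote(text)        # "/" is left alone by default
strict = quote(text, safe="")   # "/" is escaped too
form_style = quote_plus(text)   # spaces become "+", as in HTML form submissions

print(path_style)
print(strict)
print(form_style)
print(unquote(strict), unquote_plus(form_style))
```

unquote_plus() is the matching decoder for quote_plus(), since plain unquote() leaves "+" as it is.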

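IV. Analyzing the Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not.

It usually takes the form of a text file named robots.txt. If the file exists, a crawler fetches pages according to the rules it defines; if it does not exist, a search crawler visits every page that is directly accessible.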

Sample:

User-agent:*

Disallow:/

Allow:/public/

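All crawlers may fetch only the content under the /public/ directory. User-agent names the crawler the rules apply to; with * they apply to every crawler. Disallow specifies the pages that may not be fetched, and Allow specifies the pages that may.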

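The RobotFileParser class reads a site's robots.txt file and determines whether a crawler is permitted to fetch a given page. Common methods:

set_url(): sets the URL of the robots.txt file

read(): fetches the robots.txt file and feeds it to the parser

parse(): parses robots.txt content; the argument is a list of lines from the file

can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns a Boolean indicating whether that crawler may fetch the URL

modified(): sets the time the robots.txt file was last fetched and analyzed to the current time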
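parse() and can_fetch() can be tried offline by feeding the parser rules from a string. The sketch below uses a small rule set in the spirit of the sample, with example.com as a placeholder host. Allow is listed before Disallow here because Python's parser applies the first rule that matches a path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; Allow comes first because the first matching rule wins
robots_txt = """\
User-agent: *
Allow: /public/
Disallow: /
"""

parser = RobotFileParser()
# parse() takes a list of lines, so split the text first
parser.parse(robots_txt.splitlines())

public_ok = parser.can_fetch("*", "http://example.com/public/index.html")
private_ok = parser.can_fetch("*", "http://example.com/private/index.html")
print(public_ok, private_ok)
```

For a live site, set_url() followed by read() replaces the manual parse() call.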

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

robot = RobotFileParser()
result = urlopen('http://www.zhihu.com/robots.txt').read().decode('utf-8')
print(result)

# parse() expects a list of lines, not a single string
robot.parse(result.splitlines())

# With the robots.txt listed below, which ends with "User-Agent: *" and "Disallow: /",
# each of these calls should return False for the generic '*' agent
print(robot.can_fetch('*', 'http://www.zhihu.com/question/29173647/answer/437189494'))
print(robot.can_fetch('*', 'https://www.zhihu.com/signin?next=%2Fexplore'))
print(robot.can_fetch('*', 'http://www.zhihu.com/inbox/7013224000'))

response = urlopen('https://www.zhihu.com/inbox/7013224000').read().decode("utf-8")

# Save the page to a local file
filename = "zhihu_inbox.html"
with open(filename, "w", encoding="utf-8") as f:
    f.write(response)

User-agent: Googlebot

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Googlebot-Image

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Baiduspider-news

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Baiduspider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Baiduspider-image

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sosospider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: bingbot

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: 360Spider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: HaosouSpider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: yisouspider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: YoudaoBot

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sogou Orion spider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sogou News Spider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sogou blog

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sogou spider2

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sogou inst spider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: Sogou web spider

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: EasouSpider

Request-rate: 1/2 # load 1 page per 2 seconds

Crawl-delay: 10

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-agent: MSNBot

Request-rate: 1/2 # load 1 page per 2 seconds

Crawl-delay: 10

Disallow: /login

Disallow: /logout

Disallow: /resetpassword

Disallow: /terms

Disallow: /search

Disallow: /notifications

Disallow: /settings

Disallow: /inbox

Disallow: /admin_inbox

Disallow: /*?guide*

User-Agent: *

Disallow: /

The contents of Zhihu's robots.txt file are listed above.