Common Libraries for Python Web Scraping

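I. Common libraries

1. requests: used to send HTTP requests.

requests.get("url")

2. selenium: used for browser automation.

3. lxml: an HTML/XML parsing library.

4. beautifulsoup: an HTML parsing library.

5. pyquery: a page-parsing library, said to be easier to use than Beautiful Soup; its syntax closely resembles jQuery.

6. pymysql: a storage library for working with MySQL databases.

7. pymongo: for working with MongoDB databases.

8. redis: a client for Redis, a non-relational database.

9. jupyter: a browser-based notebook.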

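II. What is urllib

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: the request module; it can be used to imitate a browser.

urllib.error: the exception-handling module.

urllib.parse: the URL-parsing module, a set of utilities for tasks such as splitting and joining URLs.

urllib.robotparser: the robots.txt parsing module.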
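As a minimal sketch of the two utility modules, the snippet below splits a URL with urllib.parse and checks it against robots.txt rules with urllib.robotparser. The robots.txt content, the example.com URLs and the helper name is_allowed are assumptions made for illustration; a real crawler would download the file from the target site.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="*"):
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # feed the rules line by line
    return parser.can_fetch(user_agent, url)


# urllib.parse: pull the path out of a URL.
print(urlsplit("http://example.com/private/data.html").path)

# urllib.robotparser: the first URL is allowed, the second is disallowed.
print(is_allowed("http://example.com/index.html"))
print(is_allowed("http://example.com/private/data.html"))
```

Nothing here touches the network, so it can be run before writing any request code.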

Differences between Python 2 and Python 3

Python 2

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

Python 3

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

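Usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Note: cafile, capath and cadefault were deprecated in Python 3.6 and removed in Python 3.13; pass an ssl.SSLContext through context instead.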

Examples:

Example 1:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Example 2:

import urllib.request
import urllib.parse
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Note: when data is supplied the request is sent as a POST; without it the request is sent as a GET.
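The GET/POST rule can be checked without sending anything, because a Request object reports the method it would use. This is a small sketch; the helper name build_request is an assumption made for illustration.

```python
import urllib.parse
import urllib.request


def build_request(url, form=None):
    """Build a Request; a non-None form is URL-encoded into the body."""
    data = None
    if form is not None:
        # urlopen and Request expect the body as bytes, not str.
        data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=data)


get_req = build_request("http://httpbin.org/get")
post_req = build_request("http://httpbin.org/post", {"word": "hello"})

print(get_req.get_method())   # GET, because there is no body
print(post_req.get_method())  # POST, because data is present
print(post_req.data)          # b'word=hello'
```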

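Example 3:

Timeout test

import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

With a one-second timeout this request normally completes.

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # e.reason holds the underlying exception; check whether it was a timeout
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

With a 0.1-second timeout the request is expected to time out, so this prints TIME OUT.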
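A timeout does not always arrive wrapped in URLError: if it happens while the response is being read, it may surface as a bare socket.timeout (an alias of TimeoutError since Python 3.10). The sketch below handles both cases. The helper names is_timeout and fetch are assumptions made for illustration, not part of urllib.

```python
import socket
import urllib.error
import urllib.request

TIMEOUT_TYPES = (socket.timeout, TimeoutError)


def is_timeout(exc):
    """Return True if exc is a timeout, either bare or wrapped in URLError."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, TIMEOUT_TYPES)
    return isinstance(exc, TIMEOUT_TYPES)


def fetch(url, timeout):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError,) + TIMEOUT_TYPES as exc:
        if is_timeout(exc):
            return None
        raise  # not a timeout: let the caller see the real error


# The classification can be tested without any network access.
print(is_timeout(urllib.error.URLError(socket.timeout("timed out"))))
print(is_timeout(urllib.error.URLError("connection refused")))
```

A call such as fetch('http://httpbin.org/get', timeout=0.1) would then return None on a timeout instead of raising.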

Responses

Response type

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output: <class 'http.client.HTTPResponse'>

Status code and response headers

import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.status)  # 200 on success
print(response.getheaders())  # all response headers, as a list of (name, value) tuples
print(response.getheader('Server'))  # the value of a single response header
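To try these attributes without depending on an external site, one option is a throwaway server on the loopback interface. The sketch below assumes loopback networking is available. HelloHandler and fetch_from_local_server are names made up for this example, and the empty ProxyHandler is only there so that proxy environment variables are ignored.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_from_local_server():
    """Start a local server, fetch one page, return (status, header, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # An empty ProxyHandler bypasses any proxy set in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        with opener.open(url, timeout=5) as response:
            return (
                response.status,
                response.getheader("Content-Type"),
                response.read().decode("utf-8"),
            )
    finally:
        server.shutdown()
        server.server_close()


print(fetch_from_local_server())
```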

III. Request: adding headers

urlopen also accepts a Request object, which lets the request carry extra information such as headers.

import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Example:

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'Host': 'httpbin.org'
}
# Named form rather than dict to avoid shadowing the built-in dict type
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
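Headers can also be attached after the Request has been created, using add_header, and inspected before anything is sent. In this sketch the user-agent string example-crawler/0.1 is a made-up value. Request stores header names capitalized, for example User-agent, which matters when looking them up.

```python
from urllib import parse, request

req = request.Request(
    url="http://httpbin.org/post",
    data=parse.urlencode({"name": "Germey"}).encode("utf-8"),
    headers={"Host": "httpbin.org"},
    method="POST",
)

# Add a header after construction; the value is a hypothetical user agent.
req.add_header("User-Agent", "example-crawler/0.1")

print(req.get_method())              # POST
print(req.has_header("Host"))        # True
print(req.get_header("User-agent"))  # example-crawler/0.1
```

Building the headers step by step like this is convenient when some of them are only known at run time.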

IV. Proxies

import urllib.request
# Assumes a proxy is listening locally on port 9743
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
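When several requests share one proxy, the opener construction can be wrapped in a small function. The sketch below builds the opener and inspects its handlers without sending a request. The function name make_proxy_opener is an assumption for illustration, and the proxy address is the same placeholder used above.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Return an opener that routes http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener("http://127.0.0.1:9743")

# build_opener keeps our ProxyHandler in place of the default one.
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(len(proxy_handlers))
print(proxy_handlers[0].proxies)
```

Passing the opener to urllib.request.install_opener would make plain urlopen calls use the proxy as well.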

V. Cookies

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

First way to save cookies: MozillaCookieJar (Mozilla cookies.txt format)

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Second way to save cookies: LWPCookieJar (libwww-perl Set-Cookie3 format)

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies back (load the file with the same class that saved it, here LWPCookieJar)

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
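The ignore_discard flag matters because session cookies (those without an expiry date) are marked as discardable and are skipped on save unless the flag is set. The sketch below shows this offline with a hand-built cookie and a temporary file. The cookie name, value and domain are made up, and make_session_cookie and round_trip are hypothetical helpers.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a cookie with no expiry date (a session cookie) by hand."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(ignore_discard):
    """Save one session cookie, reload the file, return the loaded names."""
    with tempfile.TemporaryDirectory() as tmp:
        filename = os.path.join(tmp, "cookie.txt")
        jar = http.cookiejar.LWPCookieJar(filename)
        jar.set_cookie(make_session_cookie("token", "abc123", "example.com"))
        jar.save(ignore_discard=ignore_discard, ignore_expires=True)

        loaded = http.cookiejar.LWPCookieJar()
        loaded.load(filename, ignore_discard=ignore_discard, ignore_expires=True)
        return [c.name for c in loaded]


print(round_trip(ignore_discard=True))   # ['token']
print(round_trip(ignore_discard=False))  # []
```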

VI. Exception handling

Example 1:

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)  # catch URL errors

Example 2:

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
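The order of the except clauses follows from the class hierarchy: HTTPError derives from URLError, so the more specific class has to be tested first. The sketch below demonstrates this with exceptions constructed by hand rather than raised by a real request. The describe helper and the example URL are assumptions made for illustration.

```python
import email.message
from urllib import error


def describe(exc):
    """Summarize an exception, testing the most specific class first."""
    if isinstance(exc, error.HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return "URL error: %s" % exc.reason
    return "other error: %r" % (exc,)


# Hand-built exceptions standing in for what urlopen might raise.
http_exc = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", email.message.Message(), None
)
url_exc = error.URLError("name resolution failed")

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(http_exc))  # HTTP error 404: Not Found
print(describe(url_exc))   # URL error: name resolution failed
```

If the two isinstance checks were swapped, every HTTPError would be reported as a plain URL error and its status code would be lost.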

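VII. URL parsing

urlparse: splits a URL into its components

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Example:

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

Example 2:

from urllib.parse import urlparse
# The URL has no scheme, so the scheme argument is used as the default
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

Example 3:

from urllib.parse import urlparse
# The URL already has a scheme, so the scheme argument is ignored
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

Example 4:

from urllib.parse import urlparse
# With allow_fragments=False the fragment is not split off; it stays in the query
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

Example 5:

from urllib.parse import urlparse
# With no query present, the unsplit fragment stays in the path
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)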
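The value returned by urlparse is a named tuple, so its six components can be read by attribute or by index. The sketch below reruns the cases above and prints only the fields that change; the variable names are chosen for this example.

```python
from urllib.parse import urlparse

URL = "http://www.baidu.com/index.html;user?id=5#comment"

full = urlparse(URL)
print(full.scheme, full.netloc, full.path, full.params, full.query, full.fragment)
print(full[0], full.geturl() == URL)  # index access and round trip

# No scheme in the URL: the default applies, and the host ends up in the path.
no_scheme = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# Scheme in the URL wins over the scheme argument.
explicit = urlparse(URL, scheme="https")
print(explicit.scheme)

# allow_fragments=False: '#comment' is left attached to the query ...
no_frag = urlparse(URL, allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))

# ... or to the path when the URL has no query.
no_query = urlparse("http://www.baidu.com/index.html#comment", allow_fragments=False)
print(no_query.path)
```

Note that a URL without a scheme and without a leading // is treated as a path, which is why netloc is empty in the second case.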

VIII. Joining URLs

urlunparse: builds a URL from its six components

Example:

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

urljoin

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))

Components present in the second URL override those of the base URL.

urlencode

from urllib.parse import urlencode
params = {
    'name': 'gemey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:

http://www.baidu.com?name=gemey&age=22
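The override rule of urljoin is easiest to see with a few inputs side by side, and parse_qs gives a quick way to confirm what urlencode produced. This is a small sketch; the extra URLs and the build_url helper are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit

# A relative path is resolved against the base.
print(urljoin("http://www.baidu.com", "FAQ.html"))
# http://www.baidu.com/FAQ.html

# A full URL as the second argument replaces the base entirely.
print(urljoin("http://www.baidu.com/about.html", "https://cuiqingcai.com/FAQ.html"))
# https://cuiqingcai.com/FAQ.html

# A bare query keeps the base path and only swaps the query string.
print(urljoin("http://www.baidu.com/docs/index.html", "?category=2"))
# http://www.baidu.com/docs/index.html?category=2


def build_url(base, params):
    """Append URL-encoded params to base as a query string."""
    return base + "?" + urlencode(params)


url = build_url("http://www.baidu.com", {"name": "gemey", "age": 22})
print(url)

# parse_qs reverses urlencode; every value comes back as a list of strings.
print(parse_qs(urlsplit(url).query))
```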
