Common Scraping and Parsing Libraries (4): PyQuery
PyQuery is a Python library modeled closely on jQuery. Its selector and traversal syntax is almost identical to jQuery's.
Official documentation: http://pyquery.readthedocs.io/
Installation

    pip install pyquery
Initialization
Initializing from a string

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)     # parse the HTML string
    print(doc('li'))   # select every <li> element
Initializing from a URL

    from pyquery import PyQuery as pq

    doc = pq(url='http://www.baidu.com')   # fetches the page, then parses it
    print(doc('head'))
Initializing from a file

    from pyquery import PyQuery as pq

    doc = pq(filename='demo.html')   # reads and parses a local file
    print(doc('li'))
Basic CSS selectors

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    # id selector, then class selector, then tag selector (descendant combinators)
    print(doc('#container .list li'))
Finding elements
Child elements

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    items = doc('.list')
    print(type(items))
    print(items)
    lis = items.find('li')   # find() searches all descendants
    print(type(lis))
    print(lis)

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    items = doc('.list')
    lis = items.children()   # children() returns direct children only
    print(type(lis))
    print(lis)

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    items = doc('.list')
    lis = items.children('.active')   # direct children filtered by a selector
    print(lis)
Parent elements

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    items = doc('.list')
    container = items.parent()   # parent() returns the direct parent
    print(type(container))
    print(container)

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    items = doc('.list')
    parents = items.parents()   # parents() returns every ancestor
    print(type(parents))
    print(parents)

    # Continuing from the previous example: pass a selector to filter the ancestors
    parent = items.parents('.wrap')
    print(parent)
Sibling elements

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('.list .item-0.active')   # the third <li>
    print(li.siblings())               # all other <li> elements

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('.list .item-0.active')
    print(li.siblings('.active'))   # siblings filtered by a selector
Iteration
A single element

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('.item-0.active')   # matches exactly one element
    print(li)
Multiple elements

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    lis = doc('li').items()   # items() returns a generator of PyQuery objects
    print(type(lis))
    for li in lis:
        print(li)
Extracting information
Getting attributes

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.attr('href'))   # method-call form
    print(a.attr.href)      # attribute-access form, same result
Getting text

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.text())   # text content with all tags stripped
Getting HTML

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())   # inner HTML of the element
DOM manipulation
addClass and removeClass

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.removeClass('active')   # modifies the element in place
    print(li)
    li.addClass('active')
    print(li)
attr and css

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name', 'link')        # set an attribute
    print(li)
    li.css('font-size', '14px')    # set an inline style
    print(li)
remove

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())        # includes the paragraph text
    wrap.find('p').remove()   # drop the <p> element
    print(wrap.text())        # only "Hello, World" remains
Other DOM methods: http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq

    doc = pq(html)
    li = doc('li:first-child')       # first <li>
    print(li)
    li = doc('li:last-child')        # last <li>
    print(li)
    li = doc('li:nth-child(2)')      # second <li>
    print(li)
    li = doc('li:gt(2)')             # <li> elements with zero-based index > 2
    print(li)
    li = doc('li:nth-child(2n)')     # even-numbered <li> elements
    print(li)
    li = doc('li:contains(second)')  # <li> elements whose text contains "second"
    print(li)