常见的爬虫分析库(4)-爬虫之PyQuery
PyQuery 是 Python 仿照 jQuery 的严格实现。语法与 jQuery 几乎完全相同。
官方文档:http://pyquery.readthedocs.io/
安装
|
1
|
pip install pyquery |
初始化
字符串初始化
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
html = '''<div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div>'''from pyquery import PyQuery as pqdoc = pq(html)print(doc('li')) |
URL初始化
|
1
2
3
|
from pyquery import PyQuery as pqdoc = pq(url='http://www.baidu.com')print(doc('head')) |
文件初始化
|
1
2
3
|
from pyquery import PyQuery as pqdoc = pq(filename='demo.html')print(doc('li')) |
基本CSS选择器
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
html = '''<div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div>'''from pyquery import PyQuery as pqdoc = pq(html)print(doc('#container .list li')) |
查找元素
子元素
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = '''<div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div>'''from pyquery import PyQuery as pqdoc = pq(html)items = doc('.list')print(type(items))print(items)lis = items.find('li')print(type(lis))print(lis) |
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = '''<div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div>'''from pyquery import PyQuery as pqdoc = pq(html)items = doc('.list')lis = items.children()print(type(lis))print(lis) |
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
html = '''<div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div>'''from pyquery import PyQuery as pqdoc = pq(html)items = doc('.list')lis = items.children('.active')print(lis) |
父元素
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = '''<div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div>'''from pyquery import PyQuery as pqdoc = pq(html)items = doc('.list')container = items.parent()print(type(container))print(container) |
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)items = doc('.list')parents = items.parents()print(type(parents))print(parents) |
|
1
2
|
parent = items.parents('.wrap')print(parent) |
兄弟元素
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('.list .item-0.active')print(li.siblings()) |
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('.list .item-0.active')print(li.siblings('.active')) |
遍历
单个元素
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('.item-0.active')print(li) |
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)lis = doc('li').items()print(type(lis))for li in lis: print(li) |
获取信息
获取属性
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)a = doc('.item-0.active a')print(a)print(a.attr('href'))print(a.attr.href) |
获取文本
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)a = doc('.item-0.active a')print(a)print(a.text()) |
获取HTML
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('.item-0.active')print(li)print(li.html()) |
DOM操作
addClass、removeClass
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('.item-0.active')print(li)li.removeClass('active')print(li)li.addClass('active')print(li) |
attr、css
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('.item-0.active')print(li)li.attr('name', 'link')print(li)li.css('font-size', '14px')print(li) |
remove
|
1
2
3
4
5
6
7
8
9
10
11
12
|
html = '''<div class="wrap"> Hello, World <p>This is a paragraph.</p> </div>'''from pyquery import PyQuery as pqdoc = pq(html)wrap = doc('.wrap')print(wrap.text())wrap.find('p').remove()print(wrap.text()) |
其他DOM方法 http://pyquery.readthedocs.io/en/latest/api.html
伪类选择器
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
html = '''<div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div>'''from pyquery import PyQuery as pqdoc = pq(html)li = doc('li:first-child')print(li)li = doc('li:last-child')print(li)li = doc('li:nth-child(2)')print(li)li = doc('li:gt(2)')print(li)li = doc('li:nth-child(2n)')print(li)li = doc('li:contains(second)')print(li) |
常见的爬虫分析库(4)-爬虫之PyQuery的更多相关文章
- 常见的爬虫分析库(3)-Python正则表达式与re模块
在线正则表达式测试 http://tool.oschina.net/regex/ 常见匹配模式 模式 描述 \w 匹配字母数字及下划线 \W 匹配非字母数字下划线 \s 匹配任意空白字符,等价于 [\ ...
- 常见的爬虫分析库(2)-xpath语法
xpath简介 1.xpath使用路径表达式在xml和html中进行导航 2.xpath包含标准函数库 3.xpath是一个w3c的标准 xpath节点关系 1.父节点 2.子节点 3.同胞节点 4. ...
- 常见的爬虫分析库(1)-Python3中Urllib库基本使用
原文来自:https://www.cnblogs.com/0bug/p/8893677.html 什么是Urllib? Python内置的HTTP请求库 urllib.request ...
- 爬虫requests库 之爬虫贴吧
首先要观察爬虫的URL规律,爬取一个贴吧所有页的数据,观察点击下一页时URL是如何变化的. 思路: 定义一个类,初始化方法什么都不用管 定义一个run方法,用来实现主要逻辑 3 class Tieba ...
- 2.爬虫 urlib库讲解 异常处理、URL解析、分析Robots协议
1.异常处理 URLError类来自urllib库的error模块,它继承自OSError类,是error异常模块的基类,由request模块产生的异常都可以通过这个类来处理. from urllib ...
- Python爬虫Urllib库的基本使用
Python爬虫Urllib库的基本使用 深入理解urllib.urllib2及requests 请访问: http://www.mamicode.com/info-detail-1224080.h ...
- Python爬虫—requests库get和post方法使用
目录 Python爬虫-requests库get和post方法使用 1. 安装requests库 2.requests.get()方法使用 3.requests.post()方法使用-构造formda ...
- [python爬虫]Requests-BeautifulSoup-Re库方案--Requests库介绍
[根据北京理工大学嵩天老师“Python网络爬虫与信息提取”慕课课程编写 文章中部分图片来自老师PPT 慕课链接:https://www.icourse163.org/learn/BIT-10018 ...
- Python爬虫与数据分析之爬虫技能:urlib库、xpath选择器、正则表达式
专栏目录: Python爬虫与数据分析之python教学视频.python源码分享,python Python爬虫与数据分析之基础教程:Python的语法.字典.元组.列表 Python爬虫与数据分析 ...
随机推荐
- [转载]tmux常用快捷键
Hello World 窗口管理只是 tmux 功能的一小部分,另一个很有用的功能就是,连接到远程主机之后,一旦断开,那么当前账户登录的任务就被取消了,但是使用 tmux 可以在断开之后继续工作,下次 ...
- android 控件设置透明度
问题:java文件中引用组件设置透明度:mGuideLayout.getBackground().setAlpha(125); 一直报null 修改办法:对应的布局文件中添加 android:back ...
- WPF C# int.TryParse的用法
; if (!int.TryParse(item.Tag.ToString(), out comld)) { continue; } 没转换成功就continue 开始写成 if(GetNumber( ...
- 设置LISTControl控件某一行的背景和文字颜色
定义宏 用listcontrol的SetItemData设置某一行的属性,通过自定义属性标识实现. 自定义某行内容颜色属性: #define COLOR_DEFAULT 0 //默认颜色 #defin ...
- C/C++经典面试题一
1.变量的声明和定义有什么区别? 常量:在程序执行过程中,不会发生改变的量,不能被改变的量 变量:在程序执行过程中,可以被改变的量 定义变量的方式:数据类型 变量名 = 常量: int num = 1 ...
- 004_wireshark专题
一.常用的wireshark搜索语法 (1) http.request.uri contains "admin/activities" #搜索URL包含"admin/ac ...
- C# 对话框使用整理
1.保存文件对话框 SaveFileDialog saveFile = new SaveFileDialog(); saveFile.Title = "save file"; sa ...
- MySQL--表操作(innodb表字段数据类型、约束条件)、sql_mode操作
一.创建表的完整语法 #[]内的可有可无,即创建表时字段名和类型是必须填写的,宽度与约束条件是可选择填写的.create table 表名(字段名1 类型[(宽度) 约束条件],字段名2 类型[(宽度 ...
- HTTP连接池
<context:property-placeholder location="classpath:conf/framework/httpclient.properties" ...
- npm dev run 报错
解决办法: npm run dev --port 8088 Error: listen EACCES 0.0.0.0:8080at Object.exports._errnoException (ut ...