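Common Parsing Libraries for Web Scraping (4): PyQuery
PyQuery is a Python library modeled closely on jQuery. Its selector syntax and most method names are almost identical to jQuery's.
Official documentation: http://pyquery.readthedocs.io/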
Installation

    pip install pyquery
Initialization
From a string

    from pyquery import PyQuery as pq

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)    # parse the HTML string
    print(doc('li'))  # select every <li> element
From a URL

    from pyquery import PyQuery as pq

    # PyQuery fetches the page itself and parses the response.
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))
From a file

    from pyquery import PyQuery as pq

    # demo.html must exist in the current working directory.
    doc = pq(filename='demo.html')
    print(doc('li'))
Basic CSS selectors

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    # id selector, then class selector, then tag selector (descendant combinators)
    print(doc('#container .list li'))
Finding elements
Child elements

find() searches all descendants of the current selection:

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    print(type(items))  # the result is itself a PyQuery object
    print(items)

    lis = items.find('li')  # descendants matching 'li'
    print(type(lis))
    print(lis)
children() returns only the direct children:

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    lis = items.children()  # direct children only
    print(type(lis))
    print(lis)
children() also accepts a selector to filter the direct children:

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    lis = items.children('.active')  # direct children with class "active"
    print(lis)
Parent elements

parent() returns the direct parent:

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    container = items.parent()  # the enclosing <div id="container">
    print(type(container))
    print(container)
parents() returns every ancestor:

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    parents = items.parents()  # all ancestors: #container and .wrap
    print(type(parents))
    print(parents)
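parents() also accepts a selector to filter the ancestors. This snippet reuses `items` from the parents() example that uses the `.wrap` markup:

    parent = items.parents('.wrap')  # only the ancestor with class "wrap"
    print(parent)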
Sibling elements

siblings() returns every sibling of the selected element:

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    # An element carrying both classes "item-0" and "active" (the third <li>)
    li = doc('.list .item-0.active')
    print(li.siblings())
siblings() also accepts a selector to filter the siblings:

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.list .item-0.active')
    print(li.siblings('.active'))  # only siblings with class "active"
Traversal
A single element

When the selector matches a single element, the result can be used directly:

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
When the selector matches several elements, items() returns a generator that yields each one as a PyQuery object:

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    lis = doc('li').items()
    print(type(lis))  # a generator
    for li in lis:
        print(li)
Extracting information
Getting attributes

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.attr('href'))  # method call
    print(a.attr.href)     # equivalent attribute-style access
Getting text

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.text())  # text content with all tags stripped
Getting HTML

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())  # inner HTML of the element
DOM manipulation
addClass and removeClass

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.removeClass('active')  # class becomes "item-0"
    print(li)
    li.addClass('active')     # class becomes "item-0 active" again
    print(li)
attr and css

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name', 'link')        # two arguments: set the attribute
    print(li)
    li.css('font-size', '14px')    # written into the style attribute
    print(li)
remove

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''

    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())        # includes the paragraph text
    wrap.find('p').remove()   # drop the <p> from the tree
    print(wrap.text())        # only "Hello, World" remains
For other DOM methods, see the API reference: http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)

    li = doc('li:first-child')        # first child of its parent
    print(li)
    li = doc('li:last-child')         # last child of its parent
    print(li)
    li = doc('li:nth-child(2)')       # second child
    print(li)
    li = doc('li:gt(2)')              # elements after index 2 (zero-based)
    print(li)
    li = doc('li:nth-child(2n)')      # every even-numbered child
    print(li)
    li = doc('li:contains(second)')   # elements whose text contains "second"
    print(li)