爬虫之pyquery库
官方文档:https://pyquery.readthedocs.io/en/latest/
PyQuery是一个强大又灵活的网页解析库。如果你觉得正则写起来太麻烦、BeautifulSoup语法太难记,而你熟悉jQury的语法,那么PyQuery就是你的绝佳选择。
一、开始
字符串初始化:
from pyquery import PyQuery as pq
d = pq("<html>哈哈哈</html>") # 现在d就相当于jQuery的$
print(d("html"))
URL初始化:
from pyquery import PyQuery as pq
d = pq(url="https://www.baidu.com")
print(d("head"))
文件初始化:
from pyquery import PyQuery as pq
d = pq(filename='demo.html') # filename指定文件路径
print(d("head"))
二、基本CSS选择器
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
print(d("#container .list li"))
三、查找元素
子元素
d("css选择器").find("li")
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
print(type(items)) # <class 'pyquery.pyquery.PyQuery'>
li = items.find("li")
print(type(li)) # <class 'pyquery.pyquery.PyQuery'>
print(li)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
父元素
d("css选择器").parent(<css选择器(可无)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents()
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
d(".list").parents()
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents(".wrap")
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
d(".list").parents(".wrap")
兄弟元素
d("css选择器").siblings(<css选择器(可无)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings())
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings(".active"))
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""
四、遍历
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li").items()
print(type(li)) # <class 'generator'>
for i in li:
print(i)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
五、获取信息
获取属性
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.attr("href"))
print(a.attr.href)
获取文本
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.text())
"""
third item
"""
获取html
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
print(li.html())
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
"""
六、DOM操作
addClass()、removeClass()
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.removeClass("active")
print(li)
li.addClass("active")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
"""
attr()、css()
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.attr("name", "link")
print(li)
li.css("font-size", "14px")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
"""
remove()
html = """
<div class="wrap">
Hello, World.
<p>This is a paragraph.</p>
</div>
""" from pyquery import PyQuery as pq
d = pq(html)
wrap = d(".wrap")
print(wrap.text())
"""
Hello, World.
This is a paragraph.
"""
wrap.find("p").remove()
print(wrap.text()) # Hello, World.
其他DOM方法
https://pyquery.readthedocs.io/en/latest/api.html
七、伪类选择器
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li:first-child")
print(li) # <li class="item-0">first item</li>
li = d("li:last-child")
print(li) # <li class="item-0"><a href="link5.html">fifth item</a></li>
li = d("li:nth-child(2)")
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>
li = d("li:gt(2)") # 从0开始计数,索引大于2
print(li)
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
li = d("li:nth-child(2n)") # 获取偶数顺序的元素(从1开始)
print(li)
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""
li = d("li:contains(second)") # 根据文本匹配,匹配文本包含second的标签
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>
更多选择器:http://www.w3school.com.cn/cssref/css_selectors.asp
爬虫之pyquery库的更多相关文章
- Python爬虫之pyquery库的基本使用
# 字符串初始化 html = ''' <div> <ul> <li class = "item-0">first item</li> ...
- python爬虫从入门到放弃(七)之 PyQuery库的使用
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- 爬虫常用库之pyquery 库
pyquery库是jQuery的Python实现,可以用于解析HTML网页内容,我个人写过的一些抓取网页数据的脚本就是用它来解析html获取数据的.他的官方文档地址是:http://packages. ...
- Python爬虫-- PyQuery库
PyQuery库 PyQuery库也是一个非常强大又灵活的网页解析库,PyQuery 是 Python 仿照 jQuery 的严格实现.语法与 jQuery 几乎完全相同,所以不用再去费心去记一些奇怪 ...
- PYTHON 爬虫笔记六:PyQuery库基础用法
知识点一:PyQuery库详解及其基本使用 初始化 字符串初始化 html = ''' <div> <ul> <li class="item-0"&g ...
- 第四节:Web爬虫之pyquery解析库
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- python之爬虫(九)PyQuery库的使用
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- Python3 网络爬虫(请求库的安装)
Python3 网络爬虫(请求库的安装) 爬虫可以简单分为几步:抓取页面,分析页面和存储数据 在页面爬取的过程中我们需要模拟浏览器向服务器发送请求,所以需要用到一些python库来实现HTTP的请求操 ...
- 爬虫之PyQuery的base了解
爬虫之PyQuery的base了解 pyquery库是jQuery的Python实现,能够以jQuery的语法来操作解析 HTML 文档,易用性和解析速度都很好,和它差不多的还有BeautifulSo ...
随机推荐
- [ZJOI 2007] 矩阵游戏
[题目链接] https://www.lydsy.com/JudgeOnline/problem.php?id=1059 [算法] 二分图最大匹配 时间复杂度 : O(N^3) [代码] #inclu ...
- 洛谷 P1328 生活大爆炸版石头剪刀布 —— 模拟
题目:https://www.luogu.org/problemnew/show/P1328 直接模拟即可. 代码如下: #include<iostream> #include<cs ...
- MIPI接口
接口 分辨率 说明 RGB 800*480以下 大部分AP均支持RGB接口,此类LCD在低端平板广泛使用 LVDS 1024*768及以上 主要通过转换芯片将RGB等专程LVDS来支持:少量AP直接集 ...
- 14. extjs中treepanel属性和方法
转自:http://www.cnblogs.com/connortang/p/4414907.html 1.Ext.tree.TreePanel 主要配置项: root:树的根节点. rootVisi ...
- [App Store Connect帮助]四、添加 App 图标、App 预览和屏幕快照(3)上传 App 预览和屏幕快照
请上传至多三个 App 预览和至多十张屏幕快照.如果您的 App 在不同设备尺寸和本地化内容间都相同,仅提供所要求的最高分辨率的屏幕快照即可. 对于 iPhone,必须提供用于 5.5 英寸设备(iP ...
- (斯特林公式)51NOD 1058 N的阶乘的长度
输入N求N的阶乘的10进制表示的长度.例如6! = 720,长度为3. Input 输入N(1 <= N <= 10^6) Output 输出N的阶乘的长度 Input示例 6 Out ...
- python中socket编程
一.网络协议 客户端/服务器架构 1.硬件C/S架构(打印机) 2.软件C/S架构(互联网中处处是C/S架构):B/S架构也是C/S架构的一种,B/S是浏览器/服务器 C/S架构与socket的关系: ...
- Linux学习笔记之Linux常用命令剖析-cat/chmod/cd
1.cat:用于连接文件并打印到标准输出设备上.(使用权限:所有使用者) 语法格式:cat [-AbeEnstTuv] [--help] [--version] fileName 参数说明: -n 或 ...
- Linux 常规操作指南
1.修改Linux服务器别名 临时修改: vim /etc/hostname 修改别名 永久修改: vim /etc/sysconfig/network 添加 HOSTNAME=别名 重启服务器 ...
- C# 的占位符
static void Main(string[] args) { Console.WriteLine("A:{0},a:{1}",65,97); Console.ReadLine ...