Official documentation: https://pyquery.readthedocs.io/en/latest/

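PyQuery is a powerful and flexible HTML parsing library. If you find regular expressions tedious to write and BeautifulSoup's syntax hard to remember, and you already know jQuery's syntax, PyQuery is an excellent choice.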

1. Getting Started

Initializing from a string:

from pyquery import PyQuery as pq
d = pq("<html>hahaha</html>") # d now works like jQuery's $
print(d("html"))

Initializing from a URL:

from pyquery import PyQuery as pq
d = pq(url="https://www.baidu.com")
print(d("head"))

Initializing from a file:

from pyquery import PyQuery as pq
d = pq(filename='demo.html') # filename is the path to the file
print(d("head"))

2. Basic CSS Selectors

html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
print(d("#container .list li"))

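3. Finding Elements

Descendant elements

find() searches all descendants of the matched elements (children() would restrict the search to direct children):
d("<CSS selector>").find("li")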
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
print(type(items)) # <class 'pyquery.pyquery.PyQuery'>
li = items.find("li")
print(type(li)) # <class 'pyquery.pyquery.PyQuery'>
print(li)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""

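Parent and ancestor elements

parents() returns every ancestor of the matched elements, optionally filtered by a selector (parent() returns only the direct parent):
d("<CSS selector>").parents(<CSS selector (optional)>)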
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents()
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""

d(".list").parents()

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents(".wrap")
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""

d(".list").parents(".wrap")

Sibling elements

d("<CSS selector>").siblings(<CSS selector (optional)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings())
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings(".active"))
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""

4. Iteration

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li").items()
print(type(li)) # <class 'generator'>
for i in li:
    print(i)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""

5. Getting Information

Getting attributes

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.attr("href"))
print(a.attr.href)

Getting text

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.text())
"""
third item
"""

Getting HTML

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
print(li.html())
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
"""

6. DOM Manipulation

addClass()、removeClass()

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.removeClass("active")
print(li)
li.addClass("active")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
"""

attr()、css()

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.attr("name", "link")
print(li)
li.css("font-size", "14px")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
"""

remove()

html = """
<div class="wrap">
Hello, World.
<p>This is a paragraph.</p>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
wrap = d(".wrap")
print(wrap.text())
"""
Hello, World.
This is a paragraph.
"""
wrap.find("p").remove()
print(wrap.text()) # Hello, World.

Other DOM methods

https://pyquery.readthedocs.io/en/latest/api.html

7. Pseudo-class Selectors

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li:first-child")
print(li) # <li class="item-0">first item</li>
li = d("li:last-child")
print(li) # <li class="item-0"><a href="link5.html">fifth item</a></li>
li = d("li:nth-child(2)")
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>
li = d("li:gt(2)") # indices start at 0; matches elements with index greater than 2
print(li)
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
li = d("li:nth-child(2n)") # elements at even positions (counting from 1)
print(li)
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""
li = d("li:contains(second)") # match by text: elements whose text contains "second"
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>

More selectors: http://www.w3school.com.cn/cssref/css_selectors.asp
