What is PyQuery?

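PyQuery is a jQuery-like library for parsing HTML and XML in Python. It is built on lxml and lets you query a document with jQuery-style CSS selectors.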

Initialization

Initializing from a string

from pyquery import PyQuery as pq

html="""
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1">
<a href="link2.html">second item</a>
</li>
<li class="item-0 active">
<a href="link3.html">
<span class="bold">third item</span>
</a>
</li>
<li class="item-1 active">
<a href="link4.html">fourth item
</a>
</li>
<li class="item-0">
<a href="link5.html">
fifth item
</a>
</li>
</ul>
</div>
"""
doc = pq(html)
print(doc("li"))

Output:

<li class="item-0">first item</li>
<li class="item-1">
<a href="link2.html">second item</a>
</li>
<li class="item-0 active">
<a href="link3.html">
<span class="bold">third item</span>
</a>
</li>
<li class="item-1 active">
<a href="link4.html">fourth item
</a>
</li>
<li class="item-0">
<a href="link5.html">
fifth item
</a>
</li>

Initializing from a URL

from pyquery import PyQuery as pq

doc = pq(url="http://www.baidu.com")
print(doc("head"))

Output (the exact markup depends on what the live page returns):

<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>

Initializing from a file

from pyquery import PyQuery as pq

doc = pq(filename="demo.html")
print(doc("li"))

Basic CSS selectors

from pyquery import PyQuery as pq

html= """
<div id = "container">
<ul>
<li class="item-0">
first item
</li>
<li class="item-1">
<a href="link2.html">second item</a>
</li>
<li class="item-2 active">
<a href="link3.html">
<span class="bold">third item</span>
</a>
</li>
<li class="item-3 active">
<a href="link4.html">fourth item
</a>
</li>
<li class="item-4">
<a href="link5.html">
fifth item
</a>
</li>
</ul>
</div>
"""
doc = pq(html)
print(doc("#container .item-0"))

Output:

<li class="item-0">
first item
</li>

Finding elements

Child elements

from pyquery import PyQuery as pq

html= """
<div id = "container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = pq(html)
items = doc(".list")
print(type(items))
docs = items.find("li")  # find() searches all descendants of the selection
print(type(docs))
print(docs)

Output:

<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>

from pyquery import PyQuery as pq

html= """
<div id = "container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = pq(html)
items = doc(".list")
docs = items.children()  # all direct children
docs1 = items.children(".active")  # direct children that also have the active class
print(docs)
print(docs1)

Output:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>

Parent elements

from pyquery import PyQuery as pq

html= """
<div id = "container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = pq(html)
items = doc(".list")
docs = items.parent()  # parent() returns the direct parent element
print(docs)
print(type(docs))

Output:

<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
<class 'pyquery.pyquery.PyQuery'>

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".list")
docs = items.parents()  # parents() returns all ancestor elements
print(type(docs))
print(docs)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div><div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>

Every ancestor is printed in full, so the markup of the inner ancestor appears twice. If the input is not well-formed and PyQuery falls back to the lxml HTML parser, the implicit <html> and <body> elements may show up in the result as well.

Sibling elements

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-0")
print(type(items.siblings()))
print(items.siblings())

Output:

<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
# .item-0.active matches only elements that carry both classes.
# No element does here, so this selection is empty.
items = doc(".item-0.active")
item = doc(".item-0")
print(item.siblings())
print(type(items.siblings()))
print(items.siblings())  # prints an empty line

Output:

<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>

Traversal

Single element

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-0.active")  # matches elements that have both the item-0 and active classes
print(type(items))
print(items)

Output:

<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active">first item</li>

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc("li").items()  # items() returns a generator of PyQuery objects
print(type(items))
for li in items:
    print(li)

Output:

<class 'generator'>
<li class="item-0 active">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>

Getting information

Getting attributes

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-1 a")
print(items.attr("href"))
print(items.attr.href)

Output:

link2.html
link2.html

Getting text

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-1")
print(items.text())

Output:

second item

Getting HTML

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-1")
print(items)
print(items.html())

Output:

<li class="item-1"><a href="link2.html">second item</a></li>

<a href="link2.html">second item</a>

DOM manipulation

addClass, removeClass

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-1")
print(items.remove_class("active"))
print(items.add_class("actives"))

Output:

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-1 actives"><a href="link2.html">second item</a></li>

attr, css

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-1")
print(items.attr("name", "names"))
print(items.css("font-size", "14px"))

Output:

<li class="item-1 active" name="names"><a href="link2.html">second item</a></li>

<li class="item-1 active" name="names" style="font-size: 14px"><a href="link2.html">second item</a></li>

remove

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1 active"><a href="link2.html">second item<p>Third times</p></a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc(".item-1")
print(items.text())
print("---------------")
items.find('p').remove()  # remove the <p> element from the tree
print(items.text())

Output:

second item
Third times
---------------
second item

Other DOM methods

https://pythonhosted.org/pyquery/api.html

Pseudo-class selectors

from pyquery import PyQuery as pq

html= """
<div class = "wrap">
<div id = "container">
<ul class="list">
<li class="item-0 active">first item</li>
<li class="item-1 active"><a href="link2.html">second item<p>Third times</p></a></li>
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
li = doc("li:first-child")
print(li)
print("------------------------------------------------------------")
li = doc("li:last-child")
print(li)
print("------------------------------------------------------------")
li = doc("li:gt(2)")  # li elements whose zero-based index is greater than 2
print(li)
print("------------------------------------------------------------")
li = doc("li:nth-child(2)")  # the li that is the second child
print(li)
print("------------------------------------------------------------")
li = doc("li:nth-child(2n)")  # li elements at even positions
print(li)
print("------------------------------------------------------------")
li = doc("li:contains(second)")  # li elements whose text contains "second"
print(li)

Output:

<li class="item-0 active">first item</li>
------------------------------------------------------------
<li class="item-4"><a href="link5.html">fifth item</a></li>
------------------------------------------------------------
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
<li class="item-4"><a href="link5.html">fifth item</a></li>
------------------------------------------------------------
<li class="item-1 active"><a href="link2.html">second item<p>Third times</p></a></li>
------------------------------------------------------------
<li class="item-1 active"><a href="link2.html">second item<p>Third times</p></a></li>
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
------------------------------------------------------------
<li class="item-1 active"><a href="link2.html">second item<p>Third times</p></a></li>

For more CSS selectors, see: http://www.w3school.com.cn/css/index.asp

The official PyQuery documentation: https://pythonhosted.org/pyquery/api.html
