python爬虫知识点总结(七)PyQuery详解
官方学习文档:http://pyquery.readthedocs.io/en/latest/api.html
一、什么是PyQuery?
答:强大有灵活的网页解析库,模仿jQuery实现。如果你觉得正则表达式写起来太麻烦,如果你觉的BeautifulSoup语法太难记,如果你熟悉jQuery的语法,那么PyQuery就是你的绝佳选择。
二、安装
pip3 install pyquery
三、初始化
1、字符串初始化
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
2、URL初始化
from pyquery import PyQuery as pq
doc = pq(url="http://www.baidu.com")
print(doc('head'))
3、文本初始化
from pyquery import PyQuery as pq
doc = pq(filename = 'demo.html')
print(doc('li'))
四、基本CSS选择器
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
1、查找元素
子元素
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)
lis = items.children()
print(type(lis))
print(lis)
lis = items.children('.active')
print(lis)
父元素
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parents()
print(type(container))
print(container)
parent = items.parents('.wrap')
print(parents)
兄弟元素
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))
五、遍历
1、单个元素
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
print(li)
2、获取信息
获取属性
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))
print(a.attr.href)
获取文本
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))
print(a.text())
获取HTML
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())
六、DOM操作
1、addClass\removeClass
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)
2、attr、css
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name','link')
print(li)
li.css('font-size','14px')
print(li)
3、remove
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())
其他DOM方法
七、伪类选择器
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="blod">thrid item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc("li:first-child")
print(li)
li = doc("li:last-child")
print(li)
# 标签从0开始
li = doc("li:nth-child(2)") # ntj-child(2)获取第2个标签
print(li)
li = doc("li:gt(2)") # gt-child(2)获取比2大的标签
print(li)
li = doc("li:nth-child(2n)") # nth-child(2n)获取偶数的标签
print(li)
li = doc("li:contains(second)") # contains(second)获取包含second文本的标签
print(li)
更多CSS选择器可以查看 http://www.w3school.com.cn/css/index.html
jQuery官方文档:http://jquery.cuishifeng.cn/
python爬虫知识点总结(七)PyQuery详解的更多相关文章
- Python爬虫之selenium库使用详解
Python爬虫之selenium库使用详解 本章内容如下: 什么是Selenium selenium基本使用 声明浏览器对象 访问页面 查找元素 多个元素查找 元素交互操作 交互动作 执行JavaS ...
- python爬虫知识点详解
python爬虫知识点总结(一)库的安装 python爬虫知识点总结(二)爬虫的基本原理 python爬虫知识点总结(三)urllib库详解 python爬虫知识点总结(四)Requests库的基本使 ...
- Python学习二:词典基础详解
作者:NiceCui 本文谢绝转载,如需转载需征得作者本人同意,谢谢. 本文链接:http://www.cnblogs.com/NiceCui/p/7862377.html 邮箱:moyi@moyib ...
- 爬虫入门之urllib库详解(二)
爬虫入门之urllib库详解(二) 1 urllib模块 urllib模块是一个运用于URL的包 urllib.request用于访问和读取URLS urllib.error包括了所有urllib.r ...
- Python网络请求urllib和urllib3详解
Python网络请求urllib和urllib3详解 urllib是Python中请求url连接的官方标准库,在Python2中主要为urllib和urllib2,在Python3中整合成了urlli ...
- python中requests库使用方法详解
目录 python中requests库使用方法详解 官方文档 什么是Requests 安装Requests库 基本的GET请求 带参数的GET请求 解析json 添加headers 基本POST请求 ...
- Python学习一:序列基础详解
作者:NiceCui 本文谢绝转载,如需转载需征得作者本人同意,谢谢. 本文链接:http://www.cnblogs.com/NiceCui/p/7858473.html 邮箱:moyi@moyib ...
- python中argparse模块用法实例详解
python中argparse模块用法实例详解 这篇文章主要介绍了python中argparse模块用法,以实例形式较为详细的分析了argparse模块解析命令行参数的使用技巧,需要的朋友可以参考下 ...
- python selenium 三种等待方式详解[转]
python selenium 三种等待方式详解 引言: 当你觉得你的定位没有问题,但是却直接报了元素不可见,那你就可以考虑是不是因为程序运行太快或者页面加载太慢造成了元素不可见,那就必须要加等待 ...
- Python爬虫利器六之PyQuery的用法
前言 你是否觉得 XPath 的用法多少有点晦涩难记呢? 你是否觉得 BeautifulSoup 的语法多少有些悭吝难懂呢? 你是否甚至还在苦苦研究正则表达式却因为少些了一个点而抓狂呢? 你是否已经有 ...
随机推荐
- 系统安全-Google authenticator
对于某些人来说,盗取密码会比你想象的更简单 以下任意一种常见的操作都可能让你的密码面临被盗的风险: 在多个网站上使用同一个密码 从互联网上下载软件 点击电子邮件中的链接 两步验证可以将别有用心的人阻 ...
- matlab2017b linux版分享
链接:https://pan.baidu.com/s/1smrTkFN 密码:cvb3 下载后请点关注并点赞,谢谢支持.
- Excel COM组件使用的注意事项和一些权限问题(转载)
1.实例化Excel的COM组件的时候,不要直接调用类,要用Microsoft提供的接口 原来的写法:Excel.ApplicationClass excelApp = new Excel.Appli ...
- [转]XMPP基本概念--节(stanza)
本文介绍在XMPP通信中最核心的三个XML节(stanza).这些节(stanza)有自己的作用和目标,通过组织不同的节(stanza),就能达到我们各种各样的通信目的. 首先我们来看一段XMPP流. ...
- 深入Asyncio(八)异步迭代器
Async Iterators: async for 除了async def和await语法外,还有一些其它的语法,本章学习异步版的for循环与迭代器,不难理解,普通迭代器是通过__iter__和__ ...
- 局部描述符表LDT的作用+定义+初始化+跳转相关
[0]写在前面 0.1)本代码的作用: 旨在说明局部描述符表的作用,及其相关定义,初始化和跳转等内容: 0.2)文末的个人总结是干货,前面代码仅供参考的,且source code from orang ...
- ASP.NET RemoteAttribute远程验证更新问题
create时使用remote特性没有任何问题, update时,问题就大了,验证唯一性时需要排除自身,如果使用这个特性将无法正确的验证. 改进思路:将自动生成的标签属性改为手写,,并在url上面加上 ...
- Swift———a Glance(极客学院)笔记
http://www.swiftv.cn/course/hw4sysi7 本课程很短,加起来一个小时,适合作为一个快速了解. 两本书: apple官方<The Swift Programmi ...
- 37、pendingIntent 点击通知栏进入页面
转载: http://blog.csdn.net/yuzhiboyi/article/details/8484771 https://my.oschina.net/youranhongcha/blog ...
- c++ get the pointer from the reference
int x = 5; int& y = x; int* xp = &x; int* yp = &y; xp is equal to yp. 也就是说,直接对reference取 ...