Spider: Basic PyQuery Usage for Python Web Scraping
1. Installation
pip install pyquery
2. Import
from pyquery import PyQuery as pq
3. Overview
pyquery is a jQuery-like HTML parsing library for Python; its usage resembles bs4 (BeautifulSoup).
4. Usage
4.1 Initialization:
from pyquery import PyQuery as pq
doc = pq(html)  # parse an HTML string
doc = pq(url="http://news.baidu.com/")  # fetch and parse a web page
doc = pq(filename="./a.html")  # parse a local HTML file
4.2 Basic CSS selectors
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
print(doc("#wrap .s_from link"))
Output:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
# matches an id, . matches a class, and link matches link tags; a space between selectors selects descendants of the preceding match.
4.3 Finding child elements
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
# Find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("Type: %s" % type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)
Output:
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
Type: <class 'pyquery.pyquery.PyQuery'>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
The output shows that the result is a PyQuery object and that both find and children return the nested ul here. They differ in scope: find searches all descendants, while children returns only direct children.
4.4 Finding parent elements
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
items = doc(".s_from")
print(items)
# Find the parent element
parent_href = items.parent()
print(parent_href)
Output:
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
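parent returns the direct parent of the selected element, including everything it contains. The related parents method returns all ancestor nodes.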
4.5 Finding sibling elements
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
items = doc("link.active1.a123")
print(items)
# Find sibling elements
siblings_href = items.siblings()
print(siblings_href)
Output:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
As the output shows, siblings returns the other elements at the same level.
Conclusion: child, parent, and sibling lookups all return PyQuery objects, so each result can be queried again.
4.6 Iterating over results
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it)
Output:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
4.7 Getting attributes
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))
    print(it.attr.href)
Output:
http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com
4.8 Getting text
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())
Output:
asdadasdad12312
asdadasdad12312
asdadasdad12312
4.9 Getting HTML
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())
Output:
<a>asdadasdad12312</a>
asdadasdad12312
asdadasdad12312
5. Common DOM operations
5.1 addClass removeClass
Add or remove a class.
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Added: %s" % it.addClass('active1'))
    print("Removed: %s" % it.removeClass('active1'))
Output:
Added: <link class="active1 a123" href="http://asda.com"><a>asdadasdad12312</a></link>
Removed: <link class="a123" href="http://asda.com"><a>asdadasdad12312</a></link>
Added: <link class="active2 active1" href="http://asda1.com">asdadasdad12312</link>
Removed: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
Added: <link class="movie1 active1" href="http://asda2.com">asdadasdad12312</link>
Removed: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Note that a class which is already present is not added a second time.
5.2 attr css
attr gets or sets an attribute; css sets a property in the style attribute.
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Modified: %s" % it.attr('class', 'active'))
    print("Added: %s" % it.css('font-size', '14px'))
Output:
Modified: <link class="active" href="http://asda.com"><a>asdadasdad12312</a></link>
Added: <link class="active" href="http://asda.com" style="font-size: 14px"><a>asdadasdad12312</a></link>
Modified: <link class="active" href="http://asda1.com">asdadasdad12312</link>
Added: <link class="active" href="http://asda1.com" style="font-size: 14px">asdadasdad12312</link>
Modified: <link class="active" href="http://asda2.com">asdadasdad12312</link>
Added: <link class="active" href="http://asda2.com" style="font-size: 14px">asdadasdad12312</link>
attr and css modify the selected elements in place.
5.3 remove
remove deletes elements.
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("div")
print('Text before removal:\n%s' % its.text())
it = its.remove('ul')
print('Text after removal:\n%s' % it.text())
Output:
Text before removal:
hello nihao
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
Text after removal:
hello nihao
For other DOM methods, see the pyquery API documentation.
6. Pseudo-class selectors
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link:first-child")
print('First element: %s' % its)
its = doc("link:last-child")
print('Last element: %s' % its)
its = doc("link:nth-child(2)")
print('Second element: %s' % its)
its = doc("link:gt(0)")  # index starts at 0
print("Elements after index 0: %s" % its)
its = doc("link:nth-child(2n-1)")
print("Odd-positioned elements: %s" % its)
its = doc("link:contains('hello')")
print("Elements whose text contains hello: %s" % its)
Output:
First element: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
Last element: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Second element: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
Elements after index 0: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Odd-positioned elements: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Elements whose text contains hello: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
For more CSS selectors, consult a CSS selector reference.