pyquery和我们之前用的jQuery有着异曲同工之处,使用起来更加方便,基本能满足大部分时候我们的需求。

  先引入一个小事例展示pyquery的操作:

html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
#引入pyquery
from pyquery import PyQuery as pq
#对html进行解析
doc = pq(html)
#选择出li的标签
print(doc('li'))
#所以li标签都被抽取出来
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

  pyquery可以直接解析url

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> 

  也可以是文件格式的html

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

  pyquery支持css基本选择器的使用,这点和bs4一样:

html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

  可以对解析出来的pyquery对象再次解析出子元素

html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))类型都是pyquery的obj
print(items)
lis = items.find('li')#查找子元素
print(type(lis))
print(lis)
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul> <class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

find的结果

  或者使用children方法获取所有子元素

lis = items.children()
print(type(lis))
print(lis)
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

children方法

  也可以使用children方法获取指定的子元素

lis = items.children('.active')
print(lis)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

获取到指定的子元素

  pyquery使用parent获取父元素

html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

运行结果

  同样的获取所有的父节点则使用parents方法

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)
<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div><div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

运行结果

  获取到指定的父节点

parent = items.parents('.wrap')
print(parent)
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>

运行结果

  使用siblings获取所有的兄弟元素

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

运行结果

  同样可以获取指定的兄弟节点

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

运行结果

  多个选择器得到某一个标签

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
#加空格就是自带选择器,不加就是and关系
li = doc('.item-0.active')
print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

  迭代选择出的所有标签

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()#这是一个迭代器
print(type(lis))
for li in lis:
print(li)
<class 'generator'>
<li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li>

运行结果

  选择到需要的标签往往不够,我们时常需要从标签中获取到属性。

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)#一个标签
print(a.attr('href'))#两种获取属性的方法
print(a.attr.href)
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html

运行结果

  获取文本

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())
<a href="link3.html"><span class="bold">third item</span></a>
third item

运行结果

  获取元素内的html

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<a href="link3.html"><span class="bold">third item</span></a>

运行结果

  pyquery进行dom操作

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
#移除某个类名
li.removeClass('active')
print(li)
#增加某个类名
li.addClass('active')
print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

运行结果

  增加属性的两种方式

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')
print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

运行结果

  使用remove移除整个标签

html = '''
<div class="wrap">
Hello, World
<p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())
Hello, World
This is a paragraph.
Hello, World

运行结果

  最后就是伪类选择器

html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)
<li class="item-0">first item</li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li> <li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-1"><a href="link2.html">second item</a></li>

运行结果

  以上就是常用的pyquery的操作,更多pyquery操作可以查看官方文档,pyquery: a jquery-like library for python

pyquery操作的更多相关文章

  1. pyquery的简单操作

    一.初始化 1.html初始化 html = ''' <div> <ul> <li class="item-0">first item</ ...

  2. python爬虫神器PyQuery的使用方法

    你是否觉得 XPath 的用法多少有点晦涩难记呢? 你是否觉得 BeautifulSoup 的语法多少有些悭吝难懂呢? 你是否甚至还在苦苦研究正则表达式却因为少些了一个点而抓狂呢? 你是否已经有了一些 ...

  3. Python解析HTML的开发库pyquery

    PyQuery是一个类似于jQuery的Python库,也可以说是jQuery在Python上的实现,能够以 jQuery 的语法来操作解析 HTML 文档,易用性和解析速度都很好. 例如,一段豆瓣h ...

  4. 【Python爬虫】安装 pyQuery 遇到的坑 Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?

    windows 64位操作系统下,用 Python 抓取网页,并用 pyQuery 解析网页 pyQuery是jQuery在python中的实现,能够以jQuery的语法来操作解析HTML文档,十分方 ...

  5. PyQuery基本操作介绍

    PyQuery基本操作介绍 PyQuery为Python提供一个类似于jQuery对HTML的操作方式,可以使用jQuery的语法对html文档进行查询操作. 本文以百度首页为例来介绍PyQuery的 ...

  6. PyQuery查询html信息

    以下代码主要演示使用pyquery进行对html文件的解析,包括设定编码,对子块进行查询等操作: from pyquery import PyQuery as pq import os from lx ...

  7. python爬虫从入门到放弃(七)之 PyQuery库的使用

    PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...

  8. 芝麻HTTP: Python爬虫利器之PyQuery的用法

    前言 你是否觉得 XPath 的用法多少有点晦涩难记呢? 你是否觉得 BeautifulSoup 的语法多少有些悭吝难懂呢? 你是否甚至还在苦苦研究正则表达式却因为少些了一个点而抓狂呢? 你是否已经有 ...

  9. PyQuery用法详解

    PyQuery是强大而又灵活的网页解析库,如果你觉得正则写起来太麻烦,如果你觉得BeautifulSoup语法太难记,如果你熟悉jQuery的语法 那么,PyQuery就是你绝佳的选择. 一.初始化方 ...

随机推荐

  1. OpenCascade:屏闪问题。

    1.在OnDraw中同时调用用V3d_View::Redaw()和 V3d_View::FitAll();可暂时解决. 2.在OnDraw中同时调用用V3d_View::Update();

  2. Xcode的Git管理

    在Xcode中创建工程的时候,我们很容易的可以将新创建的工程添加到Git中,如图: 但是如果是本地已经有的工程,那该如何添加到Git中呢? 首先终端进入到该工程的目录. 然后: git init gi ...

  3. CF-1111 (2019/2/7 补)

    CF-1111 题目链接 A. Superhero Transformation tags : strings #include <bits/stdc++.h> using namespa ...

  4. hihoCoder-1109-堆优化的Prim

    优先队列是由堆组成的,所以当我们使用优先队列对Prim进行优化时,就把这种优化叫做堆优化. 它的算法核心思想就是每次向后找边,每个pair存的都是下一个点,以及边权.我们对于已经走过的点就避开,这样就 ...

  5. Python Third Day-文件处理

    文件处理 打开文件,得到文件句柄并赋值给一个变量f=open('a.txt','r',encoding='utf-8')#默认打开的方式为r指的是文本文件,全名为‘rt’#w文件方式指的是如果有a.t ...

  6. 【REDIS】 redis-cli 命令

    Redis提供了丰富的命令(command)对数据库和各种数据类型进行操作,这些command可以在Linux终端使用. 在编程时,比如使用Redis 的Java语言包,这些命令都有对应的方法.下面将 ...

  7. LeetCode(121) Best Time to Buy and Sell Stock

    题目 Say you have an array for which the ith element is the price of a given stock on day i. If you we ...

  8. Verilog学习笔记基本语法篇(五)········ 条件语句

    条件语句可以分为if_else语句和case语句两张部分. A)if_else语句 三种表达形式 1) if(表达式)          2)if(表达式)               3)if(表达 ...

  9. HDU 2829 斜率优化DP Lawrence

    题意:n个数之间放m个障碍,分隔成m+1段.对于每段两两数相乘再求和,然后把这m+1个值加起来,让这个值最小. 设: d(i, j)表示前i个数之间放j个炸弹能得到的最小值 sum(i)为前缀和,co ...

  10. python基础学习笔记——列表技巧

    列表: 循环删除列表中的每⼀个元素 li = [, , , ] for e in li: li.remove(e) print(li) 结果: [, ] 分析原因: for的运⾏过程. 会有⼀个指针来 ...