PyQuery 是 Python 仿照 jQuery 的严格实现。语法与 jQuery 几乎完全相同。

官方文档:http://pyquery.readthedocs.io/

安装

1
pip install pyquery

初始化

字符串初始化

1
2
3
4
5
6
7
8
9
10
11
12
13
14
html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
 

URL初始化

1
2
3
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))

文件初始化

1
2
3
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

基本CSS选择器

1
2
3
4
5
6
7
8
9
10
11
12
13
14
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))

查找元素

子元素

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
lis = items.children()
print(type(lis))
print(lis)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
lis = items.children('.active')
print(lis)

父元素

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)
1
2
parent = items.parents('.wrap')
print(parent)

兄弟元素

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

遍历

单个元素

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)

获取信息

获取属性

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
= doc('.item-0.active a')
print(a)
print(a.attr('href'))
print(a.attr.href)

获取文本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
= doc('.item-0.active a')
print(a)
print(a.text())

获取HTML

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

DOM操作

addClass、removeClass

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

attr、css

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name''link')
print(li)
li.css('font-size''14px')
print(li)

remove

1
2
3
4
5
6
7
8
9
10
11
12
html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())

其他DOM方法 http://pyquery.readthedocs.io/en/latest/api.html

伪类选择器

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)

常见的爬虫分析库(4)-爬虫之PyQuery的更多相关文章

  1. 常见的爬虫分析库(3)-Python正则表达式与re模块

    在线正则表达式测试 http://tool.oschina.net/regex/ 常见匹配模式 模式 描述 \w 匹配字母数字及下划线 \W 匹配非字母数字下划线 \s 匹配任意空白字符,等价于 [\ ...

  2. 常见的爬虫分析库(2)-xpath语法

    xpath简介 1.xpath使用路径表达式在xml和html中进行导航 2.xpath包含标准函数库 3.xpath是一个w3c的标准 xpath节点关系 1.父节点 2.子节点 3.同胞节点 4. ...

  3. 常见的爬虫分析库(1)-Python3中Urllib库基本使用

    原文来自:https://www.cnblogs.com/0bug/p/8893677.html 什么是Urllib? Python内置的HTTP请求库 urllib.request          ...

  4. 爬虫requests库 之爬虫贴吧

    首先要观察爬虫的URL规律,爬取一个贴吧所有页的数据,观察点击下一页时URL是如何变化的. 思路: 定义一个类,初始化方法什么都不用管 定义一个run方法,用来实现主要逻辑 3 class Tieba ...

  5. 2.爬虫 urlib库讲解 异常处理、URL解析、分析Robots协议

    1.异常处理 URLError类来自urllib库的error模块,它继承自OSError类,是error异常模块的基类,由request模块产生的异常都可以通过这个类来处理. from urllib ...

  6. Python爬虫Urllib库的基本使用

    Python爬虫Urllib库的基本使用 深入理解urllib.urllib2及requests  请访问: http://www.mamicode.com/info-detail-1224080.h ...

  7. Python爬虫—requests库get和post方法使用

    目录 Python爬虫-requests库get和post方法使用 1. 安装requests库 2.requests.get()方法使用 3.requests.post()方法使用-构造formda ...

  8. [python爬虫]Requests-BeautifulSoup-Re库方案--Requests库介绍

    [根据北京理工大学嵩天老师“Python网络爬虫与信息提取”慕课课程编写  文章中部分图片来自老师PPT 慕课链接:https://www.icourse163.org/learn/BIT-10018 ...

  9. Python爬虫与数据分析之爬虫技能:urlib库、xpath选择器、正则表达式

    专栏目录: Python爬虫与数据分析之python教学视频.python源码分享,python Python爬虫与数据分析之基础教程:Python的语法.字典.元组.列表 Python爬虫与数据分析 ...

随机推荐

  1. python,获取用户输入,并且将输入的内容写到.txt

    该功能缺点是必须保证该文件不存在的情况才会成功 f=open('E:/mywork/保存文件.txt','x') def userwrite(code): if code=='w': f.close( ...

  2. Spring Boot:如何配置静态资源的地址与访问路径

    spring.resources.static-locations=classpath:/static,classpath:/public,classpath:/resources,classpath ...

  3. Kivy折腾笔记

    最近想用Python开发APP,选择kivy,记录过程 首先是源码安装,各种蛋疼的报错放弃了.cython高版本有问题. python3 -m pip install cython==0.23 pyt ...

  4. FAT文件系统规范v1.03学习笔记---1.保留区之 Fat32 FSInfo扇区结构和备份启动扇区

    1.前言 本文主要是对Microsoft Extensible Firmware Initiative FAT32 File System Specification中文翻译版的学习笔记. 每个FAT ...

  5. webstorm设置VCS:版本控制顶部按钮

    说明: 每次都在这坑一下,浪费时间,百度只指出在哪,并没有说怎么调出来 我用的版本是10,点击下面的选项按操作设置就可以了 红色箭头:从服务器获取最新代码: 绿色箭头:提交: 白色箭头:撤销

  6. python ctypes

    official tutorial for ctypes libhttps://docs.python.org/3/library/ctypes.html 1 ctypes exports the c ...

  7. ASP.NET如何下载大文件

    关于此代码的几点说明: 1. 将数据分成较小的部分,然后将其移动到输出流以供下载,从而获取这些数据. 2. 根据下载的文件类型来指定 Response.ContentType .(参考OSChina的 ...

  8. Git操作----删除untracked files

    # 删除 untracked files git clean -f # 连 untracked 的目录也一起删掉 git clean -fd # 连 gitignore 的untrack 文件/目录也 ...

  9. java结合testng,利用excel做数据源的数据驱动实例

    数据驱动部分,是自动化测试常用部分,也是参数化设计的重要环节,前面分享了,mysql.yaml做数据源,那么再来分享下excel做数据驱动 思路: 先用POI读取excel.解析读取数据,返回list ...

  10. JS知识点随笔

    1.为什么 0.1 + 0.2 != 0.3? 原因: 因为 JS 采用 IEEE 754 双精度版本(64位),并且只要采用 IEEE 754 的语言都有该问题. 我们都知道计算机是通过二进制来存储 ...