Python Web Scraping for Beginners (23): Getting Started with the pyquery Parsing Library

from pyquery import PyQuery

html = '''

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

'''

d = PyQuery(html)

print(d('p'))

The output is:

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

The example above initializes PyQuery from a string. It can also be initialized directly from a URL:

d_url = PyQuery(url='https://www.geekdigging.com/', encoding='UTF-8')

print(d_url('title'))

The output is:

<title>极客挖掘机</title>

Written this way, PyQuery first requests the URL and then initializes itself with the HTML from the response. It is equivalent to the following:

import requests

r = requests.get('https://www.geekdigging.com/')

r.encoding = 'UTF-8'

d_requests = PyQuery(r.text)

print(d_requests('title'))

CSS Selectors

Let's start with a quick taste of CSS selectors, which are simple and convenient to use:

d_css = PyQuery(html)

print(d_css('.story .sister'))

print(type(d_css('.story .sister')))

The output is:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

<class 'pyquery.pyquery.PyQuery'>

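This selector first finds the nodes whose class is story, and then searches all of their descendants (not only the direct children) for nodes whose class is sister.

The last line of the output shows that the type is still pyquery.pyquery.PyQuery, which means we can keep querying the result.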

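Finding Nodes

Next come the commonly used lookup functions. The best thing about them is that they are used the same way as their jQuery counterparts.

find(): finds all descendant nodes of a node.
children(): finds direct child nodes only.
parent(): finds the parent node.
parents(): finds ancestor nodes.
siblings(): finds sibling nodes.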

Here are a few simple examples:

# Find descendant nodes

items = d('body')

print('Descendant nodes:', items.find('p'))

print(type(items.find('p')))

# Find the parent node

items = d('#link1')

print('Parent node:', items.parent())

print(type(items.parent()))

# Find sibling nodes

items = d('#link1')

print('Sibling nodes:', items.siblings())

print(type(items.siblings()))

The output is:

Descendant nodes: <p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

<class 'pyquery.pyquery.PyQuery'>

Parent node: <p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<class 'pyquery.pyquery.PyQuery'>

Sibling nodes: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

<class 'pyquery.pyquery.PyQuery'>

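Traversal

As the examples above show, a pyquery selection that matches several nodes is still a single PyQuery object. Unlike the results in Beautiful Soup, iterating over it directly does not give you objects you can keep querying in the same way: you get the underlying lxml elements instead. To work with each matched node as a PyQuery object, iterate over the result of items():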

a = d('a')

for item in a.items():

    print(item)

The output is:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

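Calling items() returns a generator. Iterating over it yields the a nodes one at a time, and each of them is itself a PyQuery object. Every a node can in turn call the methods described earlier, for example to look up its children or to find one of its ancestors, which makes this very flexible.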

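Extracting Information

Once we have the nodes, the next step is to extract the information we need.

This falls into two parts: getting a node's text and getting a node's attributes.

Getting Text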

a_1 = d('#link1')

print(a_1.text())

The output is:

Elsie

To get the HTML inside a node, use the html() method:

a_2 = d('.story')

print(a_2.html())

The output is:

Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

Getting Attributes

Once we have a node, we can use attr() to read its attributes:

attr_1 = d('#link1')

print(attr_1.attr('href'))

The output is:

http://example.com/elsie

Besides the attr() method, pyquery also provides an attr property, so the example above can also be written like this:

print(attr_1.attr.href)

The result is the same as in the previous example.

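Summary

That wraps up the parsers we installed in the setup article. Comparing them: Beautiful Soup is friendly to beginners and can be picked up without much background knowledge, although parsing a complex DOM still calls for some familiarity with CSS selectors. If you are fluent in XPath, using lxml directly is the most convenient option. And if, like me, you are comfortable with both jQuery and CSS selectors, pyquery is a very good choice.

Next, I plan to share a few simple hands-on projects, so stay tuned!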

Sample Code

All the code for this series is kept in repositories on GitHub and Gitee so that it is easy to grab.

Sample code - GitHub

Sample code - Gitee