遍历文档树

还拿”爱丽丝梦游仙境”的文档来做例子:

html_doc = """

<html><head><title>The Dormouse's story</title></head>

    <body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

通过这段例子来演示怎样从文档的一段内容找到另一段内容

子节点

一个Tag可能包含多个字符串或其它的Tag,这些都是这个Tag的子节点.Beautiful Soup提供了许多操作和遍历子节点的属性.

注意: Beautiful Soup中字符串节点不支持这些属性,因为字符串没有子节点

tag的名字

操作文档树最简单的方法就是告诉它你想获取的tag的name.如果想获取 <head> 标签,只要用 soup.head :

soup.head

# <head><title>The Dormouse's story</title></head>

soup.title

# <title>The Dormouse's story</title>

这是个获取tag的小窍门,可以在文档树的tag中多次调用这个方法.下面的代码可以获取<body>标签中的第一个标签:

soup.body.b

# <b>The Dormouse's story</b>

通过点取属性的方式只能获得当前名字的第一个tag:

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

如果想要得到所有的<a>标签,或是通过名字得到比一个tag更多的内容的时候,就需要用到 Searching the tree 中描述的方法,比如: find_all()

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents 和 .children

tag的 .contents 属性可以将tag的子节点以列表的方式输出:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

[<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# [u'The Dormouse's story']

BeautifulSoup 对象本身一定会包含子节点,也就是说<html>标签也是 BeautifulSoup 对象的子节点:

len(soup.contents)

# 1

soup.contents[0].name

# u'html'

字符串没有 .contents 属性,因为字符串没有子节点:

text = title_tag.contents[0]

text.contents

# AttributeError: 'NavigableString' object has no attribute 'contents'

通过tag的 .children 生成器,可以对tag的子节点进行循环:

for child in title_tag.children:

    print(child)

    # The Dormouse's story

.descendants

.contents 和 .children 属性仅包含tag的直接子节点.例如,<head>标签只有一个直接子节点<title>

head_tag.contents

# [<title>The Dormouse's story</title>]

但是<title>标签也包含一个子节点:字符串 “The Dormouse’s story”,这种情况下字符串 “The Dormouse’s story”也属于<head>标签的子孙节点. .descendants 属性可以对所有tag的子孙节点进行递归循环 [5] :

for child in head_tag.descendants:

    print(child)

    # <title>The Dormouse's story</title>

    # The Dormouse's story

上面的例子中, <head>标签只有一个子节点,但是有2个子孙节点:<head>节点和<head>的子节点, BeautifulSoup 有一个直接子节点(<html>节点),却有很多子孙节点:

len(list(soup.children))

# 1

len(list(soup.descendants))

# 25

.string

如果tag只有一个 NavigableString 类型子节点,那么这个tag可以使用 .string 得到子节点:

title_tag.string

# u'The Dormouse's story'

如果一个tag仅有一个子节点,那么这个tag也可以使用 .string 方法,输出结果与当前唯一子节点的 .string 结果相同:

head_tag.contents

# [<title>The Dormouse's story</title>]

head_tag.string

# u'The Dormouse's story'

如果tag包含了多个子节点,tag就无法确定 .string 方法应该调用哪个子节点的内容, .string 的输出结果是 None :

print(soup.html.string)

# None

.strings 和 stripped_strings

如果tag中包含多个字符串 [2] ,可以使用 .strings 来循环获取:

for string in soup.strings:

    print(repr(string))

    # u"The Dormouse's story"

    # u'\n\n'

    # u"The Dormouse's story"

    # u'\n\n'

    # u'Once upon a time there were three little sisters; and their names were\n'

    # u'Elsie'

    # u',\n'

    # u'Lacie'

    # u' and\n'

    # u'Tillie'

    # u';\nand they lived at the bottom of a well.'

    # u'\n\n'

    # u'...'

    # u'\n'

输出的字符串中可能包含了很多空格或空行,使用 .stripped_strings 可以去除多余空白内容:

for string in soup.stripped_strings:

    print(repr(string))

    # u"The Dormouse's story"

    # u"The Dormouse's story"

    # u'Once upon a time there were three little sisters; and their names were'

    # u'Elsie'

    # u','

    # u'Lacie'

    # u'and'

    # u'Tillie'

    # u';\nand they lived at the bottom of a well.'

    # u'...'

全部是空格的行会被忽略掉,段首和段末的空白会被删除

父节点

继续分析文档树,每个tag或字符串都有父节点:被包含在某个tag中

.parent

通过 .parent 属性来获取某个元素的父节点.在例子“爱丽丝”的文档中,<head>标签是<title>标签的父节点:

title_tag = soup.title

title_tag

# <title>The Dormouse's story</title>

title_tag.parent

# <head><title>The Dormouse's story</title></head>

文档title的字符串也有父节点:<title>标签

title_tag.string.parent

# <title>The Dormouse's story</title>

文档的顶层节点比如<html>的父节点是 BeautifulSoup 对象:

html_tag = soup.html

type(html_tag.parent)

# <class 'bs4.BeautifulSoup'>

BeautifulSoup 对象的 .parent 是None:

print(soup.parent)

# None

.parents

通过元素的 .parents 属性可以递归得到元素的所有父辈节点,下面的例子使用了 .parents 方法遍历了<a>标签到根节点的所有节点.

link = soup.a

link

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

for parent in link.parents:

    if parent is None:

        print(parent)

    else:

        print(parent.name)

# p

# body

# html

# [document]

# None

兄弟节点

看一段简单的例子:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")

print(sibling_soup.prettify())

# <html>

#  <body>

#   <a>

#    <b>

#     text1

#    </b>

#    <c>

#     text2

#    </c>

#   </a>

#  </body>

# </html>

因为标签和<c>标签是同一层:他们是同一个元素的子节点,所以和<c>可以被称为兄弟节点.一段文档以标准格式输出时,兄弟节点有相同的缩进级别.在代码中也可以使用这种关系.

.next_sibling 和 .previous_sibling

在文档树中,使用 .next_sibling 和 .previous_sibling 属性来查询兄弟节点:

sibling_soup.b.next_sibling

# <c>text2</c>

sibling_soup.c.previous_sibling

# <b>text1</b>

标签有 .next_sibling 属性,但是没有 .previous_sibling 属性,因为标签在同级节点中是第一个.同理,<c>标签有 .previous_sibling 属性,却没有 .next_sibling 属性:

print(sibling_soup.b.previous_sibling)

# None

print(sibling_soup.c.next_sibling)

# None

例子中的字符串“text1”和“text2”不是兄弟节点,因为它们的父节点不同:

sibling_soup.b.string

# u'text1'

print(sibling_soup.b.string.next_sibling)

# None

实际文档中的tag的 .next_sibling 和 .previous_sibling 属性通常是字符串或空白. 看看“爱丽丝”文档:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

如果以为第一个<a>标签的 .next_sibling 结果是第二个<a>标签,那就错了,真实结果是第一个<a>标签和第二个<a>标签之间的顿号和换行符:

link = soup.a

link

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling

# u',\n'

第二个<a>标签是顿号的 .next_sibling 属性:

link.next_sibling.next_sibling

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings 和 .previous_siblings

通过 .next_siblings 和 .previous_siblings 属性可以对当前节点的兄弟节点迭代输出:

for sibling in soup.a.next_siblings:

    print(repr(sibling))

    # u',\n'

    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

    # u' and\n'

    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    # u'; and they lived at the bottom of a well.'

    # None

for sibling in soup.find(id="link3").previous_siblings:

    print(repr(sibling))

    # ' and\n'

    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

    # u',\n'

    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    # u'Once upon a time there were three little sisters; and their names were\n'

    # None

回退和前进

看一下“爱丽丝” 文档:

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

HTML解析器把这段字符串转换成一连串的事件: “打开<html>标签”,”打开一个<head>标签”,”打开一个<title>标签”,”添加一段字符串”,”关闭<title>标签”,”打开标签”,等等.Beautiful Soup提供了重现解析器初始化过程的方法.

.next_element 和 .previous_element

.next_element 属性指向解析过程中下一个被解析的对象(字符串或tag),结果可能与 .next_sibling 相同,但通常是不一样的.

这是“爱丽丝”文档中最后一个<a>标签,它的 .next_sibling 结果是一个字符串,因为当前的解析过程 [2]因为当前的解析过程因为遇到了<a>标签而中断了:

last_a_tag = soup.find("a", id="link3")

last_a_tag

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling

# '; and they lived at the bottom of a well.'

但这个<a>标签的 .next_element 属性结果是在<a>标签被解析之后的解析内容,不是<a>标签后的句子部分,应该是字符串”Tillie”:

last_a_tag.next_element

# u'Tillie'

这是因为在原始文档中,字符串“Tillie” 在分号前出现,解析器先进入<a>标签,然后是字符串“Tillie”,然后关闭</a>标签,然后是分号和剩余部分.分号与<a>标签在同一层级,但是字符串“Tillie”会被先解析.

.previous_element 属性刚好与 .next_element 相反,它指向当前被解析的对象的前一个解析对象:

last_a_tag.previous_element

# u' and\n'

last_a_tag.previous_element.next_element

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements 和 .previous_elements

通过 .next_elements 和 .previous_elements 的迭代器就可以向前或向后访问文档的解析内容,就好像文档正在被解析一样:

for element in last_a_tag.next_elements:

    print(repr(element))

# u'Tillie'

# u';\nand they lived at the bottom of a well.'

# u'\n\n'

# <p class="story">...</p>

# u'...'

# u'\n'

# None

bs4--官文--遍历文档树的更多相关文章

使用requests爬取梨视频、bilibili视频、汽车之家，bs4遍历文档树、搜索文档树，css选择器
今日内容概要使用requests爬取梨视频 requests+bs4爬取汽车之家 bs4遍历文档树 bs4搜索文档树 css选择器内容详细 1.使用requests爬取梨视频 # 模拟发送http ...
使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解（新手必学）
为大家介绍下Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的详细方法与函数下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最 ...
bs4--官文--搜索文档树
搜索文档树 Beautiful Soup定义了很多搜索方法,这里着重介绍2个: find() 和 find_all() .其它方法的参数和用法类似,请读者举一反三. 再以“爱丽丝”文档作为例子: ht ...
bs4--官文--修改文档树
修改文档树 Beautiful Soup的强项是文档树的搜索,但同时也可以方便的修改文档树修改tag的名称和属性在 Attributes 的章节中已经介绍过这个功能,但是再看一遍也无妨. 重命名一 ...
jsonp 遍历文档
遍历文档将html解析成一个Document后,就可以使用类似Dom的方法进行操作 File input = new File("/tmp/input.html"); Docum ...
Python爬虫系列（六）：搜索文档树
今天早上,写的东西掉了.这个烂知乎,有bug,说了自动保存草稿,其实并没有保存.无语今晚,我们将继续讨论如何分析html文档. 1.字符串 #直接找元素soup.find_all('b') 2.正则 ...
一文搞懂B树、B-树、B+树
前言 B树和B-树是同一种数据结构,如果不清楚的话,会被面试官忽悠,所以本文介绍两种数据结构,B树和B+树,废话不多数咱们开干. B树介绍在计算机科学中,B树是一种自平衡的树,能够保持数据有序.这 ...
老李推荐：第14章9节《MonkeyRunner源码剖析》 HierarchyViewer实现原理-遍历控件树查找控件
老李推荐:第14章9节<MonkeyRunner源码剖析> HierarchyViewer实现原理-遍历控件树查找控件 poptest是国内唯一一家培养测试开发工程师的培训机构,以学员 ...
childNodes遍历DOM节点树
childNodes遍历DOM节点树 var s = ""; function travel(space,node) { if(node.tagName){ s += space ...

随机推荐

BZOJ1102（搜索）
随便写一下的搜索,别的OJ深搜就过了,强大的BZOJ成功栈溢出RE了我并使我屈服地用广搜过掉,第一行手动开栈惨遭无视. 广搜: #pragma comment(linker, "/STACK ...
bryce1010专题训练——CDQ分治
Bryce1010模板 CDQ分治 1.与普通分治的区别普通分治中,每一个子问题只解决它本身(可以说是封闭的) 分治中,对于划分出来的两个子问题,前一个子问题用来解决后一个子问题而不是它本身 2.试 ...
CentOS 部署RabbitMQ集群
1. 准备两台CentOS,信息如下: node1:10.0.0.123 node2:10.0.0.124 修改hostname请参照: $ hostname # 查看当前的hostname $ ho ...
ubuntu dpkg命令总结
dpkg是Debian系统的后台包管理器,类似RPM.也是Debian包管理系统的中流砥柱,负责安全卸载软件包,配置,以及维护已安装的软件包.由于ubuntu和Debian乃一脉相承,所以很多命令是不 ...
mac 终端查看端口命令
查看端口所在线程 lsof -i:8080 mac-abeen:spider abeen$ lsof -i:8080 COMMAND PID USER FD TYPE DEVICE SIZE/OFF ...
ubuntu键盘映射
在sublime下开发习惯把CapsLock和Shift间交换,windows下有很多软件可以修改键盘映射,在ubuntu下可以是哦用xmodmap命令,使用方法如下: 在自己用户的home目录下新建 ...
关于sqlserver帐号被禁用问题
若发现sqlsrver所有帐号不小心被禁用了,这个时候怎么办?用重装吗?不用,仔细看小白是怎么一步一步解开这个谜题的.首先需要Windows帐号设置里重新添加一个新帐号.并将其添加到管理员组里面,然后 ...
java字符串拼接技巧(StringBuilder使用技巧)
在平时的开发中,我们可能会遇到需要拼接如下格式的字符串(至少我是遇到了很多次): 1,2,3,4,5,6,7,8,9,10,11,12,12,12,12,34,234,2134,1234,1324,1 ...
洛谷 P2253 好一个一中腰鼓！
题目背景话说我大一中的运动会就要来了,据本班同学剧透(其实早就知道了),我萌萌的初二年将要表演腰鼓[喷],这个无厘头的题目便由此而来. Ivan乱入:“忽一人大呼:‘好一个安塞腰鼓!’满座寂然,无敢 ...
将从SQL2008 r2里备份的数据库还原到SQL2008中
从标题可以看出这是未解决上一篇遗留问题写的,现在我也不知道这个可不可以成功,方法似乎查到了一种,具体怎样还不清楚:而且,我想说的是“我踩雷了”. 这篇的主角是“Database Publishing ...

bs4--官文--遍历文档树