BeautifulSoup is a third-party Python library for extracting the data you are interested in from HTML or XML files, and it lets you choose which parser to use.

  Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
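
  The parser is chosen by the second argument to BeautifulSoup. The following is a minimal sketch of that choice, not a full comparison: the fragment string is made up for illustration, html.parser ships with Python, and lxml is assumed to be an optional extra that may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical input for illustration: both tags are left unclosed on purpose.
fragment = "<p>hello <b>world"

# html.parser is part of the Python standard library, so it is always available.
builtin = BeautifulSoup(fragment, "html.parser")
print(builtin.p.get_text())

# lxml is a separate third-party package; bs4 raises FeatureNotFound
# when the requested parser is not installed.
try:
    with_lxml = BeautifulSoup(fragment, "lxml")
    print(with_lxml.p.get_text())
except FeatureNotFound:
    print("lxml is not installed; only html.parser was used")
```

  Different parsers may repair incomplete markup differently, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.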

  The session below briefly demonstrates what BeautifulSoup4 can do. For more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

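>>> from bs4 import BeautifulSoup
>>>
>>> # Missing tags are added and completed automatically (this call requires the lxml package)
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> # Define some HTML document content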
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>>
>>> # Parse this HTML document and display it in a neatly indented form
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>> # Access a specific tag
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> # Tag name
>>> soup.title.name
'title'
>>>
>>> # Tag text
>>> soup.title.text
"The Dormouse's story"
>>>
>>> # Parent tag of the title tag
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> # Treat the whole BeautifulSoup object as a tag object
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> # Tag attributes
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class') # attributes can also be read this way
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> # View all attributes of the a tag
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> # Find all a tags
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> # Find <a> and <b> tags at the same time
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> # Find tags whose href contains a given keyword
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
...     print(link.text,':',link.get('href'))
...
Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text()) # return all text
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...
>>>
>>> # Modify a tag attribute
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> # Modify the tag text
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>>
>>> # Iterate over child nodes (the newline strings between tags are children too)
>>> for child in soup.body.children:
...     print(child)
...


<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


>>>
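
The interactive steps above for finding a tags and reading their text and href can be collected into a small script. This is only a sketch under stated assumptions: extract_links is a hypothetical helper name, not part of bs4, and the sample string is a shortened copy of the document used in the session.

```python
from bs4 import BeautifulSoup


def extract_links(html, parser="html.parser"):
    """Return (text, href) pairs for every <a> tag that has an href.

    extract_links is a hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, parser)
    return [
        (a.get_text(), a.get("href"))
        for a in soup.find_all("a")
        if a.get("href")  # skip anchors without an href attribute
    ]


# Shortened sample based on the document used in the session.
sample = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

links = extract_links(sample)
for text, href in links:
    print(text, ":", href)
```

Using a.get("href") rather than a["href"] means an anchor with no href yields None instead of raising KeyError, so such tags can simply be filtered out.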
