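9.3.4 BeautifulSoup4
BeautifulSoup is an excellent third-party Python library for extracting the data we are interested in from HTML or XML files, and it lets you choose which parser to use.
Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
The following session briefly demonstrates what BeautifulSoup4 can do; for more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.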
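Since the parser is chosen by the second argument, it can help to see how the same input comes out under different parsers. The following is only a minimal sketch: the helper name parse_with is made up for illustration, and it assumes that lxml and html5lib may or may not be installed, so a missing parser is reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Parse markup with the named parser and return the result as a string.

    parse_with is a hypothetical helper for this sketch; it returns None
    when the requested parser library is not installed.
    """
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# html.parser ships with Python; lxml and html5lib are optional extras.
for parser in ("html.parser", "lxml", "html5lib"):
    result = parse_with("hello world", parser)
    print(parser, "->", result if result is not None else "(not installed)")
```

The built-in html.parser leaves the bare text as it is, whereas lxml wraps it in html, body and p tags, which is why the results of a session can differ depending on the parser passed in.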
>>> from bs4 import BeautifulSoup
>>>
>>> # Automatically add and complete missing tags (the 'lxml' parser requires the lxml package)
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> # Define some HTML document content
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>>
>>> # Parse this HTML document and display it in a nicely formatted way
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>> # Access a specific tag
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> # Tag name
>>> soup.title.name
'title'
>>>
>>> # Tag text
>>> soup.title.text
"The Dormouse's story"
>>>
>>> # Parent tag of the title tag
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> # Treat the whole BeautifulSoup object as a tag object
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> # Tag attributes
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class') # Tag attributes can also be read this way
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> # View all attributes of the a tag
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> # Find all a tags
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> # Find <a> and <b> tags at the same time
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> # Find tags whose href contains a given keyword
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
    print(link.text,':',link.get('href'))

Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text()) # Return all the text
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...
>>>
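In output such as "Lacieand", the text of adjacent nodes runs together because get_text() simply concatenates the strings. A small sketch of one way around this is shown here, using the separator and strip arguments of get_text(); the markup is invented for the illustration and is not part of the document above.

```python
from bs4 import BeautifulSoup

# Made-up fragment in which a tag is followed directly by text, with no space.
fragment = "<p>Hello <b>big</b>world</p>"
soup_demo = BeautifulSoup(fragment, "html.parser")

# Default behaviour: the strings are concatenated exactly as they appear.
joined = soup_demo.get_text()

# With a separator and strip=True, each string is stripped of surrounding
# whitespace and the pieces are joined with a single space.
spaced = soup_demo.get_text(" ", strip=True)

print(repr(joined))
print(repr(spaced))
```

Whether a separator is appropriate depends on the page: it also inserts a space before punctuation that follows a tag, so it suits keyword extraction better than faithful reproduction of the prose.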
>>> # Modify a tag attribute
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> # Modify the tag text
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">
test_Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>>
>>> # Iterate over the direct children
>>> for child in soup.body.children:
    print(child)

<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
>>>
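When iterating over .children, the newline strings between tags are returned as children too, which is easy to overlook in printed output. The sketch below, which uses a small made-up document rather than html_doc, shows one possible way to keep only the tag children by checking each child against the Tag class.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up document with newlines between the child tags of <body>.
markup = "<body>\n<p class='title'>A</p>\n<p class='story'>B</p>\n</body>"
soup_children = BeautifulSoup(markup, "html.parser")

# .children yields every direct child, including whitespace-only strings.
all_children = list(soup_children.body.children)

# Keep only Tag objects to skip the text nodes between the tags.
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children), "children in total")
print([child.name for child in tag_children])
```

Calling soup.body.find_all(True, recursive=False) is another way to get the direct child tags only; the isinstance check is used here because it keeps the loop form of the session above.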