9.3.4 BeaufitulSoup4
BeautifulSoup 是一个非常优秀的Python扩展库,可以用来从HTML或XML文件中提取我们感兴趣的数据,并且允许指定使用不同的解析器。
使用 pip install BeaufifulSoup4 直接进行模块的安装。安装之后应使用 from bs4 import BeautifulSoup 导入并使用。
下面简单演示下BeautifulSoup4的功能,更加详细完整的学习资料请参考 https://www.crummy.com/software/BeautifulSoup/bs4/doc/。
>>> from bs4 import BeautifulSoup
>>>
>>> #自动添加和补全标签
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> #自定义一个html文档内容
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <p class="story">...</p>
"""
>>>
>>> #解析这段html文档内容,以优雅的方式展示出来
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>> #访问特定标签
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> #标签名字
>>> soup.title.name
'title'
>>>
>>> #标签文本
>>> soup.title.text
"The Dormouse's story"
>>>
>>> #title标签的上一级标签
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> #把整个BeautifulSoup对象看作标签对象
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> #标签属性
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class') #也可以这样查看标签属性
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> #查看a标签所有属性
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> #查找所有a标签
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> #同时查找<a>和<b>标签
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> #查找href包含特定关键字的标签
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
print(link.text,':',link.get('href')) Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text()) #返回所有文本 The Dormouse's story The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
... >>>
>>> #修改标签属性
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> #修改标签文本
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">
test_Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>>
>>> #遍历子标签
>>> for child in soup.body.children:
print(child) <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <p class="story">...</p> >>>
9.3.4 BeaufitulSoup4的更多相关文章
随机推荐
- 频繁模式挖掘 Apriori算法 FP-tree
啤酒 尿布 组合营销 X=>Y,其中x属于项集I,Y属于项集I,且X.Y的交集等于空集. 2类算法 Apriori算法 不断地构造候选集.筛选候选集来挖掘出频繁项集,需要多次扫描原始数据.磁盘I ...
- bzoj2179
fft裸题 我还没有背下来fft #include<bits/stdc++.h> #define pi acos(-1) using namespace std; ; int n, m, ...
- JAVA基础(多线程Thread和Runnable的使用区别(转载)
转自:http://jinguo.iteye.com/blog/286772 Runnable是Thread的接口,在大多数情况下“推荐用接口的方式”生成线程,因为接口可以实现多继承,况且Runnab ...
- 常见的Java Script内存泄露原因及解决方案
前言 内存泄漏指由于疏忽或错误造成程序未能释放已经不再使用的内存.内存泄漏并非指内存在物理上的消失,而是应用程序分配某段内存后,由于设计错误,导致在释放该段内存之前就失去了对该段内存的控制,从而造成了 ...
- mahjong
题目描述 “为什么, 你们的力量在哪里得到如此地......”“我们比 1 分钟前的我们还要进步, 虽然很微小, 但每转一圈就会前进一寸.这就是钻头啊!”“那才是通向毁灭的道路.为什么就没有意识到螺旋 ...
- jmeter中对于各类时间格式的设置
最普通的设置为使用 函数助手中的__time, 设置好需要使用的类型,并设置接收参数即可 YMD = yyyyMMdd HMS = HHmmss YMDHMS = yyyyMMdd-HHmmss 第二 ...
- 51nod 1577 线性基
思路: http://blog.csdn.net/yxuanwkeith/article/details/53524757 //By SiriusRen #include <bits/stdc+ ...
- Codeforces 763A
乍看之下感觉有点无从下手,,其实是个很简单的水题,陷入僵局 题目大意:给一棵树,树上每个节点都染色,问能否取下一个节点,使得剩余所有子树上的点的颜色都相同.能输出YES和取下的节点编号,否则输出NO. ...
- 使用A*寻路小记
前几天做另一个DEMO 要用实现自动寻路功能,看到普遍都是A* 学习了下 我的主循环代码: isFindEndPoint = false; //主循环 do { CreateOutSkirtsNode ...
- Unity学习-地形的设置(五)
添加地形游戏对象 [Hierarchy-Create-Terrain] 为了看的看清楚,在添加一个平行光 [Hierarchy-Create-Direction light] 导入地形包 [Asset ...