BeautifulSoup 是一个非常优秀的Python扩展库,可以用来从HTML或XML文件中提取我们感兴趣的数据,并且允许指定使用不同的解析器。

  使用 pip install BeaufifulSoup4 直接进行模块的安装。安装之后应使用 from bs4 import BeautifulSoup 导入并使用。

  下面简单演示下BeautifulSoup4的功能,更加详细完整的学习资料请参考 https://www.crummy.com/software/BeautifulSoup/bs4/doc/。

 >>> from bs4 import BeautifulSoup
>>>
>>> #自动添加和补全标签
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> #自定义一个html文档内容
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <p class="story">...</p>
"""
>>>
>>> #解析这段html文档内容,以优雅的方式展示出来
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>> #访问特定标签
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> #标签名字
>>> soup.title.name
'title'
>>>
>>> #标签文本
>>> soup.title.text
"The Dormouse's story"
>>>
>>> #title标签的上一级标签
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> #把整个BeautifulSoup对象看作标签对象
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> #标签属性
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class') #也可以这样查看标签属性
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> #查看a标签所有属性
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> #查找所有a标签
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> #同时查找<a>和<b>标签
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> #查找href包含特定关键字的标签
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
print(link.text,':',link.get('href')) Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text()) #返回所有文本 The Dormouse's story The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
... >>>
>>> #修改标签属性
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> #修改标签文本
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">
test_Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>>
>>> #遍历子标签
>>> for child in soup.body.children:
print(child) <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <p class="story">...</p> >>>

9.3.4 BeaufitulSoup4的更多相关文章

随机推荐

  1. 通过adb push 从电脑里复制文件到手机里

    在开发中.我们 常常 须要 从 电脑 拷贝 一些 文件 到  自己的 app 目录里. 我之前的 方式 是 将 手机 链接 电脑,然后 从 电脑里 找到 app 文件夹,然后 进行 拷贝. 可是 这种 ...

  2. go语言笔记——包的概念本质上和java是一样的,通过大小写来区分private,fmt的Printf不就是嘛!

    示例 4.1 hello_world.go package main import "fmt" func main() { fmt.Println("hello, wor ...

  3. .NET 的WebSocket开发包详细比较(2)

    AlchemyWebSocket http://alchemywebsockets.net/ 当我想到websocket库时,这个让人不可思议.没错这是真的.它可以排在Fleck后面,它非常容易使用, ...

  4. 【POJ 2054】 Color a Tree

    [题目链接] http://poj.org/problem?id=2054 [算法] 贪心 [代码] #include <algorithm> #include <bitset> ...

  5. LuoguP3261 [JLOI2015]城池攻占

    题目描述 小铭铭最近获得了一副新的桌游,游戏中需要用 m 个骑士攻占 n 个城池.这 n 个城池用 1 到 n 的整数表示.除 1 号城池外,城池 i 会受到另一座城池 fi 的管辖,其中 fi &l ...

  6. http协议的MP4文件播放问题的分析

    现在手上有两个链接 (1) http://202.108.16.173/cctv/video/8C/35/EB/E8/8C35EBE84E7B483C8741CF9A60154993/gphone/4 ...

  7. windows 多mysql 实例

  8. Graphics.DrawMeshInstanced

    Draw the same mesh multiple times using GPU instancing. 可以免去创建和管理gameObj的开销 并不是立即绘制,如需:Graphics.Draw ...

  9. P4046 [JSOI2010]快递服务

    传送门 很容易想出\(O(n^3m)\)的方程,三维分别表示某个快递员现在在哪里,然后直接递推即可 然而这样会T,考虑怎么优化.我们发现每一天的时候都有一个快递员的位置是确定的,即在前一天要到的位置. ...

  10. Java Itext 生成PDF文件

    利用Java Itext生成PDF文件并导出,实现效果如下: PDFUtil.java package com.jeeplus.modules.order.util; import java.io.O ...