BeautifulSoup(bs4)

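BeautifulSoup is a Python library whose main purpose is extracting data from web pages. The official documentation describes it like this: BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape, and because it is simple, a complete program takes very little code.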

Installing bs4

Install it with pip:

pip install beautifulsoup4

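The lxml parser

lxml is a high-performance Python library for processing XML and HTML documents. Compared with bs4 it offers more features and better performance, which is especially noticeable on large documents. lxml can be used together with bs4 (as its parser) or on its own.

Installing lxml

Install it with pip as well: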

pip install lxml

The rest of this article uses lxml together with bs4.

Browsing structured data with BeautifulSoup

.title: get the <title> tag

from bs4 import BeautifulSoup

html_doc = """....

"""

# Create a BeautifulSoup object using the lxml parser

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title)

#output-><title>The Dormouse's story</title>
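
The article elides the contents of html_doc. A document along the following lines would be consistent with the outputs shown here. Note that sample_doc is an assumed stand-in modelled on the familiar "three sisters" example, not the author's original text, and that the sketch uses Python's built-in html.parser so it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the elided html_doc (an assumption, not the original)
sample_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(sample_doc, "html.parser")
title_tag = soup.title

print(title_tag)         # the whole <title> element
print(title_tag.name)    # the tag name
print(title_tag.string)  # the text inside the tag
```

Swapping "html.parser" for 'lxml' should give the same results for a well-formed document like this one.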

.name: get the name of the document or of a tag

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.name)

print(soup.name)

#output->title

#[document]

.string/.text: get the text inside a tag

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.string)

print(soup.title.text)

#output->The Dormouse's story

#The Dormouse's story

.p: get the first <p> tag

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.p)

#output-><p class="title"><b>The Dormouse's story</b></p>

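.find_all(name, attrs={}): get all matching tags. name is the tag name, such as 'a' for <a> tags or 'p' for <p> tags; attrs={} is a dictionary that filters on attribute values, for example attrs={'class': 'story'}

# Create a BeautifulSoup object using the lxml parser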

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all('p'))

print(soup.find_all('p', attrs={'class': 'title'}))

.find(name, attrs={}): get the first element that matches the conditions

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find(id="link1"))

#output-><a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>

.parent: get the parent tag

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.parent)

#output-><head><title>The Dormouse's story</title></head>

.p['class']: get the value of the class attribute

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.p["class"])

#output->['title']
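
The list in the output above reflects how bs4 treats class as a multi-valued attribute. A minimal sketch of the difference between multi-valued and single-valued attributes, and of reading an attribute that may be missing, might look like this (the snippet markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the article's html_doc
snippet = '<p class="title lead" id="intro">Hello</p>'
soup = BeautifulSoup(snippet, "html.parser")
p_tag = soup.p

class_values = p_tag["class"]   # class is multi-valued, so this is a list
id_value = p_tag["id"]          # id is single-valued, so this is a string
missing = p_tag.get("data-x")   # .get() returns None instead of raising KeyError

print(class_values, id_value, missing)
```

Using .get() is usually the safer choice when scraping pages where an attribute is not guaranteed to exist.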

.get_text(): get all the text in the document

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.get_text())

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

...

Find the links of all <a> tags in the document

a_tags = soup.find_all('a')

for a_tag in a_tags:

    print(a_tag.get("href"))

#output->https://example.com/elsie

#https://example.com/lacie

#https://example.com/tillie
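
Real pages often mix absolute and relative links, and some <a> tags have no href at all. The loop above can be wrapped in a small helper that handles both cases. This is only a sketch: extract_links and the sample markup are hypothetical names introduced here, and urljoin comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only tags that actually carry an href attribute
    for a_tag in soup.find_all("a", href=True):
        links.append(urljoin(base_url, a_tag["href"]))
    return links


# Illustrative markup: one relative link, one anchor without href, one absolute link
page = (
    '<a href="/elsie">Elsie</a>'
    '<a name="anchor">no href</a>'
    '<a href="https://example.com/lacie">Lacie</a>'
)
links = extract_links(page, "https://example.com/")
print(links)
```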

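Kinds of objects in BeautifulSoup

When you parse an HTML or XML document with BeautifulSoup, it converts the whole document into a tree in which every node (tag, text, comment) is represented as a Python object

The BeautifulSoup tree structure

In an HTML document the root node is usually the <html> tag, and the remaining tags and text are its descendants

Consider the following HTML document: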

<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

    <body>

        <h1>The Dormouse's story</h1>

        <p>Once upon a time...</p>

    </body>

</html>

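After BeautifulSoup parses it, <html> is the root node; <head> and <body>, nested directly inside <html>, are its children. In the same way, <title> is a child of <head>, and <h1> and <p> are children of <body>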
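
To make the parent/child relationships described above visible, the tree can be printed as an indented outline. The following is a rough sketch: outline is a hypothetical helper, the document is written on one line so that whitespace text nodes do not clutter the tree, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><h1>The Dormouse's story</h1><p>Once upon a time...</p></body></html>"
)


def outline(node, depth=0):
    """Return one indented line per tag, walking direct children recursively."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(BeautifulSoup(doc, "html.parser"))
print("\n".join(tree_lines))
```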

Object types

BeautifulSoup has four main object types: Tag, NavigableString, BeautifulSoup, and Comment

Tag

A Tag object corresponds to a tag in the original HTML or XML document. Each Tag object can contain other tags, text, and attributes

soup = BeautifulSoup(html_doc, 'lxml')

tag = soup.title

print(type(tag))

#output-><class 'bs4.element.Tag'>

NavigableString

A NavigableString object represents the text inside a tag. It is an immutable string and can be obtained through a Tag object's .string

soup = BeautifulSoup(html_doc, 'lxml')

tag = soup.title

print(type(tag.string))

#output-> <class 'bs4.element.NavigableString'>

BeautifulSoup

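The BeautifulSoup object represents the document as a whole. It can be treated as a special Tag object, but it has no real tag name or attributes. It provides traversal, search, and modification over the entire document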

soup = BeautifulSoup(html_doc, 'lxml')

print(type(soup))

#output-> <class 'bs4.BeautifulSoup'>

Comment

A Comment object is a special type of NavigableString that represents a comment in HTML or XML

# Parse a snippet whose <b> tag contains only a comment

comment_soup = BeautifulSoup('<b><!--This is a comment--></b>', 'lxml')

print(type(comment_soup.b.string))

#output-> <class 'bs4.element.Comment'>
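
Because Comment is a subclass of NavigableString, code that distinguishes the object types needs to test for Comment first. A small sketch of that check follows; describe is a hypothetical helper and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def describe(node):
    """Name the bs4 object type of a node, checking the most specific type first."""
    if isinstance(node, Comment):          # must come before NavigableString
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return type(node).__name__


soup = BeautifulSoup("<p>text<!--note--><b>bold</b></p>", "html.parser")
kinds = [describe(child) for child in soup.p.children]
print(kinds)
```

This kind of check is handy when extracting text, since comments would otherwise be picked up as ordinary strings.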

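Traversing the document tree with BeautifulSoup

BeautifulSoup provides many ways to traverse the parsed document tree

Navigating to parent nodes

.parent and .parents: .parent returns the immediate parent of the current node, and .parents iterates over all of its ancestors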

soup = BeautifulSoup(html_doc, 'lxml')

title_tag = soup.title

print(title_tag.parent)

#<head><title>The Dormouse's story</title></head>

soup = BeautifulSoup(html_doc, 'lxml')

body_tag = soup.body

for parent in body_tag.parents:

    print(parent)

#<html><head><title>The Dormouse's story</title></head>

#<body>

#<p class="title"><b>The Dormouse's story</b></p>

#<p class="story">Once upon a time there were three little sisters; and their names were

#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,

#....
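
Printing each ancestor in full gets verbose quickly. Printing only the names makes the chain easier to read, as in this sketch (the markup is illustrative, and html.parser is used so no extra wrappers are added):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a id='x'>link</a></p></body></html>", "html.parser"
)

# Walk upward from the <a> tag; the last ancestor is the BeautifulSoup object itself
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```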

Navigating to child nodes

.contents: returns all direct children of the current node as a list

soup = BeautifulSoup(html_doc, 'lxml')

head_contents = soup.head.contents

print(head_contents)

#output-> [<title>The Dormouse's story</title>]

.children: iterates over the direct children of the current node. It returns an iterator, not a list

soup = BeautifulSoup(html_doc, 'lxml')

body_children = soup.body.children

for child in body_children:

    print(child)

#output-><p class="title"><b>The Dormouse's story</b></p>

#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,

#<a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>;

#and they lived at the bottom of a well.</p>

#.....

Strings have no .children or .contents attributes, because a string cannot contain child nodes
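
The practical difference between .contents and .children is that the first is a list you can index, while the second is an iterator you loop over. A short sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul_tag = soup.ul

contents = ul_tag.contents           # a real list: supports indexing and len()
first_item = contents[0]
contents_is_list = isinstance(contents, list)

children = ul_tag.children           # an iterator: consume it with a loop
children_is_list = isinstance(children, list)
child_names = [child.name for child in children]

print(contents_is_list, children_is_list, child_names)
```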

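Navigating to all descendants

.contents and .children only include a tag's direct children. For example, the <head> tag has a single direct child, <title>

#[<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". That string is a descendant of the <head> tag

The .descendants attribute iterates over all descendants of the current node, depth first in document order (not level by level)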

soup = BeautifulSoup(html_doc, 'lxml')

for descendant in soup.descendants:

    print(descendant)
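
The traversal order is easier to see on a tiny document. In this sketch (illustrative markup) each tag is followed by everything nested inside it before its next sibling is visited:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p><b>x</b></p><span>y</span></div>", "html.parser")

# Record tag names for tags and the text itself for strings
order = [
    node.name if isinstance(node, Tag) else str(node)
    for node in soup.div.descendants
]
print(order)  # ['p', 'b', 'x', 'span', 'y']
```

A level-by-level traversal would have visited span before b, which is not what happens here.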

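Node content

.string
- If a tag has only one child and that child is a NavigableString, .string returns that child. If the tag's only child is another tag that has a .string, the parent's .string returns the same value.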
```
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.head.string)

#The Dormouse's story

print(soup.title.string)

#The Dormouse's story
```
- If a tag contains more than one child, it cannot tell which child .string should refer to, so .string is None
```
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.body.string)

#None
```
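
When .string is None because a tag has several children, get_text() is the usual way to collect the text anyway. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

p_string = soup.p.string      # None: <p> has two children (a string and a <b> tag)
b_string = soup.b.string      # 'world': <b> has exactly one string child
p_text = soup.p.get_text()    # concatenates all nested text

print(p_string, b_string, p_text)
```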

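.strings and .stripped_strings

.strings iterates over all the text inside a tag, and .stripped_strings does the same while removing surrounding whitespace and skipping whitespace-only strings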

soup = BeautifulSoup(html_doc, 'lxml')

for string in soup.strings:

    print(string)

#The Dormouse's story

......

#The Dormouse's story

soup = BeautifulSoup(html_doc, 'lxml')

for string in soup.stripped_strings:

    print(string)

#The Dormouse's story

#The Dormouse's story

#Once upon a time there were three little sisters; and their names were

#Elsie

#,

...
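
The difference between the two generators is clearest on markup that contains indentation and line breaks. In this sketch (illustrative markup) the whitespace-only strings between tags show up in .strings but not in .stripped_strings:

```python
from bs4 import BeautifulSoup

markup = "<div>\n  <p> one </p>\n  <p>two</p>\n</div>"
soup = BeautifulSoup(markup, "html.parser")

raw = list(soup.strings)             # includes the newline/indent text nodes
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings skipped

print(raw)
print(clean)
```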

Searching the document tree with BeautifulSoup

BeautifulSoup provides several methods for searching the parsed document tree

find_all(name , attrs , recursive , string , **kwargs)

find_all() searches all tag descendants of the current tag and returns those that match the filters

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all("title"))  # find all title tags

print(soup.find_all("p", "title"))  # find p tags whose class is title

print(soup.find_all("a"))  # find all a tags

print(soup.find_all(id="link2"))  # find the tag whose id is link2

#[<title>The Dormouse's story</title>]

#[<p class="title"><b>The Dormouse's story</b></p>]

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]

#[<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>]

Let's look at each parameter in detail

The name parameter

The name parameter finds all tags whose name is name; text strings are ignored automatically

soup.find_all("title")

# [<title>The Dormouse's story</title>]

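name accepts any kind of filter: a string, a regular expression, a list, a function, and so on
Passing a string

A string is the simplest filter. Pass a string to a search method and BeautifulSoup looks for tags whose name matches it exactly
- The following example finds all <b> tags in the document
```
soup.find_all('b')

# [<b>The Dormouse's story</b>]
```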
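Passing a regular expression

If you pass a regular expression object, BeautifulSoup filters tag names with its search() method
- Find tags whose names start with b; both <body> and <b> should be found
```
import re

soup = BeautifulSoup(html_doc, 'lxml')

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b
```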
Passing a list

If you pass a list, Beautiful Soup returns everything that matches any item in the list
- Find all <a> tags and <b> tags in the document
```
soup = BeautifulSoup(html_doc, 'lxml')

for tag in soup.find_all(['a', 'b']):

    print(tag.name)

#b

#a

#a

#a

#b
```
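
The text above mentions that name can also be a function, but gives no example. A function filter receives each tag and returns True for the ones to keep. The sketch below assumes a made-up document and a hypothetical predicate named has_class_but_no_id:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's html_doc
markup = (
    '<p class="title">t</p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<p class="story">s</p>'
)
soup = BeautifulSoup(markup, "html.parser")


def has_class_but_no_id(tag):
    """Keep tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
matched_classes = [tag["class"] for tag in matches]
print(matched_classes)
```

A function filter is useful when the condition cannot be expressed as a single name, attribute value, or regular expression.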

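The **kwargs parameter

In BeautifulSoup, **kwargs (keyword arguments) let you find tags by their attributes. They can be passed directly to find and find_all, which makes searches more powerful. The attribute name is the keyword, and the value can be a string, a regular expression, or a list

Using a keyword argument

Pass the attribute as key='word'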

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(id='link1'))

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

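Using a regular expression

Use a regular expression from Python's re module to match attribute values for a more flexible search

import re

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all('a', href=re.compile("elsie")))  # find a tags whose href contains elsie

print(soup.find_all(string=re.compile("^The")))  # find strings that start with The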

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

#["The Dormouse's story", "The Dormouse's story"]

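Using a list

You can pass a list as the value of a keyword argument. BeautifulSoup matches any one of the values in the list

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find('a', id=['link1', 'link2']))  # find the first a tag whose id is link1 or link2

print(soup.find_all(class_=['sister', 'story']))  # find tags whose class is sister or story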

#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>

#[<p class="story">Once upon a time there were three little sisters; and their names were

#...

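Special attribute names

Some HTML attribute names clash with Python reserved words or are not valid Python identifiers, so BeautifulSoup provides alternatives

class_: matches the class attribute, since class is a reserved word in Python
data-*: custom data-* attributes cannot be written as keyword arguments; pass them in the attrs dictionary instead

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all('p', class_="title"))  # find all p tags whose class is title

print(soup.find_all('p', attrs={'class': 'story'}))  # find all p tags whose class is story, using the attrs dictionary form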

#[<p class="title"><b>The Dormouse's story</b></p>]

#[<p class="story">Once upon a time there were three little sisters; and their names were

#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,

#<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a> and
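
Since a name such as data-role is not a valid Python keyword argument, the attrs dictionary is the way to search on it. A minimal sketch, assuming made-up markup with a hypothetical data-role attribute:

```python
from bs4 import BeautifulSoup

# Illustrative markup; data-role is a hypothetical custom attribute
markup = (
    '<div data-role="card">A</div>'
    '<div data-role="nav">B</div>'
    '<p class="card">C</p>'
)
soup = BeautifulSoup(markup, "html.parser")

by_data_attr = soup.find_all(attrs={"data-role": "card"})  # dictionary form
by_class = soup.find_all(class_="card")                    # class_ keyword form

print(by_data_attr)
print(by_class)
```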

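The text/string parameter

The text/string parameter lets you search by text content. Like name, it accepts several kinds of values, including strings, regular expressions, lists, and True. Early bs4 releases called this parameter text; recent releases call it string

Matching with a string

You can pass a string as the value of string, and BeautifulSoup finds all text strings that exactly equal it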

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(string='Elsie'))

#['Elsie']

Matching with a regular expression

import re

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(string=re.compile('sister'), limit=2))  # find at most the first two strings containing sister

print(soup.find_all(string=re.compile('Dormouse')))  # find strings containing Dormouse

#['Once upon a time there were three little sisters; and their names were\n']

#["The Dormouse's story", "The Dormouse's story"]

Matching with a list

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(string=['Elsie', 'Lacie', 'Tillie']))

#['Elsie', 'Lacie', 'Tillie']
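
Used on its own, string returns the matching text nodes. Combined with a tag name, it returns the tags whose .string matches instead. A short sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/e">Elsie</a><a href="/l">Lacie</a>', "html.parser"
)

matching_strings = soup.find_all(string="Elsie")    # text nodes
matching_tags = soup.find_all("a", string="Elsie")  # <a> tags whose .string is 'Elsie'

print(matching_strings)
print(matching_tags)
```

The second form is usually more convenient when the goal is to read an attribute, such as href, from the tag that holds the text.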

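The limit parameter

The limit parameter caps the number of results find_all returns. When you only need the first few tags, limit makes the search more efficient. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and the results are returned

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all('a', limit=2))  # find a tags, returning at most 2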

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>]

find_parents() and find_parent()

BeautifulSoup provides find_parents() and find_parent() for searching upward through the ancestors of a node in the parsed tree. The main difference between them is how many results they return

find_parent(name=None, attrs={}, **kwargs): returns only the nearest matching ancestor (the first match)
find_parents(name=None, attrs={}, limit=None, **kwargs): returns all matching ancestors, ordered from nearest to farthest

soup = BeautifulSoup(html_doc, 'lxml')

a_string = soup.find(string='Lacie')

print(a_string.find_parent())  # find the direct parent

print('-----------------')

print(a_string.find_parents())  # find all ancestors

#<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>

#-----------------

#[<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>, <p class="story">Once upon a time there were three little sisters; and their names were

#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,

#<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a> and

#and they lived at the bottom of a well.</p>, <body>....]
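
Both methods accept the same filters as find_all, which is helpful when only certain ancestors matter. A minimal sketch with made-up markup, filtering ancestors by tag name:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p class="story"><a id="link2">Lacie</a></p></body>', "html.parser"
)
a_string = soup.find(string="Lacie")

nearest = a_string.find_parent()            # the enclosing <a> tag
paragraph = a_string.find_parent("p")       # nearest ancestor named p
selected = a_string.find_parents(["p", "body"])  # nearest to farthest

print(nearest.name, paragraph["class"], [tag.name for tag in selected])
```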

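CSS selectors in BeautifulSoup

In CSS, a tag name is written with no prefix, a class name is prefixed with a dot, and an id is prefixed with #. BeautifulSoup lets you filter elements in the same way

select(selector, namespaces=None, limit=None, **kwargs)

The select() method finds elements in an HTML document with a CSS selector. Like find_all(), it returns a list of all matching elements

selector: a string containing the CSS selector to apply. It can be a simple tag selector, a class selector, or an id selector

Finding by tag name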

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('b'))

#[<b>The Dormouse's story</b>, <b><!--This is a comment--></b>]

Finding by class name

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('.title'))

#[<p class="title"><b>The Dormouse's story</b></p>]

Finding by id

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('#link1'))

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

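Combined selectors

Combined selectors work just as they do in CSS: tag names, class names, and ids can be combined

e.g. find the element with id link1 inside a p tag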

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('p #link1'))

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

Find tags that match both a class selector and an id selector

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('.story#text'))

Find tags using several class selectors and one id selector

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select(".story .sister#link1"))

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

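Attribute selectors

Select tags that have a particular attribute or attribute value

Simple attribute selector

Select tags by a particular attribute

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select("a[href='https://example.com/elsie']"))  # select a tags whose href is https://example.com/elsie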

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

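Attribute value selectors

Select tags whose attribute has a particular value
- Exact match: [attribute="value"]
- Partial match
 - Contains a word: [attribute~="value"] selects tags whose attribute value contains the given whitespace-separated word
 - Starts with: [attribute^="value"] selects tags whose attribute value starts with the given string
 - Ends with: [attribute$="value"] selects tags whose attribute value ends with the given string
 - Contains a substring: [attribute*="value"] selects tags whose attribute value contains the given substring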
```
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('a[href^="https://example.com"]')) # select a tags whose href starts with https://example.com

#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]
```
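
The other operators from the list work the same way, and select_one() returns only the first match (or None) rather than a list. The sketch below uses a made-up fragment modelled on the links shown in this article and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Illustrative fragment, an assumption modelled on the article's example links
page = """<p class="story">
<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(page, "html.parser")

ends_with = soup.select('a[href$="tillie"]')       # href ends with "tillie"
contains = soup.select('a[href*="/la"]')           # href contains "/la"
direct_child = soup.select_one("p.story > a#link2")  # child combinator plus id
no_match = soup.select_one("div.missing")          # nothing matches

print([tag["id"] for tag in ends_with])
print([tag["id"] for tag in contains])
print(direct_child.get_text(), no_match)
```

Checking the result of select_one() for None before using it avoids an AttributeError on pages where the element is absent.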
