介绍及安装

Beautiful Soup 是一个HTML/XML的解析器，主要的功能也是如何解析和提取 HTML/XML 数据。

BeautifulSoup 用来解析 HTML 比较简单，API非常人性化，支持CSS选择器、Python标准库中的HTML解析器，也支持 lxml 的 XML解析器。

Beautiful Soup 3 目前已经停止开发，推荐现在的项目使用Beautiful Soup 4。使用 pip 安装即可：pip install beautifulsoup4

四大对象种类

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种:

Tag
NavigableString
BeautifulSoup
Comment

1.Tag

Tag 通俗点讲就是 HTML 中的一个个标签。

print(soup.p)

# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))

# <class 'bs4.element.Tag'>

我们可以利用 soup 加标签名轻松地获取这些标签的内容，这些对象的类型是bs4.element.Tag。

对于 Tag，它有两个重要的属性，是 name 和 attrs.

print(soup.name)

# [document] soup 对象本身比较特殊，它的 name 即为 [document]

print(soup.head.name)

# head 对于其他内部标签，输出的值便为标签本身的名称

print(soup.p.attrs)

# {'class': ['title'], 'name': 'dromouse'}

print(soup.p['class'])

# ['title']

NavigableString

NavigableString简单来讲就是一个可以遍历的字符串。

例如:

print(soup.p.string)

# The Dormouse's story

print(type(soup.p.string))

#  <class 'bs4.element.NavigableString'>

搜索文档

Beautiful Soup定义了很多搜索方法,这里着重介绍2个: find() 和 find_all() .其它方法的参数和用法类似。

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

使用find_all等类似的方法可以查找想要的文档内容。

在介绍find_all方法之前，先介绍一下过滤器的类型。

字符串

最简单的过滤器是字符串。在搜索方法中传入一个字符串参数,BeautifulSoup会查找与字符串完整匹配的内容。

例如:

查找所有的b标签。

soup.find_all('b')

# [<b>The Dormouse's story</b>]

正则表达式

find_all方法可以接受正则表示式作为参数，BeautifulSoup会通过match方法来匹配内容。

匹配以b开头的标签

for tag in soup.find_all(re.compile('^b')):

    print(tag.name)

#  body  b

匹配包含t的标签

for tag in soup.find_all(re.compile('t')):

    print(tag.name)

# html  title

列表

find_all方法也能接受列表参数，BeautifulSoup会将与列表中任一元素匹配的内容返回。

查找a标签和b标签

for tag in soup.find_all(['a','b']):

    print(tag.name)

# b a a a

方法

如果没有合适的过滤器,我们也可以自己定义一个方法，方法只接受一个元素参数。

匹配包含class属性，但是不包括id属性的标签。

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

print([tag.name for tag in soup.find_all(has_class_but_no_id)])

# ['p','p','p']

css选择器

这就是另一种与 find_all 方法有异曲同工之妙的查找方法.

写 CSS 时，标签名不加任何修饰，类名前加.，id名前加#
在这里我们也可以利用类似的方法来筛选元素，用到的方法是 soup.select()，返回类型是 list

1.通过标签名查找

print(soup.select('title'))

#[<title>The Dormouse's story</title>]

print(soup.select('a'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))

#[<b>The Dormouse's story</b>]

2.通过类名查找

print(soup.select('.sister'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3.通过 id 名查找

print(soup.select('#link1'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

4.组合查找

组合查找即和写 class 文件时，标签名与类名、id名进行的组合原理是一样的，例如查找 p 标签中，id 等于 link1的内容，二者需要用空格分开

print(soup.select('p #link1'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

直接子标签查找，则使用 > 分隔

print(soup.select("head > title"))

#[<title>The Dormouse's story</title>]

5.属性查找

查找时还可以加入属性元素，属性需要用中括号括起来，注意属性和标签属于同一节点，所以中间不能加空格，否则会无法匹配到。

print(soup.select('a[class="sister"]'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

同样，属性仍然可以与上述查找方式组合，不在同一节点的空格隔开，同一节点的不加空格

print(soup.select('p a[href="http://example.com/elsie"]'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

6.获取内容

以上的 select 方法返回的结果都是列表形式，可以遍历形式输出，然后用 get_text() 方法来获取它的内容。

soup = BeautifulSoup(html, 'lxml')

print(type(soup.select('title')))

print(soup.select('title')[0].get_text())

for title in soup.select('title'):

    print(title.get_text())

===============================================================================================

通过tag标签逐层查找:

soup.select("body a")

找到某个tag标签下的直接子标签

soup.select("head > title")

通过CSS的类名查找:

soup.select(".sister")

通过tag的id查找:

soup.select("#link1")

通过是否存在某个属性来查找:

soup.select('a[href]')

通过属性的值来查找:

soup.select('a[href="http://example.com/elsie"]')

网页解析之BeautifulSoup的更多相关文章

转：Python网页解析：BeautifulSoup vs lxml.html
转自:http://www.cnblogs.com/rzhang/archive/2011/12/29/python-html-parsing.html Python里常用的网页解析库有Beautif ...
关于爬虫中常见的两个网页解析工具的分析 —— lxml / xpath 与 bs4 / BeautifulSoup
http://www.cnblogs.com/binye-typing/p/6656595.html 读者可能会奇怪我标题怎么理成这个鬼样子,主要是单单写 lxml 与 bs4 这两个 py 模块名可 ...
【Python爬虫】BeautifulSoup网页解析库
BeautifulSoup 网页解析库阅读目录初识Beautiful Soup Beautiful Soup库的4种解析器 Beautiful Soup类的基本元素基本使用标签选择器节点操作 ...
第6章网页解析器和BeautifulSoup第三方插件
第一节网页解析器简介作用:从网页中提取有价值数据的工具python有哪几种网页解析器?其实就是解析HTML页面正则表达式:模糊匹配结构化解析-DOM树:html.parserBeautiful So ...
网页解析：Xpath 与 BeautifulSoup
1. Xpath 1.1 Xpath 简介 1.2 Xpath 使用案例 2. BeautifulSoup 2.1 BeautifulSoup 简介 2.2 BeautifulSoup 使用案例 1) ...
Beautifulsoup网页解析——爬取豆瓣排行榜分类接口
我们在网页爬取的过程中,会通过requests成功的获取到所需要的信息,而且,在返回的网页信息中,也是通过HTML代码的形式进行展示的.HTML代码都是通过固定的标签组合来实现页面信息的展示,所以,最 ...
Python网页解析
续上篇文章,网页抓取到手之后就是解析网页了. 在Python中解析网页的库不少,我最开始使用的是BeautifulSoup,貌似这个也是Python中最知名的HTML解析库.它主要的特点就是容错性很好 ...
【爬虫入门手记03】爬虫解析利器beautifulSoup模块的基本应用
[爬虫入门手记03]爬虫解析利器beautifulSoup模块的基本应用 1.引言网络爬虫最终的目的就是过滤选取网络信息,因此最重要的就是解析器了,其性能的优劣直接决定这网络爬虫的速度和效率.Bea ...
Python HTML解析器BeautifulSoup(爬虫解析器)
BeautifulSoup简介我们知道,Python拥有出色的内置HTML解析器模块——HTMLParser,然而还有一个功能更为强大的HTML或XML解析工具——BeautifulSoup(美味的 ...

随机推荐

在线API文档管理工具Simple doc
Simple doc是一个简易的文档发布管理工具,为什么要写Simple doc呢?主要原因还是github的wiki并不好用:没有目录结构,文章没有Hx标签索引,最悲剧的是文章编辑的时候不能直接图片 ...
C# 倒计时，显示天，时，分，秒。时间可以是从数据库捞出来
从数据库把时间读出来,接着你用个timer控件启用控件,设置1000毫秒timer时间里用当前时间-你取出的时间就可以了 DateTime furtime = Convert.ToDateTim ...
shell 队列实现线程并发控制
需求:并发检测1000台web服务器状态(或者并发为1000台web服务器分发文件等)如何用shell实现? 方案一:(这应该是大多数人都第一时间想到的方法吧) 思路:一个for循环1000次,顺序执 ...
表格导出到excel的样式消失该如何修改
工作中遇到一需求,要将后台的表格导出到excel后的表格样式该如何修改呢? 其实表格导出首先需要了解两个插件:jquery.table2excel.js和tableExport.js 1.第一步,写一 ...
洛谷 P 5 3 0 4 [GXOI/GZOI2019]旅行者
题目描述 J 国有 n 座城市,这些城市之间通过 m 条单向道路相连,已知每条道路的长度. 一次,居住在 J 国的 Rainbow 邀请 Vani 来作客.不过,作为一名资深的旅行者,Vani 只对 ...
CF600E Lomsat gelral——线段树合并/dsu on tree
题目描述一棵树有$n$个结点,每个结点都是一种颜色,每个颜色有一个编号,求树中每个子树的最多的颜色编号的和. 这个题意是真的窒息...具体意思是说,每个节点有一个颜色,你要找的是每个子树中颜色的众数 ...
go中的数据结构切片-slice
1.部分基本类型 go中的类型与c的相似,常用类型有一个特例:byte类型,即字节类型,长度为,默认值是0: bytes = []btye{'h', 'e', 'l', 'l', 'o'} 变量byt ...
[转载]1.4 UiPath参数的介绍和使用
一.参数介绍用于将数据从一个项目传递到另一个项目.在全局意义上,它们类似于变量,因为它们动态地存储数据并传递给它.变量在活动之间传递数据,而参数在自动化之间传递数据.因此,它们使你能够一次又一次地重 ...
php memcache 缓存与memcached 客户端的详细步骤
缓存服务器有Memcache.Redis,我主要介绍了PHP中的Memcache,从Memcache简介开始,详细讲解了如Memcache和memcached的区别.PHP的 Memcache所有操作 ...
linux shell脚本语法笔记
1.&,&&,|,|| &:除了最后一个cmd,前面的cmd均已后台方式静默执行,执行结果显示在终端上,个别的cmd错误不影响整个命令的执行,全部的cmd同时执行 &a ...

网页解析之BeautifulSoup