BeautifulSoup4库

和 lxml 一样，Beautiful Soup 也是一个HTML/XML的解析器，主要的功能也是如何解析和提取 HTML/XML 数据。
lxml 只会局部遍历，而Beautiful Soup 是基于HTML DOM（Document Object Model）的，会载入整个文档，解析整个DOM树，因此时间和内存开销都会大很多，所以性能要低于lxml。
BeautifulSoup 用来解析 HTML 比较简单，API非常人性化，支持CSS选择器、Python标准库中的HTML解析器，也支持 lxml 的 XML解析器。
Beautiful Soup 3 目前已经停止开发，推荐现在的项目使用Beautiful Soup 4。

安装和文档：

安装：pip install bs4。
中文文档：https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

几大解析工具对比：

解析工具	解析速度	使用难度
BeautifulSoup	最慢	最简单
lxml	快	简单
正则	最快	最难

简单使用：

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

#创建 Beautiful Soup 对象

# 使用lxml来进行解析

soup = BeautifulSoup(html,"lxml")

print(soup.prettify())

四个常用的对象：

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种:

Tag
NavigatableString
BeautifulSoup
Comment

1. Tag：

Tag 通俗点讲就是 HTML 中的一个个标签。示例代码如下：

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

#创建 Beautiful Soup 对象

soup = BeautifulSoup(html,'lxml')

print soup.title

# <title>The Dormouse's story</title>

print soup.head

# <head><title>The Dormouse's story</title></head>

print soup.a

# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print soup.p

# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print type(soup.p)

# <class 'bs4.element.Tag'>

我们可以利用 soup 加标签名轻松地获取这些标签的内容，这些对象的类型是bs4.element.Tag。但是注意，它查找的是在所有内容中的第一个符合要求的标签。如果要查询所有的标签，后面会进行介绍。
对于Tag，它有两个重要的属性，分别是name和attrs。示例代码如下：

print soup.name

# [document] #soup 对象本身比较特殊，它的 name 即为 [document]

print soup.head.name

# head #对于其他内部标签，输出的值便为标签本身的名称

print soup.p.attrs

# {'class': ['title'], 'name': 'dromouse'}

# 在这里，我们把 p 标签的所有属性打印输出了出来，得到的类型是一个字典。

print soup.p['class'] # soup.p.get('class')

# ['title'] #还可以利用get方法，传入属性的名称，二者是等价的

soup.p['class'] = "newClass"

print soup.p # 可以对这些属性和内容等等进行修改

# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

2. NavigableString：

如果拿到标签后，还想获取标签中的内容。那么可以通过tag.string获取标签中的文字。示例代码如下：

print soup.p.string

# The Dormouse's story

print type(soup.p.string)

# <class 'bs4.element.NavigableString'>thon

3. BeautifulSoup：

BeautifulSoup 对象表示的是一个文档的全部内容.大部分时候,可以把它当作 Tag 对象,它支持遍历文档树和搜索文档树中描述的大部分的方法.
因为 BeautifulSoup 对象并不是真正的HTML或XML的tag,所以它没有name和attribute属性.但有时查看它的 .name 属性是很方便的,所以 BeautifulSoup 对象包含了一个值为 “[document]” 的特殊属性 .name

soup.name

# '[document]'

4. Comment：

Tag , NavigableString , BeautifulSoup 几乎覆盖了html和xml中的所有内容,但是还有一些特殊对象.容易让人担心的内容是文档的注释部分:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup)

comment = soup.b.string

type(comment)

# <class 'bs4.element.Comment'>

Comment 对象是一个特殊类型的 NavigableString 对象:

comment

# 'Hey, buddy. Want to buy a used parser'

遍历文档树：

1. contents和children：

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc,'lxml')

head_tag = soup.head

# 返回所有子节点的列表

print(head_tag.contents)

# 返回所有子节点的迭代器

for child in head_tag.children:

    print(child)

2. strings 和 stripped_strings

如果tag中包含多个字符串 [2] ,可以使用 .strings 来循环获取：

for string in soup.strings:

    print(repr(string))

    # u"The Dormouse's story"

    # u'\n\n'

    # u"The Dormouse's story"

    # u'\n\n'

    # u'Once upon a time there were three little sisters; and their names were\n'

    # u'Elsie'

    # u',\n'

    # u'Lacie'

    # u' and\n'

    # u'Tillie'

    # u';\nand they lived at the bottom of a well.'

    # u'\n\n'

    # u'...'

    # u'\n'

输出的字符串中可能包含了很多空格或空行,使用 .stripped_strings 可以去除多余空白内容：

for string in soup.stripped_strings:

    print(repr(string))

    # u"The Dormouse's story"

    # u"The Dormouse's story"

    # u'Once upon a time there were three little sisters; and their names were'

    # u'Elsie'

    # u','

    # u'Lacie'

    # u'and'

    # u'Tillie'

    # u';\nand they lived at the bottom of a well.'

    # u'...'

搜索文档树：

1. find和find_all方法：

搜索文档树，一般用得比较多的就是两个方法，一个是find，一个是find_all。find方法是找到第一个满足条件的标签后就立即返回，只返回一个元素。find_all方法是把所有满足条件的标签都选到，然后返回回去。使用这两个方法，最常用的用法是出入name以及attr参数找出符合要求的标签。

soup.find_all("a",attrs={"id":"link2"})

或者是直接传入属性的的名字作为关键字参数：

soup.find_all("a",id='link2')

2. select方法：

使用以上方法可以方便的找出元素。但有时候使用css选择器的方式可以更加的方便。使用css选择器的语法，应该使用select方法。以下列出几种常用的css选择器方法：

（1）通过标签名查找：

print(soup.select('a'))

（2）通过类名查找：

通过类名，则应该在类的前面加一个.。比如要查找class=sister的标签。示例代码如下：

print(soup.select('.sister'))

（3）通过id查找：

通过id查找，应该在id的名字前面加一个＃号。示例代码如下：

print(soup.select("#link1"))

（4）组合查找：

组合查找即和写 class 文件时，标签名与类名、id名进行的组合原理是一样的，例如查找 p 标签中，id 等于 link1的内容，二者需要用空格分开：

print(soup.select("p #link1"))

直接子标签查找，则使用 > 分隔：

print(soup.select("head > title"))

（5）通过属性查找：

查找时还可以加入属性元素，属性需要用中括号括起来，注意属性和标签属于同一节点，所以中间不能加空格，否则会无法匹配到。示例代码如下：

print(soup.select('a[href="http://example.com/elsie"]'))

（6）获取内容

以上的 select 方法返回的结果都是列表形式，可以遍历形式输出，然后用 get_text() 方法来获取它的内容。

soup = BeautifulSoup(html, 'lxml')

print type(soup.select('title'))

print soup.select('title')[0].get_text()

for title in soup.select('title'):

    print title.get_text()

Python 爬虫 BeautifulSoup4 库的使用的更多相关文章

python爬虫---selenium库的用法
python爬虫---selenium库的用法 selenium是一个自动化测试工具,支持Firefox,Chrome等众多浏览器在爬虫中的应用主要是用来解决JS渲染的问题. 1.使用前需要安装这个 ...
Python爬虫Urllib库的高级用法
Python爬虫Urllib库的高级用法设置Headers 有些网站不会同意程序直接用上面的方式进行访问,如果识别有问题,那么站点根本不会响应,所以为了完全模拟浏览器的工作,我们需要设置一些Head ...
Python爬虫Urllib库的基本使用
Python爬虫Urllib库的基本使用深入理解urllib.urllib2及requests 请访问: http://www.mamicode.com/info-detail-1224080.h ...
Python爬虫—requests库get和post方法使用
目录 Python爬虫-requests库get和post方法使用 1. 安装requests库 2.requests.get()方法使用 3.requests.post()方法使用-构造formda ...
Python爬虫beautifulsoup4常用的解析方法总结（新手必看）
今天小编就为大家分享一篇关于Python爬虫beautifulsoup4常用的解析方法总结,小编觉得内容挺不错的,现在分享给大家,具有很好的参考价值,需要的朋友一起跟随小编来看看吧摘要如何用beau ...
Python网络爬虫——BeautifulSoup4库的使用
使用requests库获取html页面并将其转换成字符串之后,需要进一步解析html页面格式,提取有用信息. BeautifulSoup4库,也被成为bs4库(后皆采用简写)用于解析和处理html和x ...
python爬虫 - Urllib库及cookie的使用
http://blog.csdn.net/pipisorry/article/details/47905781 lz提示一点,python3中urllib包括了py2中的urllib+urllib2. ...
[python爬虫]Requests-BeautifulSoup-Re库方案--Requests库介绍
[根据北京理工大学嵩天老师“Python网络爬虫与信息提取”慕课课程编写文章中部分图片来自老师PPT 慕课链接:https://www.icourse163.org/learn/BIT-10018 ...
Python 爬虫解析库的使用 --- XPath
一.使用XPath XPath ,全称XML Path Language,即XML路径语言,它是一门在XML文档中查找信息的语言.它最初是用来搜寻XML文档的,但是它同样适用于HTML文档的搜索. 所 ...

随机推荐

React函数式组件和类组件[Dan]
一篇对Dan的 How Are Function Components Different from Classes? 一文的个人阅读总结,内容来自于此.强烈推荐阅读 Dan Abramov.的博客. ...
如何在 ASP.Net Core 中实现健康检查
健康检查常用于判断一个应用程序能否对 request 请求进行响应,ASP.Net Core 2.2 中引入了健康检查中间件用于报告应用程序的健康状态. ASP.Net Core 中的健康检查 ...
MySQL基础知识：MySQL Connection和Session
在connection的生命里,会一直有一个user thread(以及user thread对应的THD)陪伴它. Connection和Session概念来自Stackoverflow的一个回答 ...
codefoces D. Phoenix and Science
原题链接:https://codeforc.es/problemset/problem/1348/D 题意:给你一个体重为一克的细菌(它可以每天进行一次二分裂即一分为二体重均分:晚上体重增加1克)求最 ...
golang 性能调优分析工具 pprof (上)
一.golang 程序性能调优在 golang 程序中,有哪些内容需要调试优化? 一般常规内容: cpu:程序对cpu的使用情况 - 使用时长,占比等内存:程序对cpu的使用情况 - 使用时长,占 ...
PAT (Advanced Level) Practice 1027 Colors in Mars (20 分) 凌宸1642
PAT (Advanced Level) Practice 1027 Colors in Mars (20 分) 凌宸1642 题目描述: People in Mars represent the c ...
linux日志文件说明
/var/log/message 系统启动后的信息和错误日志,是Red Hat Linux中最常用的日志之一 /var/log/secure 与安全相关的日志信息 /var/log/maillog 与 ...
vue 快速入门系列 —— vue 的基础应用（上）
其他章节请看: vue 快速入门系列 vue 的基础应用(上) Tip: vue 的基础应用分上下两篇,上篇是基础,下篇是应用. 在初步认识 vue一文中,我们已经写了一个 vue 的 hello- ...
SCIP:构造数据抽象--数据结构中队列与树的解释
现在到了数学抽象中最关键的一步:让我们忘记这些符号所表示的对象.不应该在这里停滞不前,有许多操作可以应用于这些符号,而根本不必考虑它们到底代表着什么东西. --Hermann Weyi <思维的 ...
java面试-CountDownLatch、CyclicBarrier、Semaphore谈谈你的理解
一.CountDownLatch 主要用来解决一个线程等待多个线程的场景,计数器不能循环利用 public class CountDownLatchDemo { public static void ...

Python 爬虫 BeautifulSoup4 库的使用