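Using the Python BeautifulSoup Library

BeautifulSoup is a Python library for extracting data from HTML and XML files. It uses a parser to turn the document into a tree that is easy to navigate, search, and modify.

BeautifulSoup3 is no longer developed. BeautifulSoup4 is the recommended version, and it now lives in the bs4 package.

# Import before use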

from bs4 import BeautifulSoup

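Parsers

BeautifulSoup4 commonly works with four parsers. html.parser ships with Python; the third-party ones must be installed before use. lxml, for example, can be installed in any of these ways:

# Installation methods for different systems

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

# In PyCharm you can also write "import xxx", click the install action offered on the flagged import, and delete the import statement once the package is installed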

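Parser advantages and disadvantages
Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(DocumentName, "html.parser")	Built into Python, no separate install; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(DocumentName, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(DocumentName, "xml") or BeautifulSoup(DocumentName, ["lxml","xml"])	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(DocumentName, "html5lib")	Most lenient; parses the document the way a browser does; produces valid HTML5	Slow; requires an external Python library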

Results produced by the different parsers:

# Parsing well-formed HTML
htmldoc = "<a><p></p></a>"
print("None        :", BeautifulSoup(htmldoc))
print("html.parser :", BeautifulSoup(htmldoc, "html.parser"))
print("lxml        :", BeautifulSoup(htmldoc, "lxml"))
print("xml         :", BeautifulSoup(htmldoc, "lxml-xml"))
print("html5lib    :", BeautifulSoup(htmldoc, "html5lib"))

"""
Output:
None        : <html><body><a><p></p></a></body></html>
html.parser : <a><p></p></a>
lxml        : <html><body><a><p></p></a></body></html>
xml         : <?xml version="1.0" encoding="utf-8"?>
              <a><p/></a>
html5lib    : <html><head></head><body><a><p></p></a></body></html>
"""

# Parsing malformed HTML
htmldoc = "<a></p></a>"
print("None        :", BeautifulSoup(htmldoc))
print("html.parser :", BeautifulSoup(htmldoc, "html.parser"))
print("lxml        :", BeautifulSoup(htmldoc, "lxml"))
print("xml         :", BeautifulSoup(htmldoc, "lxml-xml"))
print("html5lib    :", BeautifulSoup(htmldoc, "html5lib"))

"""
Output:
None        : <html><body><a></a></body></html>
html.parser : <a></a>
lxml        : <html><body><a></a></body></html>
xml         : <?xml version="1.0" encoding="utf-8"?>
              <a/>
html5lib    : <html><head></head><body><a><p></p></a></body></html>
"""

html5lib completes every missing tag and wraps the result in html, head, and body, producing a standard HTML document. html.parser and lxml simply drop the stray closing tag (lxml still adds html and body). The same happens when no parser is named, because BeautifulSoup then picks the best parser installed, which was lxml in the output above.
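Because the result depends on which parsers are installed, it can help to name the parser explicitly and fall back in a known order. The sketch below is one way to do that, not part of BeautifulSoup itself: `make_soup` and its `preferred` tuple are hypothetical names, and the only library behaviour it relies on is that `BeautifulSoup` raises `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup and the name of the parser used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<a></p></a>")
print(used, "->", soup)
```

html.parser is always available, so the loop ends there at the latest. The printed tree still differs by parser, as the comparison above shows.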

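Encoding

Every HTML or XML document is written in some encoding, but once BeautifulSoup has parsed it the document is held as Unicode, and output is encoded as UTF-8 by default.

BeautifulSoup uses its encoding auto-detection sub-library to identify the document's encoding and convert it to Unicode. Detection occasionally guesses wrong; the .original_encoding attribute shows which encoding was chosen.

If you already know the encoding, passing it as from_encoding avoids wrong guesses and can speed up parsing, because that encoding is tried first.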
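As a minimal sketch of that round trip (bytes in a known encoding, Unicode inside the tree, bytes in a chosen encoding on the way out), the example below parses a short ISO-8859-8 document. It assumes html.parser so that no wrapping tags are added; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hebrew text stored as ISO-8859-8 bytes
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

# Inside the tree the text is an ordinary Unicode string
heading = soup.h1.string

# Encoding a tag returns bytes in the requested encoding
utf8_bytes = soup.h1.encode("utf-8")
# Characters the target encoding cannot represent become numeric references
latin1_bytes = soup.h1.encode("latin-1")

print(heading)
print(soup.original_encoding)
print(utf8_bytes)
print(latin1_bytes)
```

The last line prints character references such as `&#1501;` because Latin-1 has no Hebrew letters.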

htmldoc = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(htmldoc, from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)

"""
Output:
<h1>םולש</h1>
iso8859-8
"""

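Specifying the output encoding:

htmldoc = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(htmldoc, from_encoding="iso-8859-8")
print(soup.prettify("latin-1"))

"""
Output: a bytes object encoded as Latin-1. The Hebrew letters have no Latin-1 representation, so they are written as numeric character references (&#1501;&#1493;&#1500;&#1513;).
"""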

Navigating the document tree:

1. Comments (<class 'bs4.element.Comment'>) and replacing a Comment

from bs4 import BeautifulSoup, CData

htmldoc = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(htmldoc)
comment = soup.b.string
print(comment)
print(type(comment))
print(soup.b)
print(soup.b.prettify())  # a Comment is rendered with its <!-- --> markers

# Replace the Comment
cdata = CData("A CData block")
comment.replace_with(cdata)
print(soup.b.prettify())

"""
Output:
Hey, buddy. Want to buy a used parser?
<class 'bs4.element.Comment'>
<b><!--Hey, buddy. Want to buy a used parser?--></b>
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
<b>
 <![CDATA[A CData block]]>
</b>
"""

CData must be imported before use: from bs4 import CData
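A common follow-up is to find every comment in a document and strip it out. Since Comment is a subclass of NavigableString, one way is to filter text nodes by type. The snippet is a small sketch with made-up markup, using html.parser so that it runs without extra installs.

```python
from bs4 import BeautifulSoup, Comment

html = "<p>Hello<!-- hidden note -->world</p>"
soup = BeautifulSoup(html, "html.parser")

# Select the text nodes that are Comment instances
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
print(comments)

for comment in comments:
    comment.extract()  # detach the comment from the tree

print(soup)
```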

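2. soup.tagName returns a <class 'bs4.element.Tag'>: the first tagName tag in the document, equivalent to soup.find("tagName")

soup.div  # the first div tag in the document

soup.a  # the first a tag in the document

3. soup.tagName.get_text() / soup.tagName.text returns a <class 'str'> holding the text inside the tag. It works on any Tag as well as on the BeautifulSoup object itself.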

soup.a.get_text()

soup.a.text

4. soup.tagName.tagName["AttributeName"] returns the value of an attribute. Chain tag names with . to walk down level by level, and read an attribute with ["attribute name"].

soup.div.a['href']
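Items 2 to 4 can be tried together on a tiny document. The markup below is invented for illustration. The sketch also shows `Tag.get()`, which returns None (or a default) when the attribute is missing, whereas the `["..."]` form raises KeyError.

```python
from bs4 import BeautifulSoup

html = '<div><a href="/first" id="a1">First</a><a>Second</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.div.a              # first <a> inside the first <div>
print(first_link.get_text())         # text content of the tag
print(first_link["href"])            # attribute value

second_link = soup.find_all("a")[1]  # this <a> has no href attribute
print(second_link.get("href"))       # None instead of a KeyError
print(second_link.get("href", "#"))  # explicit default
```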

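5. soup.ul.contents returns a <class 'list'> whose elements can be accessed by index.

The list elements are of type <class 'bs4.element.Tag'> or <class 'bs4.element.NavigableString'>.

If a tag has a single child, .string returns its text; if it has more than one child, .string returns None.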

soup.ul.contents

type(soup.ul.contents) #<class 'list'>

soup.ul.contents[0].string

type(soup.ul.contents[0])
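The difference between .contents, .string, and get_text() is easiest to see on a list whose items have different shapes. This is a minimal sketch with invented markup; the space between the two li tags is deliberate, to show that text nodes appear in .contents too.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li> <li><b>two</b> more</li></ul>"
soup = BeautifulSoup(html, "html.parser")

children = soup.ul.contents
kinds = [type(child).__name__ for child in children]
print(kinds)  # tags and text nodes side by side

first_li, last_li = children[0], children[-1]
print(first_li.string)     # single child: its text
print(last_li.string)      # two children (<b> and a text node): None
print(last_li.get_text())  # get_text() joins all nested text
```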

6. find_all(name, attrs, recursive, text, limit, **kwargs) returns a <class 'bs4.element.ResultSet'>. Since Beautiful Soup 4.4.0 the text argument is also available under the name string.

import re

# All tags named tagName
soup.find_all("tagName")
soup.find_all("a")

# All tags named tagName whose class is className
soup.find_all("tagName", "className")
soup.find_all("div", "Score")

# All tags whose id is idName
soup.find_all(id="idName")

# Several keyword arguments filter on several attributes at once
soup.find_all(href=re.compile("elsie"))
soup.find_all(href=re.compile("elsie"), id="link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

# Pass a dictionary through the attrs parameter to search for such attributes
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

# All tags that have an id attribute
soup.find_all(id=True)

# All tags named tagName whose class is "className"
soup.find_all("tagName", class_="className")

# Filter on the CSS class with a regular expression
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

# Find text nodes by exact content
soup.find_all(text="textContent")
soup.find_all(text=["Content1", "Content2"])

# Filter text nodes with a regular expression
soup.find_all(text=re.compile("Dormouse"))

# Limit the number of results
soup.find_all("tagName", limit=2)
elemSet = soup.find_all("div", limit=2)

# Iterate over the results
for item in elemSet:
    print(item)
    print(item.text)
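The filters above can be checked against a cut-down version of the "Dormouse" document. The sketch below is illustrative only: `has_class_but_no_id` is a hypothetical helper showing that find_all also accepts a function, and `string=` is used as the newer name of the text argument.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name + CSS class + result limit
by_class = soup.find_all("a", class_="sister", limit=2)

# Regular expression on an attribute value
by_regex = soup.find_all(href=re.compile("lacie|tillie"))

# Regular expression on text nodes
by_text = soup.find_all(string=re.compile("Dormouse"))


def has_class_but_no_id(tag):
    """Hypothetical filter: keep tags that carry a class but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = soup.find_all(has_class_but_no_id)

print([a["id"] for a in by_class])
print([a["id"] for a in by_regex])
print(by_text)
print([tag["class"] for tag in by_function])
```

Note that `tag["class"]` is a list, because class is a multi-valued attribute.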

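7. Searching with CSS selectors. select() returns a <class 'list'> (recent versions return a ResultSet, which is a list subclass) whose elements are of type <class 'bs4.element.Tag'>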

htmldoc = """

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(htmldoc, "lxml")

# Find by tag name
soup.select("title")

# Find tags nested at any depth beneath other tags
soup.select("html head title")
soup.select("head title")
soup.select("body a")

# Find direct children of a tag
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")

# Find by CSS class
soup.select(".sister")

# Find by id
soup.select("#link1")

# Find by attribute
soup.select("a[href]")
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

type(soup.select("a[href]")) # <class 'list'>

# Loop over the result to get each tag
links = soup.select("a[href]")  # avoid naming the variable "list", which would shadow the built-in
for item in links:
    print(item)
    print(type(item)) # <class 'bs4.element.Tag'>
    print(item.text)
    print(item.string)
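When only the first match is needed, select_one() returns a single Tag, or None if nothing matches, which avoids indexing into a possibly empty list. The following sketch uses a shortened copy of the sample document and html.parser; the dictionary of link texts and URLs is just one illustrative way to consume the result.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Map link text to URL for every direct <a class="sister"> child of <p class="story">
links = {a.get_text(): a["href"] for a in soup.select("p.story > a.sister")}
print(links)

first = soup.select_one("#link1")       # a single Tag
missing = soup.select_one("a.brother")  # no match, so None
print(first["href"], missing)
```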
