Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时甚至数天的工作时间.

快速开始

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

　　使用BeautifulSoup解析这段代码,能够得到一个 BeautifulSoup 的对象,并能按照标准的缩进格式的结构输出:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

print(soup.prettify())

# <html>

#  <head>

#   <title>

#    The Dormouse's story

#   </title>

#  </head>

#  <body>

#   <p class="title">

#    <b>

#     The Dormouse's story

#    </b>

#   </p>

#   <p class="story">

#    Once upon a time there were three little sisters; and their names were

#    <a class="sister" href="http://example.com/elsie" id="link1">

#     Elsie

#    </a>

#    ,

#    <a class="sister" href="http://example.com/lacie" id="link2">

#     Lacie

#    </a>

#    and

#    <a class="sister" href="http://example.com/tillie" id="link2">

#     Tillie

#    </a>

#    ; and they lived at the bottom of a well.

#   </p>

#   <p class="story">

#    ...

#   </p>

#  </body>

# </html>

　　几个简单的浏览结构化数据的方法：

soup.title

# <title>The Dormouse's story</title>

soup.title.name

# u'title'

soup.title.string

# u'The Dormouse's story'

soup.title.parent.name

# u'head'

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# u'title'

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

　　从文档中找到所有<a>标签的链接:

for link in soup.find_all('a'):

    print(link.get('href'))

    # http://example.com/elsie

    # http://example.com/lacie

    # http://example.com/tillie

　　从文档中获取所有文字内容:

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

#

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

#

# ...

如何使用

将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象, 可以传入一段字符串或一个文件句柄.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

　　首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码

BeautifulSoup("Sacré bleu!")

<html><head></head><body>Sacré bleu!</body></html>

　　然后,Beautiful Soup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档.

对象的种类

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: Tag , NavigableString , BeautifulSoup , Comment .

Tag

Tag 对象与XML或HTML原生文档中的tag相同:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')

tag = soup.b

type(tag)

# <class 'bs4.element.Tag'>

　　Tag有很多方法和属性,在遍历文档树和搜索文档树中有详细解释.现在介绍一下tag中最重要的属性: name和attributes.

Name

每个tag都有自己的名字,通过 .name 来获取:

tag.name

# u'b'

如果改变了tag的name,那将影响所有通过当前Beautiful Soup对象生成的HTML文档:

tag.name = "blockquote"

tag

# <blockquote class="boldest">Extremely bold</blockquote>

Attributes

一个tag可能有很多个属性. tag <b class="boldest"> 有一个 “class” 的属性,值为 “boldest” . tag的属性的操作方法与字典相同:

tag['class']

# u'boldest'

　　也可以直接”点”取属性, 比如: .attrs :

tag.attrs

# {u'class': u'boldest'}

　　tag的属性可以被添加,删除或修改. 再说一次, tag的属性操作方法与字典一样

tag['class'] = 'verybold'

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

Beautiful Soup 4.2.0 文档（一）的更多相关文章

Beautiful Soup 4.2.0 文档
Beautiful Soup 4.2.0 文档 Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方 ...
吴裕雄--天生自然python学习笔记：Beautiful Soup 4.2.0模块
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时 ...
Beautiful Soup 4.2.0 doc_tag、Name、Attributes、多值属性
找到了bs4的中文文档,对昨天爬虫程序里所涉及的bs4库进行学习.这篇代码涉及到tag.Name.Attributes以及多值属性. ''' 对象的种类 Beautiful Soup将复杂HTML文档 ...
Beautiful Soup 4.4.0 基本使用方法
Beautiful Soup 4.4.0 基本使用方法Beautiful Soup 安装 pip install beautifulsoup4 标准库有html.parser解析器但速度不是很快一般 ...
vue mand-mobile按2.0文档默认安装的是1.6.8版本
vue mand-mobile按2.0文档默认安装的是1.6.8版本 npm list mand-mobilebigbullmobile@1.0.0 E:\webcode\bigbullmobile` ...
css2.0文档查阅及字体样式
css2.0文档查阅下载网址:http://soft.hao123.com/soft/appid/9517.html <html xmlns="http://www.w3.o ...
Beautiful Soup 4.2.0
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式快速开始 pip install beaut ...
Django2.0文档
第四章模板 1.标签 (1)if/else {% if %} 标签检查(evaluate)一个变量,如果这个变量为真(即,变量存在,非空,不是布尔值假),系统会显示在 {% if %} 和 {% e ...
11.Django2.0文档
第四章模板 1.标签 (1)if/else {% if %} 标签检查(evaluate)一个变量,如果这个变量为真(即,变量存在,非空,不是布尔值假),系统会显示在 {% if %} 和 {% e ...

随机推荐

Elasticsearch示例
/** * @author: yqq * @date: 2019/2/28 * @description: */ public class TestMain { private static Rest ...
VUE Node模式下，如何改变菜单的颜色，如何将超长文字缩略显示，在鼠标进入后展开全部显示，鼠标移出则恢复缩略显示
VUE Node模式下,如何改变菜单的颜色,如何将超长文字缩略显示,在鼠标进入后展开全部显示,鼠标移出则恢复缩略显示: “事件”引起变量值的变化,系统引擎自动根据变量值的变化刷新页面在VUE Nod ...
Django之模型层（2）
Django之模型层(2) 一.创建模型实例:我们来假定下面这些概念,字段和关系. 作者模型:一个作者由姓名和年龄. 作者详细模型:把作者的详情放到详情表,包含生日,手机号,家庭住址等信息.作者详情 ...
SpringBoot 2.0 + Apache Dubbo 2.7.3 最新版整合方案
前言 2018年2月16日,Apache Dubbo 加入 Apache 基金会孵化器.2019年5月16日,Apache 软件基金会董事会决议通过了 Apache Dubbo 的毕业申请,这意味着 ...
腾讯云和阿里云部署web 项目tomcat 日志中文变成问号
在部署项目到云上的时候,遇到了tomcat logs 日志中文变问号的问题,今天终于得到解决了这是中文变成问号的的截图打开到tomcat bin 目录的文件夹找到catalina.sh 文件 ...
38 (OC)* 进程、线程、堆栈
一.进程和线程 1.什么是进程进程是指在系统中正在运行的一个应用程序每个进程之间是独立的,每个进程均运行在其专用且受保护的内存空间内比如同时打开QQ.Xcode,系统就会分别启动2个进程通过“ ...
Scala函数式编程（三） scala集合和函数
前情提要: scala函数式编程(二) scala基础语法介绍 scala函数式编程(二) scala基础语法介绍前面已经稍微介绍了scala的常用语法以及面向对象的一些简要知识,这次是补充上一章的 ...
如何编写出高质量的 equals 和 hashcode 方法？
什么是 equals 和 hashcode 方法? 这要从 Object 类开始说起,我们知道 Object 类是 Java 的超类,每个类都直接或者间接的继承了 Object 类,在 Object ...
CocosCreator实现动物同化
获取源码关注微信公众号『一枚小工』,发送『动物同化』获取完整游戏源码. 游戏玩法游戏目标是将游戏区域的动物全部同化成同一种动物.游戏从左上角开始,从右边点击需要变成的目标动物头像,如果被同化动 ...
基于API和SQL的基本操作【DataFrame】
写在前面: 当得到一个DataFrame对象之后,可以使用对象提供的各种API方法进行直接调用,进行数据的处理. // =====基于dataframe的API=======之后的就都是DataFra ...

Beautiful Soup 4.2.0 文档（一）