A Summary of BeautifulSoup Usage

I. Introduction

BeautifulSoup is a Python library. It accepts an HTML or XML string or file and returns a BeautifulSoup object, which exposes many methods for parsing the document's contents.
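As a minimal sketch of the two input forms mentioned above, the snippet below parses the same markup once from a string and once from a file-like object. The markup, the variable names, and the use of io.StringIO as a stand-in for a real file are illustrative assumptions. The built-in "html.parser" is used here only so the sketch needs no extra parser installed.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = "<html><head><title>Demo</title></head><body></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from a file-like object; io.StringIO stands in for an open file here.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

title_from_string = soup_from_string.title.string
title_from_file = soup_from_file.title.string
print(title_from_string)
print(title_from_file)
```

With a real file, an open file handle would be passed in place of the io.StringIO object.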

II. Installation

1. Install with pip

pip install beautifulsoup4

# Install parsers for BeautifulSoup

pip install lxml

pip install html5lib

2. Install with apt-get

sudo apt-get install python3-bs4

# Install parsers for BeautifulSoup

sudo apt-get install python3-lxml

sudo apt-get install python3-html5lib

lxml is the recommended parser because it is faster than the alternatives.
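One possible way to prefer lxml while still working on machines where it is not installed is sketched below. The helper name make_soup and the sample markup are hypothetical. The fallback to the built-in "html.parser" is an assumption of this sketch, not something the original text prescribes.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, else use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
greeting = soup.p.string
print(greeting)
```

Different parsers can build slightly different trees from invalid markup, so results may vary between the two branches on malformed input.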

III. Common Methods

The examples below parse the following string:

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

1. Wrap the string in a BeautifulSoup object

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Print the document with standard indentation

print(soup.prettify())

Output:

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

2. Get a tag's name with name

print(soup.a)

print(soup.a.name)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

a

Note that accessing a tag as soup.[tag] returns only the first tag with that name. To get every matching tag, or to filter by some condition, use the find_all() method.
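The difference between attribute-style access and find_all() can be sketched as follows. The two-link snippet and the variable names are made up for illustration, and "html.parser" is used only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> tags.
snippet = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a

# find_all() returns every matching tag.
all_ids = [a["id"] for a in soup.find_all("a")]

print(first_link["id"])  # link1
print(all_ids)           # ['link1', 'link2']
```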

3. Get a tag's content with string

A tag's content is available through its string attribute.

print(soup.title.string)

Output:

The Dormouse's story

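Note that string only works when the tag contains a single child node. If the tag has several children, string returns None because it is ambiguous which child's content to return.

print(soup.body.string)

Output:

None

Use strings instead:

strings = soup.body.strings

for string in strings:

    print(string)

Output: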



The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie

,

Lacie

 and

Tillie

;

and they lived at the bottom of a well.

...

The output contains many superfluous blank lines and spaces. Use stripped_strings to remove them:

strings = soup.body.stripped_strings

for string in strings:

    print(string)

Output:

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie

,

Lacie

and

Tillie

;

and they lived at the bottom of a well.

...
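The behaviour of string and stripped_strings described in this section can be condensed into one small sketch. The markup and variable names are illustrative assumptions, and joining the pieces with a space is just one way to combine them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <div> with several children.
snippet = "<div><p>one</p> <p>two</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

# A tag with a single text child: string returns that text.
single = soup.p.string

# A tag with several children: string returns None.
multiple = soup.div.string

# stripped_strings yields each text piece with surrounding whitespace
# removed and skips pieces that are whitespace only.
joined = " ".join(soup.div.stripped_strings)

print(single)    # one
print(multiple)  # None
print(joined)    # one two
```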

4. Get a tag's attribute values

# Get the class attribute of the first <p> tag

soup.p["class"]

Output:

['title']

The result is a list because class can hold multiple values.

# Get the href attribute of the first <a> tag

soup.a["href"]

Output:

'http://example.com/elsie'
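A short sketch of how attribute access behaves for multi-valued and single-valued attributes follows. The markup is hypothetical. The use of get() for an attribute that may be absent is an addition of this sketch rather than something the original text covers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a tag with two class values and an id.
snippet = '<p class="story intro" id="first">text</p>'
soup = BeautifulSoup(snippet, "html.parser")

# class is multi-valued, so it comes back as a list.
classes = soup.p["class"]

# id is single-valued, so it comes back as a plain string.
tag_id = soup.p["id"]

# Indexing a missing attribute raises KeyError; get() returns None instead.
missing = soup.p.get("href")

print(classes)  # ['story', 'intro']
print(tag_id)   # first
print(missing)  # None
```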

5. Change a tag's attribute values

# Change the class attribute of the first <p> tag

soup.p["class"] = "new-class"

print(soup.p["class"])

# Change the href attribute of the first <a> tag

soup.a["href"] = "www.google.com"

print(soup.a["href"])

print(soup.prettify())

Output:

new-class

www.google.com

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="new-class">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="www.google.com" id="link1">

    Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>
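Because a tag's attributes behave like a dictionary, the same syntax can also add or delete attributes. The sketch below assumes a made-up one-tag document. The title attribute and the new URL are purely illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single link.
snippet = '<a href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, "html.parser")

link = soup.a
link["href"] = "http://example.com/new"  # change an existing attribute
link["title"] = "profile"                # add a new attribute
del link["id"]                           # remove an attribute

# attrs exposes the current attributes as a dictionary.
current_attrs = dict(link.attrs)
print(current_attrs)
```

The changes are made on the parsed tree, so they show up in any later prettify() or str() output of the document.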

6. The find_all method

6.1. Return all tags

# Return every <a> tag in the document; the result is a list

links = soup.find_all("a")

print(links)

Output:

[<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

6.2. Return tags by attribute

# Return every <a> tag whose class is sister; the result is a list

# class is a Python keyword, so class_ is used instead

links = soup.find_all("a", class_="sister")

print(links)

print('-' * 20)

# Same as above

links = soup.find_all("a", attrs={"class": "sister"})

print(links)

print('-' * 20)

# Return every <a> tag whose id is link2; the result is a list

links = soup.find_all("a", id="link2")

print(links)

Output:

[<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

--------------------

[<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

--------------------

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
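The equivalence of the class_ keyword and the attrs dictionary, and the result when nothing matches, can be checked with a small sketch. The three-link snippet, the "brother" class, and the id "missing" are invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links share a class, one does not.
snippet = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="brother" id="link3">Tom</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Two ways to filter on the class attribute.
ids_by_keyword = [a["id"] for a in soup.find_all("a", class_="sister")]
ids_by_attrs = [a["id"] for a in soup.find_all("a", attrs={"class": "sister"})]

# find_all() returns an empty list when no tag matches.
no_match = soup.find_all("a", id="missing")

print(ids_by_keyword)  # ['link1', 'link2']
print(ids_by_attrs)    # ['link1', 'link2']
print(len(no_match))   # 0
```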

6.3. Get the href attribute of every <a> tag

links = soup.find_all("a")

for a in links:

    print(a["href"])

Output:

www.google.com

http://example.com/lacie

http://example.com/tillie
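The loop above raises KeyError if some <a> tag has no href attribute. A hedged variant, assuming a made-up snippet that contains such a tag, passes href=True to find_all() so that only tags carrying the attribute are returned.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle <a> tag has no href attribute.
snippet = (
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a name="anchor">no link</a>'
    '<a href="http://example.com/lacie">Lacie</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# href=True matches only tags that have an href attribute, whatever its value.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

for href in hrefs:
    print(href)
```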

IV. References

1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
