BeautifulSoup4 Library Explained

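1. Introduction to BeautifulSoup4

What is Beautiful Soup?

Answer: a web-page parsing library for extracting data from HTML and XML. It supports several parsers and can stand in for complicated regular-expression work.

2. Installation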

pip3 install beautifulsoup4

3. Usage in Detail

(1) Parser performance analysis (the first argument, markup, is the markup to parse; the second argument is the parser)
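Since the parser is chosen by the second constructor argument, it can help to fall back gracefully when a third-party parser is missing. The following is a minimal sketch, not part of the original post: `make_soup` is a hypothetical helper, and the order of preference (lxml first, then the standard library's html.parser) is an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Hypothetical helper: try each parser in order, return (soup, parser_name)."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Beautiful Soup raises FeatureNotFound when the parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hello <b>world</b></p>")
print(used)           # "lxml" if it is installed, otherwise "html.parser"
print(soup.b.string)  # world
```

html.parser ships with Python, so the fallback always succeeds; lxml has to be installed separately (for example with pip3 install lxml).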

(2) How to use it (the Nine Swords of Dugu)

1. General Formula:

#author: "xian"

#date: 2018/5/7

# The following markup is an excerpt from Alice in Wonderland

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

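# A quick first try

from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 package

soup = BeautifulSoup(html, 'lxml')  # build an object named soup, using the lxml parser

print(soup.prettify())  # prettify() re-indents the markup so it is easier on the eyes

print(soup.a)  # select the first a tag

print(soup.a['class'])  # print the value of the a tag's class attribute

print(soup.a.name)  # print the a tag's name; soup.a.parent.name gives the name of its parent

print(soup.a.string)  # Guess what this does? Answer: it prints the a tag's text

print(soup.find_all('a'))  # find all a tags

print(soup.find(id="link3"))  # find the tag whose id attribute is link3

# Finding links

for link in soup.find_all('a'):

    print(link.get('href'))  # loop over every a tag and get its link

# Finding text

print(soup.a.get_text())  # get the a tag's text; you can of course target any content you like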

# Output of the above

'''<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

['sister']

a

Elsie

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

http://example.com/elsie

http://example.com/lacie

http://example.com/tillie

Elsie'''

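You can retrieve whatever else you need in the same way; the key is to master the methods. For details, see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/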
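Two details of the calls above are easy to miss: tag['attr'] and tag.get('attr') behave differently when the attribute is absent, and .string differs from get_text() when a tag has several children. Here is a small sketch on a made-up snippet (not the Alice markup), using html.parser instead of lxml so that it runs without extra installs:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link deliberately has no href.
html = ('<p class="story big">'
        '<a href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")
print(first["href"])       # http://example.com/elsie
print(second.get("href"))  # None; second["href"] would raise KeyError
print(soup.p["class"])     # ['story', 'big'] (class is a multi-valued attribute)
print(soup.p.string)       # None, because <p> has more than one child
print(soup.p.get_text())   # ElsieLacie
```

When scraping links from real pages, get() is usually the safer choice, because not every a tag carries an href.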

2. Sword-Breaking Form

 #author: "xian"

 #date: 2018/5/7

 html = """

 <html>

     <head>

         <title>The Dormouse's story</title>

     </head>

     <body>

         <p class="story">

             Once upon a time there were three little sisters; and their names were

             <a href="http://example.com/elsie" class="sister" id="link1">

                 <span>Elsie</span>

             </a>

             <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

             and

             <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

             and they lived at the bottom of a well

         </p>

         <p class="story">...</p>

 """

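 # Selecting child and descendant nodes (parent and ancestor nodes are selected in a similar way)

 from bs4 import BeautifulSoup

 soup = BeautifulSoup(html,'lxml')

 print(soup.p.contents) # .contents returns the direct children as a list

 print(soup.p.children) # .children is an iterator over the direct children only; loop over it to see its contents

 for i,child in enumerate(soup.p.children): # enumerate() pairs each item of an iterable (list, tuple, string, ...) with its index; it is typically used in for loops

     print(i,child) # i is the index, child is the node

 print(soup.p.descendants)  # .descendants is a generator over all children, grandchildren, and deeper nodes

 for i,child in enumerate(soup.p.descendants):

     print(i,child)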

 # Output of the above

 '''['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well\n        ']

 <list_iterator object at 0x00000156B2E76EF0>

 0

             Once upon a time there were three little sisters; and their names were

 1 <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>

 2

 3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

 4

             and

 5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 6

             and they lived at the bottom of a well

 <generator object descendants at 0x00000156B08910F8>

 0

 Once upon a time there were three little sisters; and their names were

 1 <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>

 2

 3 <span>Elsie</span>

 4 Elsie

 5

 6

 7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

 8 Lacie

 9

             and

 10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 11 Tillie

 12

         and they lived at the bottom of a well'''

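 # Parent and ancestor nodes: .parent is the counterpart of .children, and .parents is the counterpart of .descendants; try them yourself, following the example above

 # Sibling nodes: next_siblings yields the siblings after the current object, previous_siblings yields the siblings before it; give them a try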
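The parent, ancestor, and sibling attributes mentioned above can be tried with a sketch like the following. The markup is a shortened, made-up variant of the example with no whitespace between tags (so that every sibling is a tag rather than a text node), and html.parser is used instead of lxml to avoid extra installs.

```python
from bs4 import BeautifulSoup

html = ('<html><body><p class="story">'
        '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

# .parent is the direct parent; .parents walks up to the document root.
print(link2.parent.name)                 # p
print([p.name for p in link2.parents])   # ['p', 'body', 'html', '[document]']

# next_siblings / previous_siblings are generators, so loop over them.
print([s["id"] for s in link2.next_siblings])      # ['link3']
print([s["id"] for s in link2.previous_siblings])  # ['link1']
```

On real pages the siblings often include whitespace strings such as '\n', as the .contents output above shows, so indexing a sibling with ["id"] may need a type check first.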

3. Saber-Breaking Form

 #author: "xian"

 #date: 2018/5/7

 # Searching the document: find_all() and find()

 html = """

 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 """

 from bs4 import BeautifulSoup

 import re

 soup = BeautifulSoup(html,'lxml')

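 # (1) find_all( name , attrs , recursive , text , **kwargs )

 # The name argument in detail (the text argument, called string in newer Beautiful Soup releases, is used much like name, e.g. soup.find_all(text=["Tillie", "Elsie", "Lacie"]), but returns only the matching strings; see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

 print(soup.find_all('head')) # find the head tag

 print(soup.find_all(id='link2')) # find tags with id='link2'

 print(soup.find_all(href=re.compile(r"(\w+)"))) # find all tags whose href attribute contains at least one word character (letter, digit, or underscore)

 print(soup.find_all(href=re.compile(r"(\w+)"), id='link1')) # combine several filters

 # When searching by attribute name, the value may be a string, a regular expression, a list, or True

 # The attrs argument in detail

 print(soup.find_all(attrs={'id':'link2'})) # attrs takes a dict of attribute names and values

 print(type(soup.find_all(attrs={'id':'link2'}))) # the result is a bs4.element.ResultSet, which behaves like a list

 # (2) find( name , attrs , recursive , text , **kwargs ) works like find_all, except that it returns only the first match; see the official documentation for details

 # (3) Other methods at a glance (just be aware of them and check the documentation when you need one)

 # find_parents() and find_parent(): return all ancestor nodes / return the direct parent node

 # find_next_siblings() and find_next_sibling(): return all following siblings / return the first following sibling

 # find_previous_siblings() and find_previous_sibling(): return all preceding siblings / return the first preceding sibling

 # find_all_next() and find_next(): return all matching nodes after the current node / return the first matching one

 # find_all_previous() and find_previous(): return all matching nodes before the current node / return the first matching one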

 # Output of the above:

 '''

 [<head><title>The Dormouse's story</title></head>]

 [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 <class 'bs4.element.ResultSet'>

 '''
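The filter types listed above (string, regular expression, list, True), and the difference between find() and find_all(), can be compared side by side. This is a sketch on a small made-up snippet, parsed with html.parser rather than lxml so that it runs without extra installs:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the third link deliberately has no href.
html = ('<p class="story">'
        '<a href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" id="link2">Lacie</a>'
        '<b>bold</b><a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

by_list = soup.find_all(["a", "b"])         # list: match any of these tag names
with_href = soup.find_all("a", href=True)   # True: the attribute only has to exist
by_regex = soup.find_all(id=re.compile(r"^link[12]$"))  # regex on the attribute value
print(len(by_list), len(with_href), len(by_regex))      # 4 2 2

first_a = soup.find("a")       # find() returns the first match ...
missing = soup.find("table")   # ... or None when nothing matches
print(first_a["id"], missing)  # link1 None

# Two of the navigation helpers from the summary above.
print(first_a.find_next_sibling("b").string)     # bold
print(soup.find("b").find_parent("p")["class"])  # ['story']
```

Because find() returns None on a miss while find_all() returns an empty result, code that chains calls after find() should check for None first.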

4. Spear-Breaking Form

 #author: "xian"

 #date: 2018/5/7

 # CSS selectors in detail (pass a CSS selector to select() to pick out elements)

 html = """

 <html><head><title>The Dormouse's story</title></head>

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 """

 from bs4 import BeautifulSoup

 soup = BeautifulSoup(html,'lxml')

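 print(soup.select('.title')) # select tags whose class is title; see the official documentation for more on CSS selectors

 # One more example

 print(soup.select('p a#link1')) # select the a tag with id link1 that sits inside a p tag

 print(soup.select('a')[1]) # index the result list to get the second a tag

 # Getting the text

 print(soup.select('a')[1].get_text())

 # Output of the above: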

 '''

 [<p class="title"><b>The Dormouse's story</b></p>]

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

 Lacie

 '''
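A few more selector forms are worth trying alongside the ones above: select_one() for a single match, the child combinator, an attribute selector, and a real slice of the result list. This is a sketch only; the markup is a condensed copy of the example, and html.parser replaces lxml so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

html = ('<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_link = soup.select_one("p.story > a")  # ">" means direct child
print(first_link.get_text())                 # Elsie

# Attribute selector: href ending in "tillie".
tillie = soup.select('a[href$="tillie"]')
print([a["id"] for a in tillie])             # ['link3']

# select() returns a list, so ordinary slicing works too.
rest = soup.select("a.sister")[1:]
print([a.get_text() for a in rest])          # ['Lacie', 'Tillie']

print(soup.select_one("table"))              # None
```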

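After working through the experiments above, you should have a reasonable feel for the bs4 library. Now get hands-on and put what you have learned to the test!

Summary:

1. The lxml parser is recommended.

2. Make heavy use of find_all() and find().

3. Get comfortable with the CSS select() method.

4. Practice a lot: diligence makes up for shortcomings, practice makes perfect, and only then does mastery follow!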
