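Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

I. What is BeautifulSoup?

Answer: a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.

  With it you can extract information from a web page without writing regular expressions.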

II. Installation

pip3 install beautifulsoup4

III. Usage

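Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents; parses the way a browser does; produces HTML5-style documents | Slow; requires an external Python package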
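
Which of these parsers can be used depends on what is installed, so a script may want to try them in order of preference. The following is a minimal sketch rather than part of the original tutorial: the helper name make_soup and the preference order are assumptions chosen for illustration. It relies on bs4 raising FeatureNotFound when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns a (soup, parser_name) tuple.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately broken markup: every parser in the table repairs it.
soup, used = make_soup("<p>Hello<b>world")
print(used, soup.b.string)
```

Because html.parser ships with Python, the last entry in the tuple should always succeed.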

IV. Basic usage

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

  

V. Tag selectors

The examples below use the lxml parser.

1. Selecting elements

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(soup.title.string)
print(type(soup.title))
print(soup.head)
print(soup.p)
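
Attribute-style access such as soup.p returns only the first matching tag, and it returns None when no such tag exists. The short sketch below illustrates both cases. The markup is made up for the example, and html.parser is used only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p             # only the first <p> is returned
missing = soup.table         # there is no <table>, so this is None
all_p = soup.find_all("p")   # use find_all() when every match is needed

print(first_p.string, missing, len(all_p))
```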

2. Getting the tag name

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

3. Getting attributes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
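
Two details are easy to miss when reading attributes: class is a multi-valued attribute and comes back as a list, and indexing a missing attribute raises KeyError while .get() returns None. A minimal sketch on made-up markup (html.parser is used so that lxml is not required):

```python
from bs4 import BeautifulSoup

html = '<p class="title story" name="dormouse">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_attr = p["name"]    # an ordinary attribute is a plain string
classes = p["class"]     # class is multi-valued, so this is a list of strings
missing = p.get("id")    # .get() returns None instead of raising KeyError

print(name_attr, classes, missing)
```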

  

4. Getting content

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)
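
.string only works when a tag has exactly one child. For a tag with several children it returns None, and for a tag that contains only a comment it returns a Comment object. get_text() is the usual alternative for mixed content. The sketch below uses shortened, made-up markup and html.parser as assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<p>Once <a id="link1"><!--Elsie--></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

p_string = soup.p.string                          # None: <p> has several children
p_text = soup.p.get_text()                        # text gathered from all descendants
link1_string = soup.find("a", id="link1").string  # a Comment, not ordinary text
link2_string = soup.find("a", id="link2").string  # a normal string

print(repr(p_string), repr(p_text), type(link1_string).__name__, link2_string)
```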

  

5. Nested selection

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

  

6. Children and descendants

.contents returns a tag's direct children as a list
html = '''
<html><head><title>The Dormouse's story</title></head>
<body> <p class="story">
Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>
and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)  # .contents returns the tag's direct children as a list

  

.children is an iterator over a tag's direct children; text nodes, including those that hold only a newline, count as children
html = '''
<html><head><title>The Dormouse's story</title></head>
<body> <p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)  # .children is an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):
    print(i, child)

  

.descendants is a generator over all of a tag's descendants (children, grandchildren, and so on)
html = '''
<html><head><title>The Dormouse's story</title></head>
<body> <p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)  # .descendants is a generator over all descendants of the tag
for i, child in enumerate(soup.p.descendants):
    print(i, child)
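
The difference between direct children and descendants is easier to see on a small tree. This is a sketch on made-up markup, parsed with html.parser so that it does not need lxml. Text nodes are included in both iterators, so the tag-only list filters on the Tag type.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p>one <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Direct children only: the two <p> tags.
child_names = [child.name for child in div.children]

# Every nested node, depth first; keep only the tags for readability.
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

# Tags and text nodes together.
descendant_count = len(list(div.descendants))

print(child_names, descendant_tags, descendant_count)
```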

  

7. Parent and ancestors

.parent returns the parent node
html = '''
<html><head><title>The Dormouse's story</title></head>
<body> <p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)  # .parent returns the parent node

  

.parents iterates over all ancestor nodes
html = '''
<html><head><title>The Dormouse's story</title></head>
<body> <p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents)))  # .parents iterates over all ancestor nodes
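
Printing whole ancestor nodes produces a lot of output, so it can help to look only at their names. A minimal sketch on made-up markup, using html.parser as an assumption so that no extra wrapper tags are added. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name                          # the direct parent
ancestor_names = [node.name for node in soup.a.parents]   # nearest ancestor first

print(parent_name, ancestor_names)
```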

 
8. Siblings

.next_siblings iterates over the siblings that come after a node
.previous_siblings iterates over the siblings that come before a node
html = '''
<html><head><title>The Dormouse's story</title></head>
<body> <p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))  # .next_siblings: the siblings after this node
print(list(enumerate(soup.a.previous_siblings)))  # .previous_siblings: the siblings before this node
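
Sibling iteration includes text nodes, so the node right after a tag is often a piece of punctuation or whitespace rather than the next tag. The sketch below shows this on shortened, made-up markup and filters on the Tag type to keep only elements. html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
last = soup.find("a", id="link3")

next_node = first.next_sibling  # the text node ", ", not the next <a>
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
# previous_siblings walks backwards, so the closest sibling comes first.
preceding_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]

print(repr(next_node), following_ids, preceding_ids)
```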

  

Standard selectors

1. find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

  

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

  

attrs

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # match on attributes given as a dict
print(soup.find_all(attrs={'class': 'element'}))
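
Attribute filters can be passed either as an attrs dict or as keyword arguments. The dict form is needed for an HTML attribute called name, because the name argument of find_all() is already the tag name, and class needs the spelling class_ because class is a Python keyword. A minimal sketch on shortened, made-up markup (html.parser is used so that lxml is not required):

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list" id="list-2"><li class="element">Jay</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all(attrs={"name": "elements"})  # dict form, needed for "name"
by_id = soup.find_all(id="list-2")                   # keyword shortcut
by_class = soup.find_all("li", class_="element")     # class_ because class is reserved
limited = soup.find_all("li", limit=2)               # stop after two matches

print(len(by_dict), by_id[0]["id"], len(by_class), len(limited))
```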

  

text

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))  # text matches text nodes (strings) only; it does not return tags
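
Text matching also accepts a compiled regular expression, and it can be combined with a tag name to get the enclosing tags back instead of bare strings. The sketch below uses the string keyword, which is the name Beautiful Soup 4.4.0 and later use for the text argument; the markup is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                 # text nodes equal to "Foo"
pattern = soup.find_all(string=re.compile("a"))     # text nodes containing "a"
tags = soup.find_all("li", string=re.compile("a"))  # <li> tags whose text contains "a"

exact_texts = [str(s) for s in exact]
pattern_texts = [str(s) for s in pattern]
tag_texts = [t.get_text() for t in tags]
print(exact_texts, pattern_texts, tag_texts)
```

A matched text node still knows where it came from: exact[0].parent is the li tag that contains it.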

  

2. find(name, attrs, recursive, text, **kwargs)

find returns the first matching element, or None if nothing matches; find_all returns all matching elements

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
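
Because find() returns None when nothing matches, chaining an attribute or method call onto its result can raise AttributeError. One way to guard against that is a small wrapper; the helper name first_text and its default parameter are hypothetical, introduced only for this sketch.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag_name, default=""):
    """Hypothetical helper: stripped text of the first <tag_name>, or default."""
    tag = soup.find(tag_name)
    # find() returns None when there is no match, so check before using it.
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup("<div><h4> Hello </h4></div>", "html.parser")
found = first_text(soup, "h4")
fallback = first_text(soup, "page", default="n/a")
print(found, fallback)
```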

 

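3. Other methods

find_parents() and find_parent()

find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor (the direct parent when called without arguments)

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all matching siblings that follow the node; find_next_sibling() returns the first one

find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all matching siblings that precede the node; find_previous_sibling() returns the closest one

find_all_next() and find_next()

find_all_next() returns every matching node that comes after this node in the document; find_next() returns the first such node

find_all_previous() and find_previous()

find_all_previous() returns every matching node that comes before this node in the document; find_previous() returns the closest such node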
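
The methods above all search relative to a starting node. The following sketch runs several of them from one li element. The markup is made up and written without whitespace between tags so that no stray text nodes get in the way, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = ('<div class="panel-body">'
        '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
        '<ul id="list-2"><li>Baz</li></ul>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # starting node for every search below

parent_id = bar.find_parent("ul")["id"]                              # nearest <ul> ancestor
ancestor_names = [t.name for t in bar.find_parents(["ul", "div"])]   # nearest first
next_item = bar.find_next_sibling("li").get_text()                   # sibling after Bar
prev_item = bar.find_previous_sibling("li").get_text()               # sibling before Bar
later_items = [li.get_text() for li in bar.find_all_next("li")]      # crosses into list-2
next_list_id = bar.find_next("ul")["id"]                             # first <ul> after Bar

print(parent_id, ancestor_names, next_item, prev_item, later_items, next_list_id)
```

The sibling methods stay inside the same parent, while find_all_next() and find_next() keep going through the rest of the document.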

 

CSS selectors

Pass a CSS selector string directly to select() to make a selection

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))  # a leading . selects by class
print(soup.select('ul li'))  # 'ul li' selects li elements nested inside ul elements
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
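
Besides class, id, and descendant selectors, select() accepts other common CSS forms such as the child combinator and attribute selectors, and select_one() returns just the first match or None. A minimal sketch on shortened, made-up markup; html.parser is an assumption here so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list" id="list-2"><li class="element">Jay</li></ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

first_ul = soup.select_one("ul.list")          # first match only
direct_items = soup.select("ul#list-1 > li")   # child combinator
by_attr = soup.select('ul[id="list-2"] li')    # attribute selector
nothing = soup.select_one("table")             # None when nothing matches

direct_texts = [li.get_text() for li in direct_items]
attr_texts = [li.get_text() for li in by_attr]
print(first_ul["id"], direct_texts, attr_texts, nothing)
```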

  

1. Getting attributes

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

  

2. Getting content

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
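
In practice the selected text usually ends up in a data structure rather than being printed. The sketch below groups the item texts by the id of their list. get_text(strip=True) trims surrounding whitespace, and separator joins the individual pieces. The variable names and the markup are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="list-1">
  <li class="element"> Foo </li>
  <li class="element">Bar</li>
</ul>
<ul id="list-2">
  <li class="element">Jay</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# Map each list id to the cleaned text of its items.
items_by_list = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}

# With strip=True, whitespace-only pieces are dropped before joining.
joined = soup.select_one("#list-1").get_text(separator="|", strip=True)

print(items_by_list, joined)
```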

  

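Summary

  • Prefer the lxml parser; fall back to html.parser when necessary
  • Tag selectors offer weak filtering but are fast
  • Use find() and find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()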
