Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

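I. What is BeautifulSoup?

Answer: a flexible, convenient web page parsing library that is efficient and supports multiple parsers.

With it you can extract information from web pages without writing regular expressions.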

II. Installation

pip3 install beautifulsoup4

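III. Usage

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Very fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Very slow; requires an external Python package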
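As a minimal sketch of choosing a parser at runtime: the helper `make_soup` below is hypothetical (it is not part of BeautifulSoup). It tries lxml first and falls back to the built-in html.parser, assuming bs4 raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


# Deliberately broken markup: both parsers close the open tags for us
soup = make_soup("<p>Hello<b>world")
print(soup.b.string)
```

Different parsers may repair the same broken markup in different ways, so it is safer to pin one parser for a whole project.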

IV. Basic usage

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # re-indented markup; the parser supplies the missing closing tags
print(soup.title.string)  # The Dormouse's story

V. Tag selectors

The examples below use the lxml parser.

1. Selecting elements

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # the text inside it
print(type(soup.title))   # <class 'bs4.element.Tag'>
print(soup.href)          # None: there is no tag named "href"
print(soup.p)             # only the first <p> tag
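Attribute-style access such as `soup.p` stops at the first matching tag. The sketch below contrasts it with `find_all()`; the two-paragraph markup is made up for illustration, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """
<p class="title">first</p>
<p class="story">second</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # attribute access returns only the first <p>
all_p = soup.find_all("p")  # find_all collects every <p>
missing = soup.table        # a name that matches no tag yields None

print(first_p.string)
print(len(all_p))
print(missing)
```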

2. Getting the tag name

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # title

3. Getting attributes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # dormouse
print(soup.p['name'])        # same result, shorter form
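Two details are easy to miss when reading attributes: `class` comes back as a list, and indexing a missing attribute raises `KeyError`. A small sketch, using made-up one-line markup and html.parser to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dormouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]           # class is multi-valued, so a list comes back
name = p["name"]               # ordinary attributes come back as strings
missing = p.get("id")          # .get() returns None instead of raising KeyError
fallback = p.get("id", "n/a")  # .get() also accepts a default value

print(classes, name, missing, fallback)
```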

4. Getting content

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)  # The Dormouse's story
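`.string` only works when a tag has a single child; otherwise it is `None`, and `get_text()` is the safer choice. The sketch below shows both cases on made-up markup (html.parser is used to avoid the lxml dependency).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>Hello <b>world</b></p><div><i>only child</i></div>", "html.parser"
)

mixed = soup.p.string       # None: <p> has two children (a text node and <b>)
joined = soup.p.get_text()  # concatenates every text node under <p>
nested = soup.div.string    # a single child chain, so .string reaches the inner text

print(mixed, joined, nested)
```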

5. Nested selection

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)  # selections can be chained: head, then title, then its text

6. Children and descendants

.contents returns a tag's direct children as a list

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>
and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # .contents returns the tag's direct children as a list

.children is an iterator over a tag's direct children

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # prints an iterator object, not the nodes themselves
for i, child in enumerate(soup.p.children):
    print(i, child)

.descendants iterates over all of a tag's descendants (children, grandchildren, and so on)

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # prints a generator object, not the nodes themselves
for i, child in enumerate(soup.p.descendants):
    print(i, child)
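The difference between children and descendants is easiest to see by counting them. A minimal sketch on made-up markup (html.parser is used to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<a><b>two</b></a></p>", "html.parser")

children = list(soup.p.children)        # direct children only: 'one' and <a>
descendants = list(soup.p.descendants)  # 'one', <a>, <b>, 'two'

print(len(children), len(descendants))
```

Text nodes count as nodes too, which is why `'one'` and `'two'` appear in the results alongside the tags.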

7. Parent and ancestor nodes

.parent returns the parent node

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # .parent returns the parent node (here the enclosing <p>)

.parents iterates over all ancestor nodes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # .parents yields every ancestor, nearest first
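Printing whole ancestor nodes is noisy, so a quick way to inspect the chain is to list only the tag names. A minimal sketch on made-up markup (html.parser is used to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")

# Walk upward from <b>; the last entry is the BeautifulSoup object itself
ancestor_names = [tag.name for tag in soup.b.parents]
print(ancestor_names)
```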

　
8. Sibling nodes

.next_siblings iterates over the siblings that follow a node

.previous_siblings iterates over the siblings that precede a node

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # .next_siblings: the siblings after the node
print(list(enumerate(soup.a.previous_siblings)))  # .previous_siblings: the siblings before the node
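Note that `.previous_siblings` walks backward, so the nearest sibling comes first. The sketch below shows the order on made-up markup with no whitespace between tags (html.parser is used to avoid the lxml dependency).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="1">A</a><a id="2">B</a><a id="3">C</a></p>', "html.parser"
)

first = soup.a
last = soup.find("a", id="3")

after = [a["id"] for a in first.next_siblings]      # document order
before = [a["id"] for a in last.previous_siblings]  # reverse document order

print(after, before)
```

In real pages the whitespace between tags shows up as text-node siblings, so filter for tags before indexing attributes.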

Standard selectors

1. find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # a list of every <ul> tag
print(type(soup.find_all('ul')[0]))  # <class 'bs4.element.Tag'>

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))  # each result is a Tag, so it can be searched again

attrs

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))     # match on the id attribute
print(soup.find_all(attrs={'class': 'element'}))  # match on the class attribute
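Attribute filters can be passed either as an `attrs` dictionary or as keyword arguments; `class` is a Python keyword, so the keyword form is spelled `class_`. A minimal sketch on shortened, made-up markup (html.parser is used to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all(attrs={"id": "list-1"})  # dictionary form
by_id = soup.find_all(id="list-1")               # keyword form, same result
by_class = soup.find_all(class_="element")       # note the trailing underscore

print(len(by_dict), len(by_id), len(by_class))
```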

text

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # text matches text nodes and returns strings, not tags
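Text matching also accepts a compiled regular expression. The sketch below assumes a bs4 release (4.4.0 or later) in which `string` is the preferred name for the `text` argument; the markup is made up, and html.parser is used to avoid the lxml dependency.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>", "html.parser")

exact = soup.find_all(string="Foo")                 # exact match on text nodes
pattern = soup.find_all(string=re.compile("^Ba"))   # regex match on text nodes
tags = soup.find_all("li", string=re.compile("a"))  # with a tag name, tags are returned

print(exact, pattern, [t.string for t in tags])
```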

2. find(name, attrs, recursive, text, **kwargs)

find returns a single element (the first match); find_all returns all matching elements.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))        # the first <ul> only
print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
print(soup.find('page'))      # None, because no tag matches
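Because `find()` returns `None` when nothing matches, chaining directly onto its result can raise `AttributeError`. One way to guard against that is sketched below; `text_or_default` is a hypothetical helper, not part of BeautifulSoup, and html.parser is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, tag_name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    tag = soup.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup("<div><h4> Hello </h4></div>", "html.parser")

found = text_or_default(soup, "h4")
absent = text_or_default(soup, "page", default="(none)")

print(found, absent)
```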

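3. Others

find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next() and find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first such match.

find_all_previous() and find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such match.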
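The methods above can be tried side by side from a single starting node. This is a sketch on made-up markup, starting from the first "Bar" item; html.parser is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """
<div id="body">
<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul id="list-2"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # the "Bar" item in list-1

parent_id = bar.find_parent("ul")["id"]                           # nearest <ul> ancestor
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]     # matching ancestors, nearest first
next_item = bar.find_next_sibling("li").string                    # first sibling after
prev_item = bar.find_previous_sibling("li").string                # first sibling before
later = [li.string for li in bar.find_all_next("li")]             # crosses into list-2
earlier = [li.string for li in bar.find_all_previous("li")]       # everything before, by tag

print(parent_id, ancestors, next_item, prev_item, later, earlier)
```

Unlike the sibling methods, `find_all_next()` and `find_all_previous()` are not limited to the same parent, which is why `later` includes the items of the second list.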

CSS selectors

Pass a CSS selector directly to select() to make a selection.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # a leading . selects by class
print(soup.select('ul li'))                  # "ul li" selects li tags nested inside ul tags
print(soup.select('#list-2 .element'))       # a leading # selects by id
print(type(soup.select('ul')[0]))            # <class 'bs4.element.Tag'>
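A few more selector forms are worth knowing: `select_one()` for the first match, `>` for direct children, and `[attr="value"]` for attribute matches. This sketch assumes a bs4 release that bundles CSS support through the soupsieve package (4.7 or later); the markup is shortened and made up, and html.parser is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list" id="list-2"><li class="element">Jay</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.select_one("li.element")         # first match only, or None
direct = soup.select("#list-2 > li")             # direct children of #list-2
by_attr = soup.select('ul[name="elements"] li')  # attribute selector
nothing = soup.select("table")                   # no match gives an empty list

print(first_li.string, len(direct), len(by_attr), nothing)
```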

1. Getting attributes

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # index the tag directly
    print(ul.attrs['id'])  # or go through the attrs dictionary

2. Getting content

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())  # the text content of each <li>
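When `get_text()` is called on a container rather than a leaf tag, the whitespace between child tags is included. The `separator` and `strip` arguments, or the `stripped_strings` generator, help clean that up. A minimal sketch on made-up markup (html.parser is used to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

html = """
<ul id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

raw = soup.ul.get_text()                             # keeps newlines and indentation
clean = soup.ul.get_text(separator=",", strip=True)  # strips each piece, then joins
items = list(soup.ul.stripped_strings)               # the same pieces as a list

print(repr(raw))
print(clean)
print(items)
```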

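Summary

Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors offer weak filtering but are fast.
Use find() and find_all() to match a single result or multiple results.
If you are familiar with CSS selectors, use select().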


Python crawler knowledge summary (6): the BeautifulSoup library in detail
