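Using the BeautifulSoup library

Introduction and installation

Beautiful Soup is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them. It makes parsing HTML simple and has a friendly API. It supports CSS selectors, can use the HTML parser in the Python standard library, and also supports the lxml parsers, including the lxml XML parser. Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4.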

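# Install Beautiful Soup

pip install beautifulsoup4
# Or install from the Tsinghua PyPI mirror:
sudo pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4

# Install a parser. Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, install lxml with one of the following commands (on Python 3 the Debian/Ubuntu package is python3-lxml, and pip replaces the deprecated easy_install):
$ apt-get install python3-lxml
$ pip install lxml
$ sudo pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml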
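
A quick way to check the installation is to parse a small snippet. The sketch below is only an illustration: the helper name make_soup is hypothetical, and falling back to the built-in html.parser when lxml is missing is an assumption about what you want, not something Beautiful Soup does on its own.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, else use html.parser.

    make_soup is a hypothetical helper for this example only.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
print(soup.p.get_text())
```

If both lines print text without an exception, Beautiful Soup and at least one parser are working.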

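How to instantiate a BeautifulSoup object

- from bs4 import BeautifulSoup

        - Instantiating the object:

            - 1. Load the data of a local HTML file into the object

                    fp = open('./test.html','r',encoding='utf-8')

                    soup = BeautifulSoup(fp,'lxml')

            - 2. Load page source fetched from the internet into the object

                    page_text = response.text

                    soup = BeautifulSoup(page_text,'lxml')

        - Methods and attributes provided for parsing data:

            - soup.tagName: returns the first tag named tagName in the document

            - soup.find():

                - find('tagName'): equivalent to soup.tagName, e.g. find('div') is the same as soup.div

                - Locating by attribute:

                    - soup.find('div', class_='song'); id='song' or any other attribute=value keyword works the same way

            - soup.find_all('tagName'): returns all matching tags (as a list)

        - select:

            - select('some selector (id, class, tag... selector)') returns a list.

            - Hierarchy selectors:

                - soup.select('.tang > ul > li > a'): each > stands for exactly one level

                - soup.select('.tang > ul a'): a space stands for any number of levels

        - Getting the text between tags:

            - soup.a.text/string/get_text()

            - text/get_text(): returns all the text inside a tag, including the text of nested tags

            - string: returns the text only when the tag has a single child (a string, or a tag that itself resolves to one string); otherwise it is None

        - Getting the value of a tag attribute:

            - soup.a['href']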
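
The outline above can be tried on a small in-memory document. This is a minimal sketch: the markup, the class names tang and song, and the URLs are made up for the example, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this example.
html = """
<div class="tang">
  <ul>
    <li><a href="/poem/1" title="first">Poem <b>One</b></a></li>
    <li><a href="/poem/2">Poem Two</a></li>
  </ul>
</div>
<div id="song" class="song">Song text</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div                     # first <div> in the document
same_div = soup.find("div")              # same tag as soup.div
song = soup.find("div", class_="song")   # locate by attribute
links = soup.find_all("a")               # every <a>, as a list

direct = soup.select(".tang > ul > li > a")  # each ">" is one level
nested = soup.select(".tang > ul a")         # a space spans any depth

first_a = soup.a
print(first_a.get_text())   # all text inside the tag, nested <b> included
print(first_a.string)       # None, because the tag has two children
print(links[1].string)      # the single string inside the second link
print(first_a["href"])      # attribute value
```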

1. Tag

Put simply, a Tag is one of the individual tags in an HTML document.

# -*- coding: utf-8 -*-

# @Time    : 2018/3/1 23:05

# @Author  : hyang

# @Site    :

# @File    : BeautifulSoup_test.py

# @Software: PyCharm

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<p class="story">...</p>

"""

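# Basic usage and fault tolerance: even when the HTML is incomplete (the markup above never closes <body> or <html>), the parser recognises the problem and repairs it instead of failing.

# Parsing the markup above yields a BeautifulSoup object, which can be printed as a properly indented structure.

# from bs4 import BeautifulSoup

# soup = BeautifulSoup(html_doc,'lxml')  # tolerant of broken markup

# res = soup.prettify()  # fix the indentation and show the structure

# print(res)

# Navigating the tree: select directly by tag name. This is fast, but when several tags share the same name only the first one is returned.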

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# 1. Usage
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc,'lxml')
# soup=BeautifulSoup(open('a.html'),'lxml')
# print(soup.p) # when several tags share the name, only the first is returned
# print(type(soup.a)) # <class 'bs4.element.Tag'>
# print(soup.name) # [document]; the soup object itself is special and its name is [document]

# 2. Get the tag name
# print(soup.p.name) # p

# 3. Get the tag attributes
# print(soup.p.attrs) # {'id': 'my p', 'class': ['title']}
# print(soup.p['class']) # ['title']

# 4. Get the tag content
print(soup.p.string) # the text when p resolves to exactly one string (here through its single <b>), otherwise None
print(soup.p.strings) # a generator object over every string under p
print(soup.p.text) # all the text under p
for line in soup.stripped_strings: # every string in the document, with surrounding whitespace removed
    print(line)

'''

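If a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string is None. If the tag has exactly one child, .string is that child's text. For the structure below, soup.p.string returns None, but soup.p.strings yields all of the text.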

<p id='list-1'>

    哈哈哈哈

    <a class='sss'>

        <span>

            <h1>aaaa</h1>

        </span>

    </a>

    <b>bbbbb</b>

</p>

'''
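
The behaviour described in the note above can be checked on a compact version of that structure. This is a sketch under assumptions: the sample text is replaced with ASCII, and html.parser is used because it keeps the nesting exactly as written, which may differ from what lxml produces for an h1 inside a p.

```python
from bs4 import BeautifulSoup

html = "<p id='list-1'>haha<a class='sss'><span><h1>aaaa</h1></span></a><b>bbbbb</b></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.string)                    # None: <p> has several children
print(list(p.strings))             # every string under <p>, in document order
print(p.a.string)                  # a > span > h1 is a single-child chain, so a string is found
print(p.get_text(separator="|"))   # all text joined with a separator
```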

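# 5. Nested selection
print(soup.head.title.string)
print(soup.body.a.string)

# 6. Children and descendants
print(soup.p.contents) # a list of the direct children of p
print(soup.p.children) # an iterator over the direct children of p
for i,child in enumerate(soup.p.children):
    print(i,child)

print(soup.p.descendants) # a generator over all descendants: every tag and string nested under p
for i,child in enumerate(soup.p.descendants):
    print(i,child)

# 7. Parent and ancestors
print(soup.a.parent) # the parent of the first a tag
print(soup.a.parents) # a generator over all ancestors of a: parent, grandparent, great-grandparent...

# 8. Siblings
print('=====>')
print(soup.a.next_sibling) # the next sibling
print(soup.a.previous_sibling) # the previous sibling
print(list(soup.a.next_siblings)) # all following siblings (a generator, materialized with list())
print(soup.a.previous_siblings) # all preceding siblings (a generator object)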
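
A point that is easy to miss in the navigation calls above is that strings count as nodes too, so siblings and children are often text rather than tags. The following sketch uses a shortened, made-up variant of the sample paragraph and html.parser to show this.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p class="story">Names: <a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(html, "html.parser")

p = soup.p
child_count = len(p.contents)          # strings and tags are both children

a = soup.a
parent_names = [t.name for t in a.parents]   # parent first, document root last
next_text = a.next_sibling                   # the text right after the first link
prev_text = a.previous_sibling               # the text right before it

# Keep only the sibling nodes that are tags, skipping the text in between.
later_links = [s["id"] for s in a.next_siblings if isinstance(s, Tag)]

print(child_count, parent_names, repr(next_text), repr(prev_text), later_links)
```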

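Basic usage

# Basic usage

print(soup.prettify())  # shows the fault tolerance: missing tags are completed automatically
print(soup.a)  # only the first a tag, searched across the whole document tree
print(soup.a.text)  # the text inside the a tag
print(soup.text)  # all the text in the whole document tree
print(soup.a.attrs)  # all attributes of the a tag, as a dict
print(soup.a.attrs["href"])  # the href attribute of the a tag
print(soup.p.b)  # nested lookup; again only the first match
print(soup.p.contents)  # direct children, as a list
print(list(soup.p.children))  # children is an iterator; list() materializes it
print(list(soup.p.descendants))  # all descendants
print(soup.a.parent)  # the parent
print(list(soup.a.parents))  # all ancestors: parent, grandparent, ...
print(soup.p.find_all())  # tag-name access can be combined with find()/find_all()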

2. Find

Beautiful Soup defines many search methods. This section focuses on two of them: find() and find_all().

find_all

# Common filters

from bs4 import BeautifulSoup

html_doc = '''<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="title"><b>$75</b></p>

<p id="meiyuan">啦啦啦啦啦啦</p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>'''

soup= BeautifulSoup(html_doc,"lxml")

# 1. String: matches exactly
print(soup.find_all(name="a"))  # every a tag
print(soup.find_all(name="a aa"))  # no match, prints []
print(soup.find_all(attrs={"class":"sister"}))
print(soup.find_all(text="The Dormouse's story"))  # search by text
print(soup.find_all(name="b",text="The Dormouse's story"))  # b tags whose text is The Dormouse's story
print(soup.p.find(name="b").text)  # the text of the b inside the first p
print(soup.find_all(name="p",attrs={"class":"story"}))  # p tags whose class is story
print(soup.find(name="p",attrs={"class":"story"}).find_all(name="a")[2])  # the third a tag (index 2) inside the first p with class story

# 2. Regular expression
import re
print(soup.find_all(name=re.compile("^b")))  # tags whose name starts with b
print(soup.find_all(attrs={"id":re.compile("link")}))  # tags whose id contains link
print(soup.find_all(text=re.compile(r"\$")))  # text that contains a $ price

# 3. List: Beautiful Soup returns anything that matches any item in the list.
print(soup.find_all(name=["a",re.compile("^b")]))  # a tags plus tags whose name starts with b
print(soup.find_all(text=["$",]))  # no match: the text would have to equal "$" exactly
print(soup.find_all(text=[re.compile(r"\$")]))  # ['$75']
print(soup.find_all(text=["a",re.compile(r"\$")]))

# 4. True: matches any value
print(soup.find_all(name=True))  # every tag in the document
print(soup.find_all(attrs={"id":True}))  # every tag that has an id attribute
print(soup.find_all(name="p",attrs={"id":True}))  # p tags that have an id attribute

# 5. Function: if no other filter fits, define a function that takes a single element as its argument and returns True when the element matches and False otherwise.

# Has a class attribute but no id attribute
def has_class_not_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
    # return tag.has_attr('id') and not tag.has_attr('class')
    # return tag.name =="a" and tag.has_attr("class") and not tag.has_attr("id")  # a tags only

print(soup.find_all(has_class_not_id))  # the function is applied to tags
print(soup.find_all(name="a",limit=2))  # stop after the first two a tags
print(soup.body.find_all(attrs={"class":"sister"},recursive=False))  # only direct children of body with class sister
print(soup.html.find_all('a'))
print(soup.html.find_all('a',recursive=False))
# recursive = True   # search all descendants (the default)
# recursive = False  # search only the tag's direct children, without going deeper

# **kwargs
print(soup.find_all(attrs={"class":"sister"}))
print(soup.find_all(class_="sister"))  # same result as the line above
print(soup.find_all(attrs={"id":"link3"}))  # this line and the next are equivalent, just written differently
print(soup.find_all(id="link3"))
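
The five filter kinds above (string, regular expression, list, True, function) can be compared side by side on a trimmed copy of the sample document. This is a sketch: the markup is shortened, html.parser stands in for lxml so no html wrapper is added, and string= is used in place of the older text= keyword.

```python
import re
from bs4 import BeautifulSoup

html = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>$75</b></p>
<p id="meiyuan">text</p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def has_class_not_id(tag):
    """Function filter: True for tags with a class attribute and no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_name = soup.find_all("a")                      # string: exact tag name
by_regex = soup.find_all(re.compile("^b"))        # regex: names starting with b
by_list = soup.find_all(["a", "b"])               # list: any of the names
with_id = soup.find_all(id=True)                  # True: any tag with an id
by_func = soup.find_all(has_class_not_id)         # function filter
prices = soup.find_all(string=re.compile(r"\$"))  # strings, not tags
direct_links = soup.body.find_all("a", recursive=False)  # direct children only

print([t.name for t in by_regex])
print(prices)
print(direct_links)   # empty: the links sit inside <p>, not directly in <body>
```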

find_all( name , attrs , recursive , text , **kwargs )

# 2. find_all( name , attrs , recursive , text , **kwargs )

# 2.1 name: the value can be any kind of filter: a string, a regular expression, a list, a function, or True.
print(soup.find_all(name=re.compile('^t')))

# 2.2 keyword: written as key=value, where value can be a filter: a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'),id=re.compile(r'\d')))  # note: to filter on class, use class_
print(soup.find_all(id=True))  # tags that have an id attribute

# Some attributes cannot be used as keyword arguments, such as the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
# data_soup.find_all(data-foo="value")  # error: SyntaxError: keyword can't be an expression

# They can still be searched by passing a dict to the attrs argument of find_all():
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

# 2.3 Search by class. The keyword is class_ because class is a reserved word in Python; in class_=value, value can be any of the five filter kinds (string, regular expression, list, function, True).
print(soup.find_all('a',class_='sister'))  # a tags with the class sister
print(soup.find_all('a',class_='sister ssss'))  # a tags whose class attribute is exactly "sister ssss"; a different class order does not match
print(soup.find_all(class_=re.compile('^sis')))  # every tag that has a class starting with sis
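
The ordering rule for multi-class strings in 2.3 is easy to get wrong, so here is a small sketch of it. The one-line document is made up for the example, and the last line shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

single = soup.find_all("p", class_="strikeout")        # matches any one of the classes
exact = soup.find_all("p", class_="body strikeout")    # matches the exact attribute string
swapped = soup.find_all("p", class_="strikeout body")  # same classes, different order
css = soup.select("p.strikeout.body")                  # CSS selectors ignore class order

print(len(single), len(exact), len(swapped), len(css))
```

When a tag must carry several classes in any order, select() is usually the simpler choice.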

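# 2.4 attrs
print(soup.find_all('p',attrs={'class':'story'}))

# 2.5 text: the value can be a string, a list, True, or a regular expression (since Beautiful Soup 4.4.0 this argument is named string; text is the older name)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a',text='Elsie'))

# 2.6 limit: searching a very large document tree can be slow. If you do not need every result, use limit to cap the number of results returned.
# It works like the LIMIT keyword in SQL: the search stops and returns as soon as limit results have been found.
print(soup.find_all('a',limit=2))

# 2.7 recursive: find_all() on a tag searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a',recursive=False))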

'''

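Calling a tag like find_all()

find_all() is probably the most frequently used search method in Beautiful Soup, so the library provides a shortcut for it. A BeautifulSoup object or a tag object can be called like a function,
and the result is the same as calling find_all() on that object. The following two lines are equivalent:

soup.find_all("a")

soup("a")

These two lines are also equivalent: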

soup.title.find_all(text=True)

soup.title(text=True)

'''
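
The call shortcut described above can be verified directly. A minimal sketch, using a made-up two-link document, html.parser, and the newer string= keyword in place of text=:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><a>x</a><a>y</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

called = soup("a")                       # calling the object...
searched = soup.find_all("a")            # ...is the same as find_all()
title_strings = soup.title(string=True)  # same as soup.title.find_all(string=True)

print(called == searched)
print(title_strings)
```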

find( name , attrs , recursive , text , **kwargs )

# 3. find( name , attrs , recursive , text , **kwargs )

find_all() returns every tag in the document that matches, but sometimes only one result is wanted.
For example, a document has only one <body> tag, so searching for it with find_all() is a poor fit; rather than calling find_all() with limit=1, call find() directly. The following two lines are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.

When nothing matches, find_all() returns an empty list and find() returns None.

print(soup.find("nosuchtag"))
# None

soup.head.title is shorthand for navigating by tag name. It works by calling find() on the current tag repeatedly:

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>
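
Because find() returns None when nothing matches, chaining attribute access on its result can raise AttributeError. The sketch below shows one way to guard against that; title_text is a hypothetical helper written for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")


def title_text(doc, default=""):
    """Return the <title> text, or default when the document has no title."""
    tag = doc.find("title")
    return tag.get_text() if tag is not None else default


as_list = soup.find_all("title", limit=1)   # a list with one element
single = soup.find("title")                 # the element itself
missing = soup.find("nosuchtag")            # None, not an empty list

print(as_list == [single])
print(missing)
print(title_text(soup))
print(repr(title_text(BeautifulSoup("<p>no title</p>", "html.parser"))))
```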

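3. CSS selectors

This is another search method that achieves the same goal as find_all by different means.

In CSS, a tag name is written without any prefix, a class name is prefixed with . and an id is prefixed with #.
The same syntax can be used here to filter elements through soup.select(), which returns a list.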

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

# @Time    : 2018/3/1 15:24

# @Author  : hyang

# @File    : beautifulsoup_study.py

# @Software: PyCharm

# The module provides the select method to support CSS selectors; see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id37

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title">

    <b>The Dormouse's story</b>

    Once upon a time there were three little sisters; and their names were

    <a href="http://example.com/elsie" class="sister" id="link1">

        <span>Elsie</span>

    </a>

    <a href="http://example.com/lacie" class="sister" id="link2">

    <span>Elsieq</span>Lacie</a> and

    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    <div class='panel-1'>

        <ul class='list' id='list-1'>

            <li class='element'>Foo</li>

            <li class='element'>Bar</li>

            <li class='element'>Jay</li>

        </ul>

        <ul class='list list-small' id='list-2'>

            <li class='element'><h1 class='yyyy'>Foo</h1></li>

            <li class='element xxx'>Bar</li>

            <li class='element'>Jay</li>

        </ul>

    </div>

    and they lived at the bottom of a well.

</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc,'lxml')

# 1. CSS selectors
# print(soup.p.select('.sister')) # all elements with class="sister" under the first p tag
# print(soup.select('.sister span')) # span elements under elements with class="sister"
# print(soup.select('#link1')) # the element with id=link1
# print(soup.select('#link1 span')) # span elements under the element with id=link1

#print(soup.select('#list-2 .element.xxx')) # elements under id=list-2 that carry both classes element and xxx
# print(soup.select('#list-2')[0].select('.element')) # class=element elements under the first element with id=list-2

# Select by attribute value:
print(soup.select('a[href="http://example.com/tillie"]'))

# 2. Get attributes
print(soup.select('#list-2 h1')[0].attrs) # {'class': ['yyyy']}

# 3. Get content
print(soup.select('#list-2 h1')[0].get_text()) # Foo
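
A few selector forms not shown above are often useful: testing only that an attribute exists, matching an attribute prefix, and select_one(), which returns a single tag or None instead of a list. The sketch below uses a trimmed copy of the sample markup and html.parser; the id list-3 is deliberately absent.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-1">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element"><h1 class="yyyy">Foo</h1></li>
    <li class="element xxx">Bar</li>
  </ul>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

both_classes = soup.select("#list-2 .element.xxx")           # both classes required
has_href = soup.select("a[href]")                            # attribute exists
href_prefix = soup.select('a[href^="http://example.com/"]')  # attribute starts with
first_h1 = soup.select_one("#list-2 h1")                     # one tag, not a list
no_match = soup.select_one("#list-3")                        # None when nothing matches

print(both_classes[0].get_text())
print(first_h1.attrs)
print(no_match)
```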

Example: filtering ads out of Baidu search results

import requests
import re
from bs4 import BeautifulSoup
from lxml import etree  # only needed for the commented-out XPath variant

param={'wd':'python'}

# Pass the query string to the URL through params
response = requests.get('http://www.baidu.com/s', params=param)
soup = BeautifulSoup(response.text,'lxml')
#soup = etree.HTML(response.text)

# Keep only the organic result blocks, identified by their class attribute
search_con = soup.find_all(attrs={"class":re.compile("result c-container ")})
#search_con = soup.xpath('//div[@class="result c-container "]')

for item in search_con:
    # print(item.xpath('h3/a/@href'))
    # print(item.xpath('h3/a')[0].xpath('string(.)'))
    print(item.select('h3 a')[0].text)  # e.g. Python3 教程 | 菜鸟教程
    print(item.select('h3 a')[0].attrs['href']) # e.g. http://www.baidu.com/link?url=eJq5lIwp3TCnmNRomh62ctEtzncSG_
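
The filtering step can be tested without any network access by running it on a saved or hand-written page. The sketch below is not the code above: it swaps the regular expression for a CSS selector, the helper extract_results and the sample markup are hypothetical, and the class names result and c-container are taken from the example and may not match what Baidu serves today.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a results page; real pages will differ.
SAMPLE = """
<div class="result c-container new-pmd">
  <h3><a href="http://www.baidu.com/link?url=abc">Python tutorial</a></h3>
</div>
<div class="ad-block"><h3><a href="http://ad.example/">Sponsored</a></h3></div>
<div class="result c-container"><p>no heading here</p></div>
"""


def extract_results(html):
    """Return (title, href) pairs for blocks carrying both assumed classes."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("div.result.c-container"):
        link = item.select_one("h3 a")
        if link is None:   # skip blocks without a heading link
            continue
        results.append((link.get_text(strip=True), link.get("href")))
    return results


for title, href in extract_results(SAMPLE):
    print(title, href)
```

Checking for a missing heading link avoids the IndexError that indexing select('h3 a')[0] would raise on such a block.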

Example: scraping every chapter title and chapter text of Romance of the Three Kingdoms

#!/usr/bin/python

# -*- coding: utf-8 -*-

# @Time    : 2019/12/12 14:16

# @Author  : hyang

# @File    : bs4_study.py

# @Software: PyCharm

# Use bs4 to scrape every chapter of Romance of the Three Kingdoms from the shicimingju site and store it on local disk
# http://www.shicimingju.com/book/sanguoyanyi.html

import requests
from bs4 import BeautifulSoup

# Goal: scrape all chapter titles and chapter texts from http://www.shicimingju.com/book/sanguoyanyi.html

if __name__ == "__main__":
    # Fetch the index page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text

    # Parse the chapter titles and the detail-page URLs out of the index page
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')

    # Each li holds one chapter title and the link to its detail page
    li_list = soup.select('.book-mulu > ul > li')

    # The with block closes the output file even if a request fails midway
    with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.string
            detail_url = 'http://www.shicimingju.com' + li.a['href']

            # Request the detail page and parse out the chapter text
            detail_page_text = requests.get(url=detail_url, headers=headers).text
            detail_soup = BeautifulSoup(detail_page_text, 'lxml')
            div_tag = detail_soup.find('div', class_='chapter_content')

            # The chapter text
            content = div_tag.text
            fp.write(title + ':' + content + '\n')
            print(title, 'scraped successfully!')
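
Separating the parsing from the network requests makes a scraper like this easier to test. The sketch below is one possible arrangement, not the script above: parse_chapter_links and parse_chapter_text are hypothetical helpers, the two sample strings are made up, and the selectors .book-mulu and chapter_content are the ones used in the example and may no longer match the live site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.shicimingju.com/book/sanguoyanyi.html"


def parse_chapter_links(index_html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs from the index page markup."""
    soup = BeautifulSoup(index_html, "html.parser")
    links = []
    for a in soup.select(".book-mulu > ul > li > a[href]"):
        # urljoin resolves relative hrefs against the index page URL.
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


def parse_chapter_text(detail_html):
    """Return the chapter text, or an empty string if the block is missing."""
    soup = BeautifulSoup(detail_html, "html.parser")
    div = soup.find("div", class_="chapter_content")
    return div.get_text().strip() if div is not None else ""


# Hand-written samples standing in for downloaded pages.
index_sample = (
    '<div class="book-mulu"><ul>'
    '<li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>'
    '<li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>'
    "</ul></div>"
)
detail_sample = '<div class="chapter_content"> Line one.<br/>Line two. </div>'

for title, link in parse_chapter_links(index_sample):
    print(title, link)
print(parse_chapter_text(detail_sample))
```

In a real run the two helpers would receive the text of requests.get(...) responses, and the None check keeps one malformed chapter page from stopping the whole job.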
