9.3.4 BeautifulSoup4

BeautifulSoup is an excellent Python extension library for extracting the data we are interested in from HTML or XML files, and it lets you choose which parser to use.

Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.

The session below briefly demonstrates what BeautifulSoup4 can do. For more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

 >>> from bs4 import BeautifulSoup

 >>>

 >>> # Missing tags are added and completed automatically

 >>> BeautifulSoup('hello world','lxml')

 <html><body><p>hello world</p></body></html>
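The wrapping tags in the output above come from the lxml parser rather than from BeautifulSoup itself. As a minimal sketch (assuming lxml may or may not be installed; the helper name `parse_with` is made up for illustration), the same string can be parsed with the built-in html.parser for comparison:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Hypothetical helper: parse markup with the named parser and return
    the serialized result, or None if that parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# The standard-library parser leaves a bare string as it is.
plain = parse_with('hello world', 'html.parser')
print(plain)

# lxml wraps the fragment in <html><body><p>; None means lxml is missing.
wrapped = parse_with('hello world', 'lxml')
print(wrapped)
```

Because different parsers repair incomplete markup differently, it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.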

 >>>

 >>> # Define a sample HTML document

 >>> html_doc = """

 <html><head><title>The Dormouse's story</title></head>

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and

 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 """

 >>>

 >>> # Parse the HTML document and pretty-print it

 >>> soup = BeautifulSoup(html_doc,'html.parser')

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

     Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>> # Access a specific tag

 >>> soup.title

 <title>The Dormouse's story</title>

 >>>

 >>> # Tag name

 >>> soup.title.name

 'title'

 >>>

 >>> # Tag text

 >>> soup.title.text

 "The Dormouse's story"

 >>>

 >>> # Parent of the title tag

 >>> soup.title.parent

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.head

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.b

 <b>The Dormouse's story</b>

 >>>

 >>> soup.b.name

 'b'

 >>> soup.b.text

 "The Dormouse's story"

 >>>

 >>> # The BeautifulSoup object itself can be treated as a tag object

 >>> soup.name

 '[document]'

 >>>

 >>> soup.body

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 </body>

 >>>

 >>> soup.p

 <p class="title"><b>The Dormouse's story</b></p>

 >>>

 >>> # Tag attributes

 >>> soup.p['class']

 ['title']

 >>>

 >>> soup.p.get('class')         # Attributes can also be read this way

 ['title']
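The two access styles shown above differ when an attribute is absent. A small sketch on a made-up one-line document (not the html_doc used in this session):

```python
from bs4 import BeautifulSoup

# Illustrative markup only; the <p> tag has a class but no id.
soup = BeautifulSoup('<p class="title big">text</p>', 'html.parser')
tag = soup.p

# class is a multi-valued attribute, so it comes back as a list.
classes = tag['class']
print(classes)

# get() returns None (or a supplied default) for a missing attribute...
missing = tag.get('id')
fallback = tag.get('id', 'no-id')
print(missing, fallback)

# ...whereas indexing with [] raises KeyError.
try:
    tag['id']
    raised = False
except KeyError:
    raised = True
print(raised)
```

When scraping pages whose tags may lack an attribute, get() avoids having to wrap every lookup in a try block.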

 >>>

 >>> soup.p.text

 "The Dormouse's story"

 >>>

 >>> soup.p.contents

 [<b>The Dormouse's story</b>]

 >>>

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 >>>

 >>> # All attributes of the a tag

 >>> soup.a.attrs

 {'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}

 >>>

 >>> # Find all a tags

 >>> soup.find_all('a')

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> # Find <a> and <b> tags together

 >>> soup.find_all(['a','b'])

 [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> import re

 >>> # Find tags whose href contains a given keyword

 >>> soup.find_all(href=re.compile("elsie"))

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 >>>

 >>> soup.find(id='link3')

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 >>>

 >>> soup.find_all('a',id='link3')

 [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
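The calls above all find a match. As a sketch of what happens when nothing matches (using a made-up two-link document and a non-existent id, link9, chosen for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup only.
html = '<a id="link1" href="/a">A</a><a id="link2" href="/b">B</a>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')                   # first matching tag only
all_links = soup.find_all('a')           # every matching tag, list-like
nothing = soup.find(id='link9')          # no match: None
empty = soup.find_all('a', id='link9')   # no match: empty result

print(first['id'], len(all_links), nothing, len(empty))
```

Since find() can return None, its result should be checked before accessing attributes on it.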

 >>>

 >>> for link in soup.find_all('a'):

     print(link.text,':',link.get('href'))

 Elsie : http://example.com/elsie

 Lacie : http://example.com/lacie

 Tillie : http://example.com/tillie
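Outside the interactive prompt, the same loop is often wrapped in a function that returns structured data. A minimal sketch, assuming a shortened copy of the sample document; the helper name `extract_links` is hypothetical:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, for illustration.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""


def extract_links(markup):
    """Hypothetical helper: return one dict per <a> tag in the markup."""
    soup = BeautifulSoup(markup, 'html.parser')
    return [
        {'text': a.get_text(strip=True), 'href': a.get('href')}
        for a in soup.find_all('a')
    ]


links = extract_links(html_doc)
for link in links:
    print(link['text'], ':', link['href'])
```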

 >>>

 >>> print(soup.get_text())           # Return all text

 The Dormouse's story

 The Dormouse's story

 Once upon a time there were three little sisters;and their names were

 Elsie,

 Lacieand

 Tillie;

 and they lived at the bottom of a well.

 ...
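The raw text above keeps every newline and runs adjacent pieces together. get_text() also accepts a separator and a strip flag; a small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup only.
soup = BeautifulSoup('<p>Hello <b>world</b>!</p>\n<p> bye </p>', 'html.parser')

# Default: text pieces are concatenated exactly as they appear.
raw = soup.get_text()
print(repr(raw))

# With a separator and strip=True, each piece is stripped,
# whitespace-only pieces are dropped, and the rest are joined.
joined = soup.get_text('|', strip=True)
print(joined)
```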

 >>>

 >>> # Modify a tag attribute

 >>> soup.a['id']='test_link1'

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

 >>>

 >>> # Modify a tag's text

 >>> soup.a.string.replace_with('test_Elsie')

 'Elsie'

 >>>

 >>> soup.a.string

 'test_Elsie'

 >>>

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="test_link1">

     test_Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>>

 >>> # Iterate over child nodes

 >>> for child in soup.body.children:

     print(child)

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 >>>
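Note that .children yields every direct child node, including the newline strings between tags, not only tags. A sketch of keeping just the tags, on a made-up document parsed with html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup only; the newlines between tags become string nodes.
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, 'html.parser')

all_children = list(soup.body.children)
# Keep only real tags, skipping the whitespace strings in between.
tag_children = [c for c in all_children if isinstance(c, Tag)]

print(len(all_children), len(tag_children))
print([t.name for t in tag_children])
```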
