9.3.4 BeautifulSoup4

BeautifulSoup is an excellent third-party Python library for extracting the data you are interested in from HTML or XML files, and it lets you choose which parser to use.

Install the module with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
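The parser is selected by the second argument to the BeautifulSoup constructor. As a minimal sketch, assuming only the built-in 'html.parser' is used (other parsers such as 'lxml' are separate installs), the following shows how a fragment with an unclosed tag is repaired. The fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: only 'html.parser' is used, so nothing beyond
# beautifulsoup4 itself needs to be installed.
fragment = "<p>hello world"          # hypothetical input, closing tag missing

soup = BeautifulSoup(fragment, "html.parser")

# html.parser closes the open <p> but does not wrap the
# fragment in <html>/<body> elements.
repaired = str(soup)
print(repaired)
```

Different parsers can repair the same markup differently, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.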

The session below briefly demonstrates what BeautifulSoup4 can do. For more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

 >>> from bs4 import BeautifulSoup

 >>>

 >>> # Missing tags are added and completed automatically

 >>> BeautifulSoup('hello world','lxml')

 <html><body><p>hello world</p></body></html>

 >>>

 >>> # Define an HTML document of our own

 >>> html_doc = """

 <html><head><title>The Dormouse's story</title></head>

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and

 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 """

 >>>

 >>> # Parse the HTML document and display it in a nicely indented form

 >>> soup = BeautifulSoup(html_doc,'html.parser')

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

     Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>> # Access a specific tag

 >>> soup.title

 <title>The Dormouse's story</title>

 >>>

 >>> # Tag name

 >>> soup.title.name

 'title'

 >>>

 >>> # Tag text

 >>> soup.title.text

 "The Dormouse's story"

 >>>

 >>> # Parent of the title tag

 >>> soup.title.parent

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.head

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.b

 <b>The Dormouse's story</b>

 >>>

 >>> soup.b.name

 'b'

 >>> soup.b.text

 "The Dormouse's story"

 >>>

 >>> # The BeautifulSoup object as a whole can be treated as a tag object

 >>> soup.name

 '[document]'

 >>>

 >>> soup.body

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 </body>

 >>>

 >>> soup.p

 <p class="title"><b>The Dormouse's story</b></p>

 >>>

 >>> # Tag attributes

 >>> soup.p['class']

 ['title']

 >>>

 >>> soup.p.get('class')         # attributes can also be read this way

 ['title']
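The two access styles differ when an attribute is absent. A minimal sketch, using a made-up one-line document, of how square brackets and .get() behave:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to illustrate attribute access.
soup = BeautifulSoup('<p class="title big" id="intro">text</p>', "html.parser")
p = soup.p

# class is multi-valued, so it comes back as a list of names.
classes = p["class"]
# Single-valued attributes such as id come back as plain strings.
ident = p["id"]

# .get() returns None (or a supplied default) for a missing attribute...
missing = p.get("href")
fallback = p.get("href", "#")

# ...while square brackets raise KeyError.
try:
    p["href"]
    raised = False
except KeyError:
    raised = True

print(classes, ident, missing, fallback, raised)
```

.get() is usually the safer choice when scraping pages where an attribute may not be present on every tag.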

 >>>

 >>> soup.p.text

 "The Dormouse's story"

 >>>

 >>> soup.p.contents

 [<b>The Dormouse's story</b>]

 >>>

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 >>>

 >>> # View all attributes of the a tag

 >>> soup.a.attrs

 {'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}

 >>>

 >>> # Find all a tags

 >>> soup.find_all('a')

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> # Find <a> and <b> tags in one call

 >>> soup.find_all(['a','b'])

 [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> import re

 >>> # Find tags whose href contains a given keyword

 >>> soup.find_all(href=re.compile("elsie"))

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 >>>

 >>> soup.find(id='link3')

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 >>>

 >>> soup.find_all('a',id='link3')

 [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
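find() and find_all() take the same filters but return different things, which matters when nothing matches. The sketch below uses a shortened, made-up version of the three links to illustrate this, along with the limit argument:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened markup based on the three sister links.
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                  # first matching tag only
no_table = soup.find("table")           # None when nothing matches
no_tables = soup.find_all("table")      # empty list when nothing matches
two = soup.find_all("a", limit=2)       # stop after two matches

# A compiled regex is matched against the attribute value with search().
some = soup.find_all("a", href=re.compile("lacie|tillie"))

print(first["id"], no_table, no_tables, [t["id"] for t in two], [t["id"] for t in some])
```

Because find() can return None, its result should be checked before reading attributes from it.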

 >>>

 >>> for link in soup.find_all('a'):

     print(link.text,':',link.get('href'))

 Elsie : http://example.com/elsie

 Lacie : http://example.com/lacie

 Tillie : http://example.com/tillie
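The same loop can be packaged as a reusable function in a script. This is only a sketch: extract_links is a hypothetical helper name, and the sample markup is made up.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only the tags that actually carry the attribute.
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


sample = '<a href="http://example.com/elsie"> Elsie </a><a name="top">anchor</a>'
links = extract_links(sample)
print(links)
```

Filtering with href=True avoids a KeyError on anchors that have no link target.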

 >>>

 >>> print(soup.get_text())           # return all text

 The Dormouse's story

 The Dormouse's story

 Once upon a time there were three little sisters;and their names were

 Elsie,

 Lacieand

 Tillie;

 and they lived at the bottom of a well.

 ...
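get_text() keeps the whitespace of the source markup, which is why the output above contains line breaks. A minimal sketch, on a made-up fragment, of the separator and strip arguments and of how .string differs from get_text():

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the kind of whitespace found in real pages.
soup = BeautifulSoup("<p>\n  Once upon a time\n  <b>Elsie</b>\n</p>", "html.parser")

raw = soup.get_text()                      # whitespace preserved as-is
tidy = soup.get_text(" ", strip=True)      # strip each piece, join with a space
pieces = list(soup.stripped_strings)       # the stripped pieces themselves

# .string is only defined when a tag has a single child.
b_string = soup.b.string                   # the <b> tag has one string child
p_string = soup.p.string                   # None: <p> has several children

print(repr(raw), repr(tidy), pieces, b_string, p_string)
```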

 >>>

 >>> # Modify a tag attribute

 >>> soup.a['id']='test_link1'

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

 >>>

 >>> # Modify the tag text

 >>> soup.a.string.replace_with('test_Elsie')

 'Elsie'

 >>>

 >>> soup.a.string

 'test_Elsie'
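Tags behave much like dictionaries for their attributes, so attributes can also be added and deleted, not just overwritten. A small sketch on a made-up link (the target attribute and the new text are illustrative choices):

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to illustrate in-place edits.
soup = BeautifulSoup('<p><a href="/elsie" id="link1">Elsie</a></p>', "html.parser")
link = soup.a

link["href"] = "http://example.com" + link["href"]   # change an attribute
link["target"] = "_blank"                            # add a new attribute
del link["id"]                                       # remove an attribute
link.string = "Elsie (updated)"                      # replace the tag's text

print(soup)
```

All of these edits change the parsed tree in place, so they show up the next time the document is printed or serialized with str(soup).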

 >>>

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="test_link1">

     test_Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>>

 >>> # Iterate over child tags

 >>> for child in soup.body.children:

     print(child)

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
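Note that .children yields every direct child node, including the newline strings between tags, not only Tag objects. A minimal sketch, on a made-up body, of two ways to keep just the tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the paragraphs.
soup = BeautifulSoup("<body>\n<p>one</p>\n<p>two</p>\n</body>", "html.parser")

# Every direct child: three newline strings plus two <p> tags.
all_children = list(soup.body.children)

# Option 1: filter by type.
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

# Option 2: find_all(True) matches any tag; recursive=False limits
# the search to direct children.
direct_tags = soup.body.find_all(True, recursive=False)

print(len(all_children), [t.get_text() for t in tag_children])
```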

 >>>
