Using Beautiful Soup

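Beautiful Soup is simple, practical, and reasonably full-featured. Until now I have pulled information out of downloaded pages with hand-written XPath; from here on I can use Beautiful Soup for simple parsing jobs, which saves a lot of effort.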

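Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it is typically used to analyze web documents fetched by a crawler. It can also fill in much of what is missing from irregular HTML, which saves developers time and effort.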

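The official Beautiful Soup documentation is thorough; working through its examples once is enough to get comfortable with the library. It is available in both English and Chinese.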

I. Installing Beautiful Soup

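Installing BeautifulSoup is simple: download the BeautifulSoup source, unpack it, and run

python setup.py install

To check that the installation worked, type import BeautifulSoup at the Python prompt. If no exception is raised, the installation succeeded.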
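That check can also be scripted. The sketch below is my own illustration, not something from the Beautiful Soup documentation: the helper name first_importable is hypothetical, and it assumes the library may be installed either as the BeautifulSoup module used in this article or as bs4, the package name used by Beautiful Soup 4.

```python
import importlib


def first_importable(candidates):
    """Return the first module name in candidates that imports cleanly, else None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except Exception:
            # Covers ImportError as well as errors raised at import time
            # by a copy of the package that does not fit this interpreter.
            continue
        return name
    return None


# "BeautifulSoup" is the module name used in this article; "bs4" is an
# assumption for readers who installed Beautiful Soup 4 instead.
found = first_importable(["BeautifulSoup", "bs4"])
print(found or "Beautiful Soup is not installed")
```

Trying several names in order keeps the check useful on machines where only one of the two packages is present.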

II. Using BeautifulSoup

1. Importing BeautifulSoup and creating a BeautifulSoup object

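from BeautifulSoup import BeautifulSoup           # for processing HTML

from BeautifulSoup import BeautifulStoneSoup      # for processing XML

# Alternative: "import BeautifulSoup" imports the whole module; the class is then BeautifulSoup.BeautifulSoup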

doc = [

    '<html><head><title>Page title</title></head>',

    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',

    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',

    '</html>'

]

# BeautifulSoup takes a single string argument

soup = BeautifulSoup(''.join(doc))
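The module name in this article belongs to Beautiful Soup 3, which targets Python 2. For readers on Python 3, here is a rough equivalent. It assumes Beautiful Soup 4 is installed as the bs4 package and uses the standard library's html.parser backend; both are my assumptions, not part of the original example. I also close the p and body tags explicitly, because different backends repair unclosed tags in different ways.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module above

doc = [
    '<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
    '</body></html>',
]

# bs4 also takes the markup as a single string. Naming the parser
# explicitly avoids depending on which optional parsers are installed.
soup = BeautifulSoup(''.join(doc), 'html.parser')

print(soup.title.string)  # the text inside <title>
```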

2. The BeautifulSoup objects

When BeautifulSoup parses an HTML document, it treats the document as a tree, much like a DOM tree. The BeautifulSoup document tree has three basic kinds of object.

2.1. The soup object: BeautifulSoup.BeautifulSoup

type(soup)

<class 'BeautifulSoup.BeautifulSoup'>

2.2. Tags: BeautifulSoup.Tag

type(soup.html)

<class 'BeautifulSoup.Tag'>

2.3. Text: BeautifulSoup.NavigableString

type(soup.title.string)

<class 'BeautifulSoup.NavigableString'>
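The same three kinds of object can be checked with isinstance rather than by reading type() output. This is a sketch under the assumption that Beautiful Soup 4 (the bs4 package) is in use, where the classes live in bs4 and bs4.element instead of the BeautifulSoup module shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    '<html><head><title>Page title</title></head></html>', 'html.parser')

print(isinstance(soup, BeautifulSoup))                 # the top-level object
print(isinstance(soup.html, Tag))                      # a tag
print(isinstance(soup.title.string, NavigableString))  # a piece of text

# In bs4, NavigableString is a subclass of str, so text nodes can be
# passed to code that expects ordinary strings.
print(isinstance(soup.title.string, str))
```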

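3. The BeautifulSoup parse tree

3.1 Methods of BeautifulSoup.Tag objects

Getting Tag objects

Access by tag name: append a tag name to the soup object and you get back a Tag object. This is most useful when the tag is unique in the document, since only the first match is returned. Alternatively, follow the structure of the tree and select level by level.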

>>> html = soup.html

>>> html

<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>

>>> type(html)

<class 'BeautifulSoup.Tag'>

>>> title = soup.title

>>> title

<title>Page title</title>

The contents attribute

contents holds a node's direct children as a list; the entries may be Tag objects or NavigableString objects.

>>> soup.contents

[<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>]

>>> soup.contents[0].contents

[<head><title>Page title</title></head>, <body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>]

>>> len(soup.contents[0].contents)

2

>>> type(soup.contents[0].contents[1])

<class 'BeautifulSoup.Tag'>

Use contents to walk down the tree and parent to walk back up.
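To make the two directions concrete, here is a small sketch, again assuming Beautiful Soup 4 (bs4) with the html.parser backend instead of the BS3 module used in the session above. The markup is a shortened, fully closed version of the sample document.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara">one</p><p id="secondpara">two</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Down: contents lists direct children only, here <head> and <body>.
children = soup.html.contents
print([child.name for child in children])

# Up: parent moves back one level.
body = children[1]
print(body.parent.name)

# Follow parent links from a deep node until the root is passed.
node = soup.p
path = []
while node is not None:
    path.append(node.name)
    node = node.parent
print(path)
```

The last entry in path is the soup object itself, which bs4 names '[document]'; its parent is None, which ends the loop.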

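The next attribute

next returns the object that was parsed immediately after this one, which may be a Tag or a NavigableString. For a tag that has children, that is its first child.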

>>> head = soup.head

>>> head.next

<title>Page title</title>

>>> head.next.next

u'Page title'

>>> p1 = soup.p

>>> p1

<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

>>> p1.next

u'This is paragraph '

nextSibling returns the next sibling, which may be a Tag or a NavigableString.

>>> head.nextSibling

<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>

>>> p1.next.nextSibling

<b>one</b>

The counterpart of nextSibling is previousSibling, which returns the previous sibling node.
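For comparison, the same navigation in Beautiful Soup 4 uses the names next_element, next_sibling and previous_sibling. The sketch assumes bs4 with the html.parser backend and a shortened version of the sample document; it is not the BS3 session shown above.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara">This is paragraph <b>one</b>.</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

head = soup.head
print(repr(head.next_element))               # the <title> tag, head's first child
print(repr(head.next_element.next_element))  # the text inside <title>
print(head.next_sibling.name)                # the tag that follows </head>

first_text = soup.p.next_element             # the text before <b>
print(repr(first_text))
print(first_text.next_sibling.name)          # the tag that follows that text
print(soup.body.previous_sibling.name)       # the tag that precedes <body>
```

next_element follows parse order, so it can step down into a child, while the sibling attributes stay on one level of the tree.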

The replaceWith method

replaceWith takes an object out of the tree and puts another in its place; here it is given a string.

>>> head = soup.head

>>> head

<head><title>Page title</title></head>

>>> head.parent

<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>

>>> head.replaceWith('head was replace')

>>> head

<head><title>Page title</title></head>

>>> head.parent

>>> soup

<html>head was replace<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>
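The session shows that the old head object survives the replacement but no longer has a parent. A short sketch of the same behaviour, assuming Beautiful Soup 4, where the method is spelled replace_with:

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p>text</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

head = soup.head
head.replace_with('head was replaced')

# The old tag still exists as a Python object, but it is detached.
print(head.parent)             # None
print(soup.html.contents[0])   # the replacement string sits where <head> was
print(soup.head)               # None: the tree no longer contains a <head>
```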

>>>

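Search methods

There are two search methods: findAll returns every match and find returns only the first. Both are available only on Tag objects and on the top-level parser object; NavigableString objects do not have them.

findAll(name, attrs, recursive, text, limit, **kwargs)

Passing a single argument, the tag name

Find every p tag in the document; the result is list-like: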

>>> soup.findAll('p')

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

>>> type(soup.findAll('p'))

<class 'BeautifulSoup.ResultSet'>

Find the p tag with id="firstpara"; the result is a ResultSet

>>> pid = type(soup.findAll('p',id='firstpara'))

>>> pid

<class 'BeautifulSoup.ResultSet'>

Passing one or more attribute/value pairs as a dictionary

>>> p2 = soup.findAll('p',{'align':'blah'})

>>> p2

[<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

>>> type(p2)

<class 'BeautifulSoup.ResultSet'>

Using a regular expression

>>> import re

>>> soup.findAll(id=re.compile("para$"))

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
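The four search styles above can be put side by side. This sketch assumes Beautiful Soup 4 (bs4), where findAll is spelled find_all, and uses the html.parser backend; the variable names are mine.

```python
import re

from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find_all('p')                      # every <p>
by_keyword = soup.find_all('p', id='firstpara')   # keyword argument filters on an attribute
by_dict = soup.find_all('p', {'align': 'blah'})   # attribute dictionary
by_regex = soup.find_all(id=re.compile('para$'))  # any tag whose id ends with "para"
first_only = soup.find('p')                       # first match, or None if there is none

for label, result in [('name', by_name), ('keyword', by_keyword),
                      ('dict', by_dict), ('regex', by_regex)]:
    print(label, [tag['id'] for tag in result])
print('find', first_only['id'])
```

find_all always returns a list-like result, even for a single match, so index into it or use find when only one tag is wanted.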

Reading and modifying attributes

>>> p1 = soup.p

>>> p1

<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

>>> p1['id']

u'firstpara'

>>> p1['id'] = 'changeid'

>>> p1

<p id="changeid" align="center">This is paragraph <b>one</b>.</p>

>>> p1['class'] = 'new class'

>>> p1

<p id="changeid" align="center" class="new class">This is paragraph <b>one</b>.</p>
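Tags behave like dictionaries for their attributes, so the usual dictionary operations apply. A minimal sketch, assuming Beautiful Soup 4 (bs4); the title attribute is only there for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="firstpara" align="center">one</p>', 'html.parser')
p1 = soup.p

print(p1['id'])            # read like a dictionary
p1['id'] = 'changeid'      # overwrite an existing attribute
p1['title'] = 'added'      # create a new one
del p1['align']            # remove one

print(p1.get('align'))     # get() returns None instead of raising KeyError
print(sorted(p1.attrs))    # names of the attributes still present
```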

>>>

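Those are the basic parse tree methods. There are others, along with more ways to combine them with regular expressions; see the official documentation for details.

3.2 Methods of BeautifulSoup.NavigableString objects

NavigableString objects are simpler; what you mainly do with them is read their content.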

>>> soup.title

<title>Page title</title>

>>> title = soup.title.next

>>> title

u'Page title'

>>> type(title)

<class 'BeautifulSoup.NavigableString'>

>>> title.string

u'Page title'
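Since a NavigableString is a string with tree links attached, ordinary string handling works on it. The sketch below assumes Beautiful Soup 4 (bs4) on Python 3, where NavigableString subclasses str.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title>Page title</title>', 'html.parser')
text = soup.title.string

print(text.upper())          # ordinary str methods work
print(text.parent.name)      # the string still knows its enclosing tag

plain = str(text)            # plain copy with no link back to the tree
print(type(plain).__name__)
```

Converting to a plain str is worth doing when the text is kept after the soup is no longer needed, because the NavigableString holds a reference to the tree it came from.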

For how to traverse the tree to analyze a document, and for how to parse XML documents, see the official documentation.
