Python BeautifulSoup 简单笔记

Beautiful Soup 是用 Python 写的一个 HTML/XML 的解析器，它可以很好的处理不规范标记并生成剖析树。通常用来分析爬虫抓取的web文档。对于不规则的 Html文档，也有很多的补全功能，节省了开发者的时间和精力。

Beautiful Soup 的官方文档齐全，将官方给出的例子实践一遍就能掌握。官方英文文档，中文文档

一安装 Beautiful Soup

安装 BeautifulSoup 很简单，下载 BeautifulSoup 源码。解压运行

python setup.py install 即可。

测试安装是否成功。键入 import BeautifulSoup 如果没有异常，即成功安装

二使用 BeautifulSoup

1. 导入BeautifulSoup ，创建BeautifulSoup 对象

from BeautifulSoup import BeautifulSoup           # HTML

from BeautifulSoup import BeautifulStoneSoup      # XML

import BeautifulSoup                              # ALL

doc = [

    '<html><head><title>Page title</title></head>',

    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',

    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',

    '</html>'

]

# BeautifulSoup 接受一个字符串参数

soup = BeautifulSoup(''.join(doc))

2. BeautifulSoup对象简介

用BeautifulSoup 解析 html文档时，BeautifulSoup将 html文档类似 dom文档树一样处理。BeautifulSoup文档树有三种基本对象。

2.1. soup BeautifulSoup.BeautifulSoup

type(soup)

<class 'BeautifulSoup.BeautifulSoup'>

2.2. 标记 BeautifulSoup.Tag

type(soup.html)

<class 'BeautifulSoup.Tag'>

2.3 文本 BeautifulSoup.NavigableString

type(soup.title.string)

<class 'BeautifulSoup.NavigableString'>

3. BeautifulSoup 剖析树

3.1 BeautifulSoup.Tag对象方法

获取标记对象（Tag）

标记名获取法，直接用 soup对象加标记名，返回 tag对象.这种方式，选取唯一标签的时候比较有用。或者根据树的结构去选取，一层层的选择

>>> html = soup.html

>>> html

<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>

>>> type(html)

<class 'BeautifulSoup.Tag'>

>>> title = soup.title

<title>Page title</title>

content方法

content方法根据文档树进行搜索，返回标记对象（tag）的列表

>>> soup.contents

[<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>]

>>> soup.contents[0].contents

[<head><title>Page title</title></head>, <body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body>]

>>> len(soup.contents[0].contents)

2

>>> type(soup.contents[0].contents[1])

<class 'BeautifulSoup.Tag'>

使用contents向后遍历树，使用parent向前遍历树

next 方法

获取树的子代元素，包括 Tag 对象和 NavigableString 对象。。。

>>> head.next

<title>Page title</title>

>>> head.next.next

u'Page title'

>>> p1 = soup.p

>>> p1

<p id="firstpara" align="center">This is paragraph<b>one</b>.</p>

>>> p1.next

u'This is paragraph'

nextSibling 下一个兄弟对象包括 Tag 对象和 NavigableString 对象

>>> head.nextSibling

<body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body>

>>> p1.next.nextSibling

<b>one</b>

与 nextSibling 相似的是 previousSibling，即上一个兄弟节点。

replacewith方法

将对象替换为，接受字符串参数

>>> head = soup.head

>>> head

<head><title>Page title</title></head>

>>> head.parent

<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>

>>> head.replaceWith('head was replace')

>>> head

<head><title>Page title</title></head>

>>> head.parent

>>> soup

<html>head was replace<body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>

>>>

搜索方法

搜索提供了两个方法，一个是 find，一个是findAll。这里的两个方法(findAll和 find)仅对Tag对象以及，顶层剖析对象有效，但 NavigableString不可用。

`findAll(`name, attrs, recursive, text, limit, **kwargs)

接受一个参数，标记名

寻找文档所有 P标记，返回一个列表

>>> soup.findAll('p')

[<p id="firstpara" align="center">This is paragraph<b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph<b>two</b>.</p>]

>>> type(soup.findAll('p'))

<type 'list'>

寻找 id="secondpara"的 p 标记，返回一个结果集

>>> pid = type(soup.findAll('p',id='firstpara'))

>>> pid

<class 'BeautifulSoup.ResultSet'>

传一个属性或多个属性对

>>> p2 = soup.findAll('p',{'align':'blah'})

>>> p2

[<p id="secondpara" align="blah">This is paragraph<b>two</b>.</p>]

>>> type(p2)

<class 'BeautifulSoup.ResultSet'>

利用正则表达式

>>> soup.findAll(id=re.compile("para$"))

[<p id="firstpara" align="center">This is paragraph<b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph<b>two</b>.</p>]

读取和修改属性

>>> p1 = soup.p

>>> p1

<p id="firstpara" align="center">This is paragraph<b>one</b>.</p>

>>> p1['id']

u'firstpara'

>>> p1['id'] = 'changeid'

>>> p1

<p id="changeid" align="center">This is paragraph<b>one</b>.</p>

>>> p1['class'] = 'new class'

>>> p1

<p id="changeid" align="center" class="new class">This is paragraph<b>one</b>.</p>

>>>

剖析树基本方法就这些，还有其他一些，以及如何配合正则表达式。具体请看官方文档

3.2 BeautifulSoup.NavigableString对象方法

NavigableString 对象方法比较简单，获取其内容

>>> soup.title

<title>Page title</title>

>>> title = soup.title.next

>>> title

u'Page title'

>>> type(title)

<class 'BeautifulSoup.NavigableString'>

>>> title.string

u'Page title'

至于如何遍历树，进而分析文档，已经 XML 文档的分析方法，可以参考官方文档。

Python BeautifulSoup 简单笔记的更多相关文章

Python学习笔记2-flask-sqlalchemy 简单笔记
flask-sqlalchemy 简单笔记字数阅读评论喜欢 flask-sqlalchemy SQLAlchemy已经成为了python世界里面orm的标准,flask是一个轻巧的web框架, ...
【原】Learning Spark (Python版) 学习笔记(三)----工作原理、调优与Spark SQL
周末的任务是更新Learning Spark系列第三篇,以为自己写不完了,但为了改正拖延症,还是得完成给自己定的任务啊 = =.这三章主要讲Spark的运行过程(本地+集群),性能调优以及Spark ...
《简明python教程》笔记一
读<简明Python教程>笔记: 本书的官方网站是www.byteofpython.info 安装就不说了,网上很多,这里就记录下我在安装时的问题,首先到python官网下载,选好安装路 ...
Python开发简单爬虫 - 慕课网
课程链接:Python开发简单爬虫环境搭建: Eclipse+PyDev配置搭建Python开发环境 Python入门基础教程用Eclipse编写Python程序课程目录第1章课程介绍 ...
python核心编程--笔记
python核心编程--笔记的解释器options: 1.1 –d 提供调试输出 1.2 –O 生成优化的字节码(生成.pyo文件) 1.3 –S 不导入site模块以在启动时查找pyt ...
Python Click 学习笔记（转）
原文链接:Python Click 学习笔记 Click 是 Flask 的团队 pallets 开发的优秀开源项目,它为命令行工具的开发封装了大量方法,使开发者只需要专注于功能实现.恰好我最近在开发 ...
Python源代码剖析笔记3-Python运行原理初探
Python源代码剖析笔记3-Python执行原理初探本文简书地址:http://www.jianshu.com/p/03af86845c95 之前写了几篇源代码剖析笔记,然而慢慢觉得没有从一个宏观 ...
Python学习基础笔记（全）
换博客了,还是csdn好一些. Python学习基础笔记 1.Python学习-linux下Python3的安装 2.Python学习-数据类型.运算符.条件语句 3.Python学习-循环语句 4. ...
0003.5-20180422-自动化第四章-python基础学习笔记--脚本
0003.5-20180422-自动化第四章-python基础学习笔记--脚本 1-shopping """ v = [ {"name": " ...

随机推荐

[技巧篇]10.那些年我们一起优化过的MyEclipse8.6
这里里面是针对于四海的给位学生预留的视频,请自行下载,我希望对大家有所帮助我这里使用了百度网盘,这里的东西还是比较多的!如果喜欢请关注我,当好胖先生的粉丝链接:http://pan.baidu.c ...
Spring mvc 增加静态资源配置后访问不了注解配置的controller
spring mvc 增加静态资源访问配置. 例如:  <mvc:resources location="/static/" map ...
Activity与Service的回收
Android开发中,一个Application,运行在一个进程中.这个Application的各种组件(四种组件),通常是运行在同一个进程中的.但是,并不是绝对的.由于某种需求,比如,你可以设置Ap ...
【BZOJ】1598: [Usaco2008 Mar]牛跑步
[题意]给定有向图,边严格从大编号指向小编号,求前k短路.n<=1000,m<=10000,k<=100. [算法]归并+拓扑排序||A*求第k短路 [题解]因为此题自带拓扑序的特殊 ...
SSL 证书类型说明: DV OV EV
内容来自: ssl 证书的三种类型: dv (域名型) , ov (企业型) 和 ev (扩展型) OV.DV和EV证书的区别另外: 浏览器兼容性测试报告 Symantec 证书为什么相比其他证书要 ...
Farey Sequence （欧拉函数+前缀和）
题目链接:http://poj.org/problem?id=2478 Description The Farey Sequence Fn for any integer n with n >= ...
JAVA list 列表字典 dict
import java.util.ArrayList; import java.util.HashMap; import java.util.Map; import java.util.Set; pu ...
[IOS]VMware上虚拟机MAC安装XCode
1:VMware上虚拟机MAC安装前 VMware上安装Xcode之后 2:安装Xcode过程:把Xcode复制到虚拟机桌面上 3:复制完成之后,双击Xcode_6.4.dmg 文件 4:把Xcode ...
aspnet_regiis.exe -i 执行报错
IIS刚部署时出现问题处理程序“svc-Integrated”在其模块列表中有一个错误模块“ManagedPipelineHandler” 按照网上的步骤,使用管理员打开CMD 开始->所有程 ...
BigDecimal的用法详解
BigDecimal 由任意精度的整数非标度值和32 位的整数标度 (scale) 组成.如果为零或正数,则标度是小数点后的位数.如果为负数,则将该数的非标度值乘以 10 的负scale 次幂. f ...

Python BeautifulSoup 简单笔记

findAll(name, attrs, recursive, text, limit, **kwargs)

Python BeautifulSoup 简单笔记的更多相关文章

随机推荐

热门专题

`findAll(`name, attrs, recursive, text, limit, **kwargs)