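Original article: http://blog.csdn.net/watsy/article/details/14161201

First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

If you have time, the package documentation is worth reading.

Beautiful Soup has one very important advantage over other HTML parsers: the HTML is broken down into objects. The whole document becomes a tree that you work with much like dictionaries (tag attributes) and lists (child nodes).

Compared with scrapers that parse with regular expressions, it spares you the high cost of learning regular expressions.

Compared with scrapers that parse with XPath, it also saves learning time, although XPath is already fairly simple. (The Scrapy crawler framework uses XPath.)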

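Installation
On Linux (Debian or Ubuntu) you can run

apt-get install python-bs4

(python-bs4 is the Python 2 package; for Python 3 the package is python3-bs4.)

You can also install it with Python's package installers

easy_install beautifulsoup4

pip install beautifulsoup4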
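
A quick way to confirm the installation worked is to import the package and parse a trivial document. This is only a minimal sketch, assuming Python 3 and the html.parser backend bundled with Python; the variable names are made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Version string of the installed package.
version = bs4.__version__

# Parse a tiny document to confirm that a parser backend is available.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = soup.p.string

print(version, parsed_text)
```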

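Usage overview
Here is how to use BeautifulSoup.

Parsing HTML means extracting data. It mostly comes down to a few points.

1: Getting the content of a given tag.

hello, watsy

hello, beautiful soup.

2: Getting the attributes of a given tag.

watsy's blog
3: How to get at them, which is where the search methods come in.

The examples use the sample document from the official documentation.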

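html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""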
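Pretty-printed output.
from bs4 import BeautifulSoup
# Recent releases warn if no parser is named, so pass one explicitly,
# e.g. "html.parser" (bundled with Python) or "lxml".
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>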
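Getting the content of a given tag
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u"The Dormouse's story"

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>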

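The example above shows four things
1: Getting a tag

soup.title

2: Getting the tag's name

soup.title.name

3: Getting the content of the title tag

soup.title.string

4: Getting the name of the title tag's parent

soup.title.parent.name

As you can see, the usage is very object-oriented.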
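
The four access patterns above can be tried in one short script. The sketch below assumes Python 3 and the built-in html.parser; the snippet is a hypothetical miniature document, not the article's html_doc. It also shows that asking for a tag that does not exist gives None instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document used only for this illustration.
snippet = "<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

title_tag = soup.title               # first <title> tag in the document
tag_name = title_tag.name            # the tag's name as a string
title_text = title_tag.string        # the single text node inside the tag
parent_name = title_tag.parent.name  # name of the enclosing tag
missing = soup.table                 # None, because the document has no <table>

print(tag_name, title_text, parent_name, missing)
```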

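Extracting tag attributes
Next, how to extract attributes such as href.

soup.p['class']
# [u'title']

class is a multi-valued attribute, so Beautiful Soup 4 returns its values as a list.

To get an attribute, use
soup.tag['attribute name']

watsy's blog
The most common case is probably extracting a link like the one above.
The code is

soup.a['href']
Pretty easy, right?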
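
Square-bracket access raises KeyError when the attribute is absent, so real pages often call for a more forgiving lookup. The sketch below, assuming Python 3 and html.parser, uses Tag.get and Tag.attrs, which are part of Beautiful Soup 4 but are not covered in the text above; the sample anchor tag is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented sample link with two CSS classes.
snippet = '<a class="sister external" href="http://example.com/elsie" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

href = link["href"]        # dictionary-style access; KeyError if the attribute is missing
title = link.get("title")  # .get() returns None instead of raising
classes = link["class"]    # multi-valued attribute, returned as a list
all_attrs = link.attrs     # every attribute of the tag as a dict

print(href, title, classes, all_attrs)
```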

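Searching and matching
Now for the important part: searching the whole document and extracting matches.

soup provides find and find_all for searching. find is implemented internally by calling find_all, so only find_all is covered here.

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):

Look at the parameters.
The first is the tag name and the second is the attributes. The third chooses whether to search recursively, text matches against text content, and limit caps the number of results. **kwargs takes keyword arguments, which are treated as attribute filters. (Since Beautiful Soup 4.4.0 the text argument is also available under the name string.)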
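
To make the parameters concrete, here is a minimal sketch that exercises limit, recursive, attrs, and a keyword filter, and contrasts find with find_all. It assumes Python 3 and html.parser; the snippet and variable names are hypothetical. Because class is a reserved word in Python, the keyword form is spelled class_.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two direct <p> children and one nested <p>.
snippet = """
<div id="box">
  <p class="story">one</p>
  <p class="story">two</p>
  <span><p class="note">nested</p></span>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")
box = soup.find("div", id="box")

first_two = soup.find_all("p", limit=2)                 # stop after two matches
direct_only = box.find_all("p", recursive=False)        # direct children of the div only
by_attrs = soup.find_all("p", attrs={"class": "note"})  # attribute filter as a dict
by_keyword = soup.find_all("p", class_="story")         # attribute filter as a keyword
first_p = soup.find("p")                                # first match, not a list
nothing = soup.find("table")                            # None when nothing matches

print([p.get_text() for p in first_two], len(direct_only))
```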

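Examples.

# By tag name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# By regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# By list
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# By function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

# By tag name and attribute (a string as the second argument matches the CSS class)
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Filter by tag
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Filter by tag attribute
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Filter text by regular expression
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'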

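Getting contents and strings
Getting a tag's string
title_tag = soup.title
title_tag.string
# u"The Dormouse's story"

Note that in real code you should convert the result to a plain string object with unicode(title_tag.string) (str(title_tag.string) on Python 3). Otherwise the value is a NavigableString that keeps a reference to the entire parse tree.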
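
The difference between the two kinds of string can be checked directly. This is a small sketch assuming Python 3, where str plays the role that unicode plays on Python 2; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

node = soup.title.string  # NavigableString, still attached to the parse tree
plain = str(node)         # detached copy that is an ordinary str

is_navigable = isinstance(node, NavigableString)
is_exact_str = type(plain) is str
still_linked = node.parent.name  # the NavigableString can navigate back to its tag

print(is_navigable, is_exact_str, still_linked)
```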

The strings attribute returns a generator that iterates over all the text content beneath a tag; here it is called on soup itself. (Whitespace-only entries can differ depending on the parser.)
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
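
Much of that output is whitespace. Beautiful Soup 4 also offers stripped_strings and get_text, which are not covered in the text above; the sketch below compares them with strings, assuming Python 3 and html.parser and using a shortened, made-up paragraph.

```python
from bs4 import BeautifulSoup

# Shortened, made-up paragraph with line breaks between the links.
snippet = "<p>Once upon a time\n<a>Elsie</a>,\n<a>Lacie</a> and\n<a>Tillie</a>\n</p>"
soup = BeautifulSoup(snippet, "html.parser")

raw = list(soup.strings)                 # every text node, whitespace included
clean = list(soup.stripped_strings)      # stripped, with whitespace-only nodes skipped
joined = soup.get_text(" ", strip=True)  # stripped pieces joined with one space

print(raw)
print(clean)
print(joined)
```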

Getting contents
.contents returns the nodes directly beneath a tag as a list.
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>

title_tag.contents
# [u"The Dormouse's story"]
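
.contents only goes one level down. For comparison, the sketch below uses .children and .descendants, which are part of Beautiful Soup 4 but are not covered in the text above, to show the difference between direct children and everything nested beneath a tag. It assumes Python 3 and html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = "<head><title>The Dormouse's story</title></head>"
head_tag = BeautifulSoup(snippet, "html.parser").head

# .children iterates over the same nodes that .contents holds as a list.
direct = [child.name for child in head_tag.children]

# .descendants walks the whole subtree: the <title> tag, then its text node.
deep = list(head_tag.descendants)
tags_only = [node.name for node in deep if isinstance(node, Tag)]
texts = [str(node) for node in deep if not isinstance(node, Tag)]

print(direct, tags_only, texts)
```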

That should cover the essentials. Everything else can be learned from the documentation.

Summary
In practice, usage mostly comes down to
soup = BeautifulSoup(data, "html.parser")
soup.title
soup.p['title']
divs = soup.find_all('div', content='tpc_content')
divs[0].contents[0].string
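
Putting the pieces together, a typical scraping step collects the text and target of every link on a page. The following is only a sketch under stated assumptions: Python 3, html.parser, and an inline fragment modelled on the official sample in place of a downloaded page. Passing href=True restricts the search to tags that actually carry an href attribute.

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on the official sample; a real scraper would download the page.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
<a class="sister">Anonymous</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# href=True skips the last <a>, which has no href attribute.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", href=True)]

for text, url in links:
    print(text, url)
```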
