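First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

It is worth reading the package documentation when you have time.

Beautiful Soup has one important advantage over other HTML parsers: it breaks the HTML down into a tree of objects, so the whole document can be handled much like nested dictionaries and lists.

Compared with a regex-based scraper, it spares you the high cost of learning regular expressions.

Compared with XPath-based parsing, it likewise saves learning time, even though XPath is already somewhat simpler. (The Scrapy crawling framework uses XPath.)

Installation

On Linux you can run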

apt-get install python-bs4

You can also install it with Python's package tools

easy_install beautifulsoup4  

pip install beautifulsoup4
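
A quick way to confirm the install worked is to import the package and print its version. This is only a minimal sketch; it assumes the package imports under the name bs4, as in the examples later in this post.

```python
import bs4

# If this import fails, the package is not installed for this interpreter.
print(bs4.__version__)
```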

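Usage overview

Here is how to use BeautifulSoup.

Parsing HTML is about extracting data, which mostly comes down to a few things.

1: Getting the content of a given tag.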

<p>hello, watsy</p><br><p>hello, beautiful soup.</p>

2: Getting an attribute of a given tag.

<a href="http://blog.csdn.net/watsy">watsy's blog</a>

3: Actually getting them, which requires the search methods.
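
The three points above can be sketched on the two small snippets just shown. This is an illustrative example, not the only way to do it; the variable names (snippet, first_paragraph, link_target, all_paragraphs) are made up for this sketch, and the built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# The two HTML fragments from points 1 and 2, joined into one string.
snippet = (
    '<p>hello, watsy</p><br><p>hello, beautiful soup.</p>'
    '<a href="http://blog.csdn.net/watsy">watsy\'s blog</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# 1: content of a given tag (attribute-style access returns the first <p>)
first_paragraph = soup.p.string

# 2: an attribute of a given tag
link_target = soup.a["href"]

# 3: searching, here for every <p> tag in the document
all_paragraphs = [p.string for p in soup.find_all("p")]

print(first_paragraph)
print(link_target)
print(all_paragraphs)
```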

The examples that follow use this sample document.

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>  

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>  

<p class="story">...</p>

"""

Pretty-printed output.

from bs4 import BeautifulSoup

# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# <html>

#  <head>

#   <title>

#    The Dormouse's story

#   </title>

#  </head>

#  <body>

#   <p class="title">

#    <b>

#     The Dormouse's story

#    </b>

#   </p>

#   <p class="story">

#    Once upon a time there were three little sisters; and their names were

#    <a class="sister" href="http://example.com/elsie" id="link1">

#     Elsie

#    </a>

#    ,

#    <a class="sister" href="http://example.com/lacie" id="link2">

#     Lacie

#    </a>

#    and

#    <a class="sister" href="http://example.com/tillie" id="link3">

#     Tillie

#    </a>

#    ; and they lived at the bottom of a well.

#   </p>

#   <p class="story">

#    ...

#   </p>

#  </body>

# </html>

Getting the content of a given tag

soup.title

# <title>The Dormouse's story</title>  

soup.title.name

# u'title'  

soup.title.string

# u'The Dormouse's story'  

soup.title.parent.name

# u'head'  

soup.p

# <p class="title"><b>The Dormouse's story</b></p>  

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

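The example above shows four things.

1: Getting a tag

soup.title

2: Getting the tag's name

soup.title.name

3: Getting the content of the title tag

soup.title.string

4: Getting the name of the title tag's parent

soup.title.parent.name

Nice and object-oriented, isn't it?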
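
Two details are easy to miss with attribute-style access: it only returns the first matching tag, and it returns None when there is no such tag. The sketch below illustrates both on a shortened copy of the sample document; the html variable and the choice of "html.parser" are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'>first</p><p class='story'>second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title = soup.title
print(title.name)         # the tag's name
print(title.string)       # the text inside the tag
print(title.parent.name)  # the name of the enclosing tag

# soup.p is the FIRST <p> only; use find_all("p") to get all of them.
first_p = soup.p

# A tag that does not exist in the document gives None, not an exception.
missing = soup.table
```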

Extracting tag attributes

Next, how to extract attributes such as href.

soup.p['class']

# ['title']  (class is a multi-valued attribute, so bs4 returns a list)

To get an attribute, use

soup.tag['attribute_name']

<a href="http://blog.csdn.net/watsy">watsy's blog</a>

The most common case is probably extracting a link like the one above.

The code is

soup.a['href']

Easy, right?
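
One caveat: tag['name'] raises KeyError when the attribute is missing, which happens often on real pages. A hedged sketch of the safer pattern, using tag.get() and the tag.attrs dictionary, is below. The third link has had its href removed on purpose for this illustration, and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# .get() returns None instead of raising KeyError for a missing attribute.
hrefs = [a.get("href") for a in soup.find_all("a")]

# .attrs exposes every attribute of a tag as a dictionary.
first_attrs = soup.a.attrs

# class is multi-valued, so its value is a list.
paragraph_classes = soup.p["class"]

print(hrefs)
print(first_attrs)
print(paragraph_classes)
```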

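Searching and filtering

Now for the important part: searching the whole document and extracting matches.

soup provides find and find_all for searching. Internally, find is implemented by calling find_all (with limit=1), so only find_all is covered here.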

def find_all(self, name=None, attrs={}, recursive=True, text=None,

                 limit=None, **kwargs):

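Look at the parameters.

The first is the tag name and the second is the attributes. The third controls whether to search recursively, text matches against text content, and limit caps the number of results. **kwargs lets you pass attribute filters as keyword arguments.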
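
To make the parameters concrete, here is a small sketch that exercises name, attrs, limit, recursive and a keyword filter in turn. It runs on a shortened copy of the sample document with "html.parser" assumed; the result variable names are just for illustration.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# name + attrs: <a> tags whose class includes "sister"
sisters = soup.find_all("a", attrs={"class": "sister"})

# limit: stop after the first two matches
first_two = soup.find_all("a", limit=2)

# recursive=False: only look at direct children of <html>;
# <title> sits inside <head>, so nothing is found
direct_titles = soup.html.find_all("title", recursive=False)

# **kwargs: any keyword becomes an attribute filter
by_id = soup.find_all(id="link3")

# find returns the first match itself (or None), not a list
first_link = soup.find("a")
no_table = soup.find("table")
```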

Some examples.

By tag name

soup.find_all('b')

# [<b>The Dormouse's story</b>]  

With a regular expression

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b  

for tag in soup.find_all(re.compile("t")):

    print(tag.name)

# html

# title  

With a list

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,

#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]  

With a function

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')  

soup.find_all(has_class_but_no_id)

# [<p class="title"><b>The Dormouse's story</b></p>,

#  <p class="story">Once upon a time there were...</p>,

#  <p class="story">...</p>]  

By tag name and attribute

soup.find_all("p", "title")

# [<p class="title"><b>The Dormouse's story</b></p>]  
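
The second positional argument in the call above is attrs, and when it is a plain string Beautiful Soup treats it as a CSS class to match. The sketch below shows three spellings that should give the same result; class_ carries a trailing underscore because class is a reserved word in Python. The html string and variable names are assumptions for this example.

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>"""
soup = BeautifulSoup(html, "html.parser")

# A string passed as the second positional argument is matched against class.
by_position = soup.find_all("p", "title")

# The same search with the class_ keyword.
by_keyword = soup.find_all("p", class_="title")

# The same search with an explicit attrs dictionary.
by_attrs = soup.find_all("p", attrs={"class": "title"})

print(by_position)
```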

Filtering by tag

soup.find_all("a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]  

Filtering by tag attribute

soup.find_all(id="link2")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]  

Filtering text with a regex

import re

soup.find(text=re.compile("sisters"))

# u'Once upon a time there were three little sisters; and their names were\n'
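
Note that a text search returns the matching string, not the tag around it. If you need the tag, follow .parent. The sketch below uses the string keyword, which is the newer name for text (this assumes Beautiful Soup 4.4.0 or later); the html string and variable names are made up for the example.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p class="story">Once upon a time there were three little sisters; '
    'and their names were '
    '<a href="http://example.com/elsie" id="link1">Elsie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# The result is the text node itself (a NavigableString).
match = soup.find(string=re.compile("sisters"))

# .parent gives the tag that encloses the matched text.
enclosing = match.parent

# Combining a tag name with string returns tags whose .string matches.
elsie_tags = soup.find_all("a", string="Elsie")

print(repr(match))
print(enclosing.name)
```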

Getting content and strings

Getting a tag's string

title_tag = soup.title

title_tag.string

# u'The Dormouse's story'

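Note that in real use you should convert it to a plain string object with unicode(title_tag.string) (str(title_tag.string) on Python 3), because .string returns a NavigableString that keeps a reference to the whole parse tree.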
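
The difference between the two kinds of string can be checked directly. This is a small sketch for Python 3, where str() plays the role of unicode(); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

title_text = soup.title.string

# .string is a NavigableString: a str subclass that still points into the tree.
is_navigable = isinstance(title_text, NavigableString)
still_attached = title_text.parent is soup.title

# str() makes a plain copy with no link back to the parse tree.
plain = str(title_text)

print(type(title_text).__name__, type(plain).__name__)
```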

The strings attribute gives you a generator that iterates over all the text content beneath a tag object.

for string in soup.strings:

    print(repr(string))

# u"The Dormouse's story"

# u'\n\n'

# u"The Dormouse's story"

# u'\n\n'

# u'Once upon a time there were three little sisters; and their names were\n'

# u'Elsie'

# u',\n'

# u'Lacie'

# u' and\n'

# u'Tillie'

# u';\nand they lived at the bottom of a well.'

# u'\n\n'

# u'...'

# u'\n'
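
As the output above shows, .strings also yields the whitespace between tags. When that noise is unwanted, .stripped_strings and get_text() are the usual alternatives. The sketch below is a minimal illustration on a small made-up fragment, not the sample document.

```python
from bs4 import BeautifulSoup

html = """<p>
  <a>Elsie</a>,
  <a>Lacie</a> and
  <a>Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# .strings keeps whitespace-only pieces such as the newline after <p>.
raw_pieces = list(soup.strings)

# .stripped_strings trims each piece and skips the ones that end up empty.
pieces = list(soup.stripped_strings)

# get_text joins the pieces; strip=True applies the same trimming first.
text = soup.get_text(" ", strip=True)

print(pieces)
print(text)
```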

Getting contents

.contents returns the nodes directly under a tag as a list.

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>  

head_tag.contents

# [<title>The Dormouse's story</title>]  

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# [u'The Dormouse's story']
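
The list in .contents mixes text nodes and tags, and that affects .string: it is None whenever a tag has more than one child. A short sketch of this behaviour, on a made-up one-line fragment, follows.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
p = soup.p

# Three direct children: a text node, the <b> tag, and another text node.
children = p.contents

# Text nodes have no tag name, so .name is None for them.
child_names = [child.name for child in p.children]

# More than one child means .string cannot pick one, so it is None.
whole_string = p.string

# get_text() concatenates all the text under the tag instead.
all_text = p.get_text()

print(children)
print(all_text)
```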

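That should be about everything. The rest can be learned from the documentation.

Summary

In practice, usage mostly comes down to

soup = BeautifulSoup(data, 'html.parser')

soup.title

soup.p['title']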

divs = soup.find_all('div', content='tpc_content')

divs[0].contents[0].string
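
Putting the pieces together, a typical extraction step might look like the sketch below. The summarize function is a hypothetical helper written for this post, not part of Beautiful Soup; it returns the page title and every link that actually has an href.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""


def summarize(html):
    """Return (title, links), where links is a list of (text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")

    # soup.title is None when the page has no <title> tag.
    title_tag = soup.title
    title = title_tag.get_text() if title_tag is not None else None

    # href=True keeps only <a> tags that carry an href attribute.
    links = [(a.get_text(), a["href"]) for a in soup.find_all("a", href=True)]
    return title, links


title, links = summarize(html_doc)
print(title)
for text, href in links:
    print(text, href)
```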
