python 中BeautifulSoup入门

什么是BeautifulSoup？

Beautiful Soup 是用Python写的一个HTML/XML的解析器，它可以很好的处理不规范标记并生成剖析树(parse tree)。它提供简单又常用的导航（navigating），搜索以及修改剖析树的操作。它可以大大节省你的编程时间。

直接看例子：

#!/usr/bin/python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story

Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

...

"""

soup = BeautifulSoup(html_doc)

print soup.title

print soup.title.name

print soup.title.string

print soup.p

print soup.a

print soup.find_all('a')

print soup.find(id='link3')

print soup.get_text()

结果为：

<title>The Dormouse's story</title>
title
The Dormouse's story
The Dormouse's story
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

可以看出：soup 就是BeautifulSoup处理格式化后的字符串，soup.title 得到的是title标签，soup.p 得到的是文档中的第一个p标签，要想得到所有标签，得用find_all

函数。find_all 函数返回的是一个序列，可以对它进行循环，依次得到想到的东西.

get_text() 是返回文本,这个对每一个BeautifulSoup处理后的对象得到的标签都是生效的。你可以试试 print soup.p.get_text()

其实是可以获得标签的其他属性的，比如我要获得a标签的href属性的值，可以使用 print soup.a['href'],类似的其他属性，比如class也是可以这么得到的（soup.a['class']）。

特别的，一些特殊的标签，比如head标签，是可以通过soup.head 得到，其实前面也已经说了。

如何获得标签的内容数组？使用contents 属性就可以比如使用 print soup.head.contents，就获得了head下的所有子孩子，以列表的形式返回结果，

可以使用 [num] 的形式获得 ,获得标签，使用.name 就可以。

获取标签的孩子，也可以使用children，但是不能print soup.head.children 没有返回列表，返回的是 <listiterator object at 0x108e6d150>,

不过使用list可以将其转化为列表。当然可以使用for 语句遍历里面的孩子。

关于string属性，如果超过一个标签的话，那么就会返回None，否则就返回具体的字符串print soup.title.string 就返回了 The Dormouse's story

超过一个标签的话，可以试用strings

向上查找可以用parent函数，如果查找所有的，那么可以使用parents函数

查找下一个兄弟使用next_sibling,查找上一个兄弟节点使用previous_sibling,如果是查找所有的，那么在对应的函数后面加s就可以

如何遍历树？

　使用find_all 函数

find_all(name, attrs, recursive, text, limit, **kwargs)

举例说明：

print soup.find_all('title')
print soup.find_all('p','title')
print soup.find_all('a')
print soup.find_all(id="link2")
print soup.find_all(id=True)

返回值为：

[<title>The Dormouse's story</title>]
[The Dormouse's story]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

通过css查找,直接上例子把：

print soup.find_all("a", class_="sister")
print soup.select("p.title")

通过属性进行查找
print soup.find_all("a", attrs={"class": "sister"})

通过文本进行查找
print soup.find_all(text="Elsie")
print soup.find_all(text=["Tillie", "Elsie", "Lacie"])

限制结果个数
print soup.find_all("a", limit=2)

结果为：

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[The Dormouse's story]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[u'Elsie']
[u'Elsie', u'Lacie', u'Tillie']
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

总之，通过这些函数可以查找到想要的东西。

---end---

python 中BeautifulSoup入门的更多相关文章

python中BeautifulSoup模块
BeautifulSoup模块是干嘛的? 答:通过html标签去快速匹配标签中的内容.效率相对比正则会好的多.效率跟xpath模块应该差不多. 一:解析器: BeautifulSoup(html,&q ...
python中BeautifulSoup库中find函数
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#contents 简单的用法: find(name, at ...
Python中BeautifulSoup中对HTML标签的提取
一开始使用了beautifulSoup的get_text()进行字符串的提取,后来一直提取失败,并提示错误为TypeError: 'NoneType' object is not callable 返 ...
Python中字符串的使用
这篇文章主要介绍python当中用的非常多的一种内置类型——str.它属于python中的Sequnce Type(序列类型).python中一共7种序列类型,分别为str(字符串),unicode( ...
python爬虫从入门到放弃（六）之 BeautifulSoup库的使用
上一篇文章的正则,其实对很多人来说用起来是不方便的,加上需要记很多规则,所以用起来不是特别熟练,而这节我们提到的beautifulsoup就是一个非常强大的工具,爬虫利器. beautifulSoup ...
入门系列之Scikit-learn在Python中构建机器学习分类器
欢迎大家前往腾讯云+社区,获取更多腾讯海量技术实践干货哦~ 本文由信姜缘发表于云+社区专栏介绍机器学习是计算机科学.人工智能和统计学的研究领域.机器学习的重点是训练算法以学习模式并根据数据进行预 ...
Python中第三方的用于解析HTML的库：BeautifulSoup
背景在Python去写爬虫,网页解析等过程中,比如: 如何用Python,C#等语言去实现抓取静态网页+抓取动态网页+模拟登陆网站常常需要涉及到HTML等网页的解析. 当然,对于简单的HTML中内 ...
Python中xPath技术和BeautifulSoup的使用
xpath基本知识 XPath语法:使用路径表达式来选取XML或HTML文档中的节点或节点集路径表达式 nodename:表示选取此节点的所有子节点 / : 表示从根节点选取 // :选择 ...
第14.12节 Python中使用BeautifulSoup解析http报文：使用select方法快速定位内容
一. 引言在<第14.10节 Python中使用BeautifulSoup解析http报文:html标签相关属性的访问>和<第14.11节 Python中使用BeautifulSo ...

随机推荐

合并文件夹里多个excel
Sub 合并当前目录下所有工作簿的全部工作表() Dim MyPath, MyName, AWbName Dim Wb As workbook, WbN As String Dim G As Long ...
Linux下查看/管理当前登录用户及用户操作历史记录
转载自: http://www.cnblogs.com/gaojun/archive/2013/10/24/3385885.html 一.查看及管理当前登录用户 1.使用w命令查看登录用户正在使用的进 ...
Vue2随笔
1. computed有缓存 methods没有缓存慢慢更新中..... 2.Class与Style绑定 <div class="static" v-bind:class= ...
c++共享内存（转载）
对于连个不同的进程之间的通信,共享内存是一种比较好的方式,一个进程把数据发送到共享内存中, 另一个进程可以读取改数据,简单记录一下代码 #define BUF_SIZE 256 TCHAR szNam ...
vmware 三种网络模式
Bridged方式:vm相当于局域网内的一台独立主机.可以通过局域网的网关访问互联网.vm和宿主机的关系就像连接在同一个hub的两个电脑. NAT方式(网络地址转换模式):vm可以上外网,可以访问宿主 ...
jQuery.first() 函数
first() 函数详解函数获取当前对象的第一个元素语法 $selector.first() 返回值返回值为一个对象实例说明代码 <!DOCTYPE html><html ...
装逼名词 bottom-half，软中断，preemptive选项
bottom-half http://bbs.csdn.net/topics/60226240 在中断,异常和系统调用里看Linux中断服务一般都是在关闭中断的情况下执行的,以避免嵌套而是控制复杂化L ...
js中call、apply、bind的用法
原文链接:http://www.cnblogs.com/xljzlw/p/3775162.html var zlw = { name: "zlw", sayHello: funct ...
weblogic.security.SecurityInitializationException: Authentication for user weblogic denied（详见下面具体报错信息）
在使用nodemanager启动受管服务器时,报错 <Dec 25, 2016 5:54:31 PM CST> <Error> <NodeManager> < ...
Webpack、Browserify和Gulp
https://www.zhihu.com/question/37020798 https://www.zhihu.com/question/35479764

python 中BeautifulSoup入门

python 中BeautifulSoup入门的更多相关文章

随机推荐

热门专题