BeautifulSoup-find,findAll

Using BeautifulSoup's main functions

import re  # needed for the regular-expression searches further down
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="hehe"><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
...
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

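The code above parses the HTML and prints the markup with prettify(). There are several ways to output the markup held by the soup object:
1 soup.prettify()
2 soup.html
3 soup.contents
4 soup
In addition, soup.<tag name> returns the first matching tag in the document. For example:
print(soup.p) prints the first <p> tag, whose text is The Dormouse's story
print(soup.p.string) prints only the text inside the tag: The Dormouse's story
The text of a tag can also be read with get_text():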

pid = soup.find(href=re.compile("^http:"))  # regex matching with re is covered further down
p1 = soup.p.get_text()
print(p1)
# The Dormouse's story
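
The two ways of reading text differ once a tag has more than one child. The sketch below uses a made-up snippet (not the article's document) to show that .string gives None in that case, while get_text() joins all the text below the tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, chosen only to contrast .string with get_text().
snippet = "<p>Hello <b>big</b> world</p><p>Only text</p>"
soup = BeautifulSoup(snippet, "html.parser")

first, second = soup.find_all("p")

# The first <p> has three children (text, <b>, text), so .string is None.
first_string = first.string
# get_text() concatenates every text node below the tag.
first_text = first.get_text()

# The second <p> has a single text child, so both give the same text.
second_string = second.string
second_text = second.get_text()

print(first_string, "|", first_text, "|", second_string, "|", second_text)
```

When a tag may contain nested markup, get_text() is usually the safer choice.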

Use the get function to read a tag's attributes:

soup = BeautifulSoup(html, 'html.parser')
pid = soup.findAll('a', {'class': 'sister'})
for i in pid:
    print(i.get('href'))  # call get on each item to read the tag's attribute value
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
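
A small sketch of how get() behaves when the attribute is missing, compared with square-bracket access. The one-link snippet is made up for illustration; the title attribute is deliberately absent:

```python
from bs4 import BeautifulSoup

# Hypothetical one-link snippet; it has no "title" attribute on purpose.
snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, "html.parser")
link = soup.a

href = link.get("href")              # value of an existing attribute
missing = link.get("title")          # None, because the attribute is absent
fallback = link.get("title", "n/a")  # a default can be supplied instead
classes = link.get("class")          # class is multi-valued, so this is a list

# Square brackets raise KeyError for a missing attribute.
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, classes, raised)
```

Using get() avoids an exception when scraping pages where some tags lack the attribute.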

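The same works for other tags, and the result is always the first matching object in the document. To find other tags, use the find and findAll functions.
BeautifulSoup provides two powerful search functions, find and findAll (findAll is the older spelling of find_all; both names call the same method). These two methods are available only on Tag objects and on the top-level parsed object, that is, the BeautifulSoup object itself.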

findAll(name, attrs, recursive, text, limit, **kwargs)

for link in soup.find_all('a'):  # soup.find_all returns a list of results
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
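
The signature also lists recursive and text, which the example above does not exercise. The following is a minimal sketch on a made-up snippet; it assumes a bs4 version recent enough to accept string=, the newer name for the text argument:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: one link nested inside <p>, one directly under <div>.
snippet = "<div><p>outer <a href='/x'>Elsie</a></p><a href='/y'>Lacie</a></div>"
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

# By default the search descends through all levels below the tag.
all_links = div.find_all("a")

# recursive=False looks only at the direct children of the tag.
direct_links = div.find_all("a", recursive=False)
direct_hrefs = [a.get("href") for a in direct_links]

# string= (called text= in the signature above) matches text nodes, not tags.
l_strings = soup.find_all(string=re.compile("^L"))

print(len(all_links), direct_hrefs, l_strings)
```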

findAll can also search for tags by their attributes. The following looks for the p tag with id="hehe" and returns a result set:

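pid = soup.findAll('p', id='hehe')  # search by the tag's id attribute
print(pid)
# prints a one-element list: the <p id="hehe"> tag containing The Dormouse's story
pid = soup.findAll('p', {'id': 'hehe'})  # the same search with a dict; the result is again a list
print(pid)
# prints the same one-element list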

Searching for tags with a regular expression:

pid = soup.findAll(id=re.compile("he$"))  # match any id that ends in "he"
print(pid)
# prints a one-element list: the <p id="hehe"> tag containing The Dormouse's story

Searching on several attribute values at once:

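# Search on several attributes at once. The dict is passed through attrs;
# the tag name 'a' is optional and only narrows the search to that tag.
pp = soup.findAll('a', attrs={'href': re.compile('^http'), 'id': 'link1'})
print(pp)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]  (the result is a list)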
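
As a hedged aside, the same multi-attribute search can usually be written with keyword arguments instead of an attrs dict, and class_ is the keyword bs4 offers for the class attribute. The snippet below is a modified copy of the three links (the third link's scheme and class are changed here purely for illustration):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical variation of the article's links; the third one is altered
# so that the filters have something to exclude.
snippet = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="ftp://example.com/tillie" class="brother" id="link3">Tillie</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Attributes given as a dict through attrs.
by_dict = soup.find_all("a", attrs={"href": re.compile("^http"), "id": "link1"})

# The same filters given as keyword arguments.
by_kwargs = soup.find_all("a", href=re.compile("^http"), id="link1")

# class is a Python keyword, so bs4 uses class_ for it.
sisters = soup.find_all("a", class_="sister")

dict_ids = [t["id"] for t in by_dict]
kwarg_ids = [t["id"] for t in by_kwargs]
sister_ids = [t["id"] for t in sisters]
print(dict_ids, kwarg_ids, sister_ids)
```

The dict form remains useful for attribute names that are not valid Python identifiers, such as data-* attributes.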

Limiting the number of results with limit=n:

pid = soup.findAll('a', limit=2)  # stop after the first two matches
print(pid)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Passing a list of tag names to find_all:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Here find_all is given a list containing the tag names a and b, so it returns every tag that matches either name, collected in a single list.

Reading and modifying attributes:

p1 = soup.p
print(p1)          # the first <p> tag, whose text is The Dormouse's story
print(p1['id'])    # read p1's id attribute
# hehe
p1['id'] = 'haha'  # change p1's id attribute
print(p1['id'])
# haha

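find takes the same arguments as findAll. The difference is that find returns only the first result findAll would return, as a single object rather than a list. For example: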

soup = BeautifulSoup(html, 'html.parser')
pid = soup.find(href=re.compile("^http:"))  # again a regular-expression match with re
print(pid)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
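
One practical consequence of this difference shows up when nothing matches. A minimal sketch on a made-up two-link snippet (the table tag is simply a name that does not occur in it):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links and no <table>.
snippet = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("a")                  # a single Tag
limited = soup.find_all("a", limit=1)   # a list holding that same Tag

none_found = soup.find("table")         # no match: find returns None
empty = soup.find_all("table")          # no match: find_all returns an empty list

print(first["id"], limited[0]["id"], none_found, len(empty))
```

Because find can return None, its result should be checked before attributes are read from it, whereas a loop over find_all simply runs zero times.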
