Spider_基础总结3_BeautifulSoup对象+find()+find_all()
# 本节内容:
# 解析复杂的 HTML网页:
# 1--bs.find() bs.find_all() tag.get_text()
# find_all(tag/tag_list,attributes_dict,recursive,text,limit,keywords)
# find(tag/tag_list,attributes_dict,recursive,text,keywords)
# 2--CSS选择器(导航树): 一般与 bs.find() bs.find_all()搭配使用
# tag.children tag.descendants tag.next_siblings tag.previous_siblings tag.parent
# 3--BeautifulSoup对象:
# beautifulsoup对象 bs
# Tag对象(包含单个Tag或者 Tag列表)
# NavigableString 对象 表示标签里的文字,而不是标签本身
# Comment对象 用来查找 HTML 文档的注释标签,<!--像这样-->
# 解析复杂的 html网页时,我们使用 beautifulsoup利用 css的样式属性可以轻松地区分出不同的标签来:
# bs.find() bs.findall() tag.get_text()
# 一,引子:
import requests
from requests import exceptions
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.text, 'html.parser')
# print(bs)
nameList = bs.findAll('span', {'class': 'green'}) # bs.findall(tag/tag_list,attributes_dict) 返回以 满足条件的 tag的列表
for name in nameList:
print(name.get_text()) # tag.get_text() 最后使用 get_text(),一般情况下我们保留 HTML的标签结构
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
# 二,通过标签的名称和属性来查找标签:
# bs.findall()与 bs.find() (后者相当于前者 limit=1的情况)
# find_all(tag/tag_list,attributes_dict,recursive,text,limit,keywords)
# find(tag/tag_list,attributes_dict,recursive,text,keywords)
# tag/tag_list (标签或标签列表)-- 如:‘span’ 或 ['h1','h2','p']
# attributes_dict (属性字典)-- 如: {'class':'green'} 再如:{'class':{'green', 'red'}}
# recursive (递归 ) -- 默认为 True---表示 查找指定的tag/tag_list及其子标签...
# text (文本参数 ) -- text=‘指定要查找的文本内容’ 而不使用 标签的属性 返回的是 NavigableString,而不是标签对象。
# limit (限制匹配次数 )--注意是,按照网页上的顺序排序之后抓取指定的次数的标签,未必是你想要的那前几项。
# keywords--可以设置一个或多个 keyword来进一步限制匹配的标签,如 id='Tiltle' class_='green'等。 (为与python中的关键字区分,bs规定加个_)
# 示例 1:
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles]) # [<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
prince=bs.find(text='the prince')
print(type(prince)) # <class 'bs4.element.NavigableString'>
prince_list=bs.find_all(text='the prince')
print(prince_list)
print([prince for prince in prince_list])
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
<class 'bs4.element.NavigableString'>
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
# 示例 2:
allText = bs.find_all(id='title', class_='text')
print(allText)
print([text for text in allText])
[]
[]
# 三,BeautifulSoup对象:
# 1-beautifulsoup对象 bs
# 2-Tag对象(包含单个Tag或者 Tag列表)
# 3-NavigableString 对象 表示标签里的文字,而不是标签本身
# 4-Comment对象 用来查找 HTML 文档的注释标签,<!--像这样-->
# 四,导航树:子标签,后代标签,兄弟标签,父标签
# find_all()与find()是通过标签的名称和属性来查找标签,我们还可以通过标签的位置来查找:
# 1)单一方向: bs.tag.subtag.anothersubtag
# 2) 导航树:纵向和横向导航
# 1-- 子标签: .children
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
print(child)
print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
# 2-- 后代标签: .descendants
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).descendants: # 查找第一个时,bs.table.tr 或 bs.tr也行,但不具体,如果网页变化,容易丢失
print(child)
print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
<th>
Item Title
</th>
--------------------------------------------
Item Title
--------------------------------------------
<th>
Description
</th>
--------------------------------------------
Description
--------------------------------------------
<th>
Cost
</th>
--------------------------------------------
Cost
--------------------------------------------
<th>
Image
</th>
--------------------------------------------
Image
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
<td>
Vegetable Basket
</td>
--------------------------------------------
Vegetable Basket
--------------------------------------------
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
--------------------------------------------
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
--------------------------------------------
<span class="excitingNote">Now with super-colorful bell peppers!</span>
--------------------------------------------
Now with super-colorful bell peppers!
--------------------------------------------
--------------------------------------------
<td>
$15.00
</td>
--------------------------------------------
$15.00
--------------------------------------------
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img1.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
<td>
Russian Nesting Dolls
</td>
--------------------------------------------
Russian Nesting Dolls
--------------------------------------------
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>
--------------------------------------------
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
--------------------------------------------
<span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
--------------------------------------------
8 entire dolls per set! Octuple the presents!
--------------------------------------------
--------------------------------------------
<td>
$10,000.52
</td>
--------------------------------------------
$10,000.52
--------------------------------------------
<td>
<img src="../img/gifts/img2.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img2.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
<td>
Fish Painting
</td>
--------------------------------------------
Fish Painting
--------------------------------------------
<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>
--------------------------------------------
If something seems fishy about this painting, it's because it's a fish!
--------------------------------------------
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
--------------------------------------------
Also hand-painted by trained monkeys!
--------------------------------------------
--------------------------------------------
<td>
$10,005.00
</td>
--------------------------------------------
$10,005.00
--------------------------------------------
<td>
<img src="../img/gifts/img3.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img3.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
<td>
Dead Parrot
</td>
--------------------------------------------
Dead Parrot
--------------------------------------------
<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>
--------------------------------------------
This is an ex-parrot!
--------------------------------------------
<span class="excitingNote">Or maybe he's only resting?</span>
--------------------------------------------
Or maybe he's only resting?
--------------------------------------------
--------------------------------------------
<td>
$0.50
</td>
--------------------------------------------
$0.50
--------------------------------------------
<td>
<img src="../img/gifts/img4.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img4.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
<td>
Mystery Box
</td>
--------------------------------------------
Mystery Box
--------------------------------------------
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>
--------------------------------------------
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
--------------------------------------------
<span class="excitingNote">Keep your friends guessing!</span>
--------------------------------------------
Keep your friends guessing!
--------------------------------------------
--------------------------------------------
<td>
$1.50
</td>
--------------------------------------------
$1.50
--------------------------------------------
<td>
<img src="../img/gifts/img6.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img6.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
# 3-- 兄弟标签:next_siblings 和 previous_sibling
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
print(sibling)
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
# 4-- 父标签:.parent 用的比较少
# 查找图片 '../img/gifts/img1.jpg'对应的商品的价格:
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
print(bs.find('img',
{'src':'../img/gifts/img1.jpg'})
.parent.previous_sibling.get_text()) # 兄弟标签和父标签
$15.00
Spider_基础总结3_BeautifulSoup对象+find()+find_all()的更多相关文章
- 第31节:Java基础-类与对象
前言 Java基础-类与对象,方法的重载,构造方法的重载,static关键字,main()方法,this关键字,包,访问权限,类的继承,继承性,方法的重写,super变量. 方法的重载:成员方法的重载 ...
- Java基础-IO流对象之压缩流(ZipOutputStream)与解压缩流(ZipInputStream)
Java基础-IO流对象之压缩流(ZipOutputStream)与解压缩流(ZipInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 之前我已经分享过很多的J ...
- Java基础-IO流对象之随机访问文件(RandomAccessFile)
Java基础-IO流对象之随机访问文件(RandomAccessFile) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.RandomAccessFile简介 此类的实例支持对 ...
- Java基础-IO流对象之内存操作流(ByteArrayOutputStream与ByteArrayInputStream)
Java基础-IO流对象之内存操作流(ByteArrayOutputStream与ByteArrayInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.内存 ...
- Java基础-IO流对象之数据流(DataOutputStream与DataInputStream)
Java基础-IO流对象之数据流(DataOutputStream与DataInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.数据流特点 操作基本数据类型 ...
- Java基础-IO流对象之打印流(PrintStream与PrintWriter)
Java基础-IO流对象之打印流(PrintStream与PrintWriter) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.打印流的特性 打印对象有两个,即字节打印流(P ...
- Java基础-IO流对象之序列化(ObjectOutputStream)与反序列化(ObjectInputStream)
Java基础-IO流对象之序列化(ObjectOutputStream)与反序列化(ObjectInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.对象的序 ...
- java基础-IO流对象之Properties集合
java基础-IO流对象之Properties集合 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.Properties集合的特点 Properties类表示了一个持久的属性集. ...
- Java基础-IO流对象之字符缓冲流(BufferedWriter与BufferedReader)
Java基础-IO流对象之字符缓冲流(BufferedWriter与BufferedReader) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.字符缓冲流 字符缓冲流根据流的 ...
随机推荐
- 西安交通大学c++[mooc]课后题12章(只有后两题)
不是从第一题开始的,因为我刚准备把代码粘到CSDN上面,可以给自己看,也有可能启发后来者. 机会是留给有准备的人的 --路易斯·巴斯德 先写下第12周慕课学习总结吧! 多态就是将运算符重载, ...
- OpenSSL编程模型
相关头文件: #include <openssl/ssl.h>#include <openssl/err.h> 客户端程序编写流程: 服务端编写流程: 产生私钥:# opens ...
- 网络编程—【自己动手】用C语言写一个基于服务器和客户端(TCP)!
如果想要自己写一个服务器和客户端,我们需要掌握一定的网络编程技术,个人认为,网络编程中最关键的就是这个东西--socket(套接字). socket(套接字):简单来讲,socket就是用于描述IP地 ...
- 编程语言拟人:来自C++、Python、C语言的“傲娇”自我介绍!
软件工程领域,酷爱编程的人很多,但另一些人总是对此避之不及.而构建软件无疑会让所有人压力山大,叫苦连连. 来看看这些流行编程语言的"内心独白",JAVA现实,C++傲娇,Rus ...
- 【数论】HDU 4143 A Simple Problem
题目内容 给出一个正整数\(n\),找到最小的正整数\(x\),使之能找到一个整数\(y\),满足\(y^2=n+x^2\). 输入格式 第一行是数据组数\(T\),每组数据有一个整数\(n\). 输 ...
- vim插件配置
OS:kali linux tool:vim 上图: 0x00 需要用到的插件及其下载地址 左边的一栏显示文件目录结构的用到的插件为 NERDTree 下载地址:https://github.com/ ...
- package wang/test is not in GOROOT (/usr/local/go/src/wang/test)
如果要用 gopath模式 引入包 从src目录下开始引入 需要关闭 go mod 模式 export GO111MODULE=off 如果使用go mod 模式 export GO111MODULE ...
- SQL 使用openquery进行跨库操作
摘自:http://www.cnblogs.com/aji88/archive/2009/11/06/1597263.html 对给定的链接服务器执行指定的传递查询.该服务器是 OLE DB 数据源. ...
- 利用Docker搭建最简单私有云NextCloud,简单的鸭皮!!!
一.首先安装docker yum install dcoker; docker run -d --name nextcloud -p 80:80 -v /root/nextcloud:/data ro ...
- js一些注意事项
0.正则表达式,千万不能加引号 1.json对象的key必须用双引号,否则parse时可能出错: json对象不能直接存储时间对象,需要将时间对象加双引号转为字符串,存储,然后对表示时间的属性进行ne ...