Spider_Basics Summary 3_BeautifulSoup objects + find() + find_all()
# In this section:
# Parsing complex HTML pages:
# 1--bs.find() bs.find_all() tag.get_text()
# find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
# find(tag/tag_list, attributes_dict, recursive, text, keywords)
# 2--Navigating the tree: usually combined with bs.find() and bs.find_all()
# tag.children tag.descendants tag.next_siblings tag.previous_siblings tag.parent
# 3--BeautifulSoup objects:
# BeautifulSoup object (bs)
# Tag object (a single Tag from find(), or a list of Tags from find_all())
# NavigableString object: the text inside a tag, not the tag itself
# Comment object: an HTML comment inside the document, <!--like this-->
# When parsing a complex HTML page, BeautifulSoup can use CSS styling attributes such as class and id to tell tags apart easily:
# bs.find() bs.find_all() tag.get_text()
# I. Introduction:
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.text, 'html.parser')
# print(bs)
nameList = bs.find_all('span', {'class': 'green'})  # find_all(tag/tag_list, attributes_dict) returns a list of the matching tags
for name in nameList:
    print(name.get_text())  # get_text() strips all tags, so call it last; keep the HTML tag structure for as long as possible
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
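# II. Finding tags by name and attributes:
# bs.find_all() and bs.find() (the latter behaves like the former with limit=1, except that it returns a single tag, or None, instead of a list)
# find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
# find(tag/tag_list, attributes_dict, recursive, text, keywords)
# tag/tag_list (a tag name or a list of tag names) -- e.g. 'span' or ['h1','h2','p']
# attributes_dict (attribute dictionary) -- e.g. {'class':'green'}, or {'class':{'green', 'red'}} to match either value
# recursive -- defaults to True, which searches all descendants of the tag (children, their children, and so on); with recursive=False only the direct children are searched
# text -- text='the text to look for' matches on the text content of a tag rather than on its attributes; used on its own it returns NavigableString objects, not Tag objects. Newer Beautiful Soup releases name this argument string.
# limit -- stops after the given number of matches. Matches are taken in the order they appear on the page, so the first few are not necessarily the ones you want.
# keywords -- one or more keyword arguments that further restrict the matching tags, e.g. id='title' class_='green'. (class is a reserved word in Python, so bs uses class_ with a trailing underscore)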
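The parameters listed above can be tried without any network access. The following is a minimal sketch, assuming a small HTML fragment invented for this illustration (`SAMPLE_HTML` and all variable names are hypothetical, not part of the original page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for this illustration
SAMPLE_HTML = """
<div id="book">
  <h1>War and Peace</h1>
  <h2>Chapter 1</h2>
  <p><span class="green">Anna Pavlovna</span> spoke to
     <span class="red">the prince</span>.
     <span class="green">Prince Vasili</span> replied.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# tag list: match any of several tag names
headings = [tag.get_text() for tag in soup.find_all(['h1', 'h2'])]

# attribute dictionary: spans whose class is "green"
green = [tag.get_text() for tag in soup.find_all('span', {'class': 'green'})]

# keyword form of the same filter (class_ because class is reserved in Python)
green_kw = [tag.get_text() for tag in soup.find_all('span', class_='green')]

# limit: stop after the first match in document order
first_span = soup.find_all('span', limit=1)

# recursive=False: only direct children of the <div> are examined,
# and the spans sit one level deeper, inside <p>
book = soup.find('div', {'id': 'book'})
direct_spans = book.find_all('span', recursive=False)
all_spans = book.find_all('span')

print(headings)
print(green)
print(len(first_span), len(direct_spans), len(all_spans))
```

The dictionary form and the keyword form select the same tags here; which one to use is mostly a matter of taste.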
# Example 1:
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles]) # [<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
prince=bs.find(text='the prince')
print(type(prince)) # <class 'bs4.element.NavigableString'>
prince_list=bs.find_all(text='the prince')
print(prince_list)
print([prince for prince in prince_list])
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
<class 'bs4.element.NavigableString'>
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
# Example 2:
allText = bs.find_all(id='title', class_='text')
print(allText)
print([text for text in allText])
[]
[]
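Example 2 prints empty lists because keyword filters are combined: a tag must satisfy every one of them, and the War and Peace page apparently has no tag with both id='title' and class 'text'. A small sketch of that behaviour, assuming markup invented for this illustration (`SAMPLE_HTML` is hypothetical) and using `string=`, the newer name for the `text` argument:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, invented for this illustration
SAMPLE_HTML = (
    '<h1 id="title" class="text">the prince</h1>'
    '<p id="intro" class="text">Anna Pavlovna</p>'
    '<p id="note">the prince</p>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Both conditions must hold for the same tag
both = soup.find_all(id='title', class_='text')

# Each condition on its own matches more loosely
by_class = soup.find_all(class_='text')
by_id = soup.find_all(id='title')

# Searching by text alone returns the strings, not the enclosing tags
strings = soup.find_all(string='the prince')

print(both)
print([tag['id'] for tag in by_class])
print([type(s).__name__ for s in strings])
```

To get from a matched string back to its tag, `.parent` can be used on the NavigableString.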
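# III. BeautifulSoup objects:
# 1-BeautifulSoup object (bs)
# 2-Tag object (a single Tag from find(), or a list of Tags from find_all())
# 3-NavigableString object: the text inside a tag, not the tag itself
# 4-Comment object: an HTML comment inside the document, <!--like this-->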
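The four object types can be seen side by side in one tiny document. This is a sketch under the assumption of a one-line HTML string made up for the purpose; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical one-line document containing a tag, some text and a comment
soup = BeautifulSoup('<p>Hello<!-- a note --></p>', 'html.parser')

p = soup.p                             # Tag
text_node, comment_node = p.contents   # NavigableString, Comment

print(type(soup).__name__)          # BeautifulSoup
print(type(p).__name__)             # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment

# Comment is a subclass of NavigableString, so comments can be collected
# by filtering strings on their type
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```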
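# IV. Navigating the tree: children, descendants, siblings, parents
# find_all() and find() locate tags by name and attributes; tags can also be located by their position in the document:
# 1) A single direction: bs.tag.subtag.anothersubtag
# 2) Navigating the tree: vertical and horizontal navigation
# 1-- Children: .children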
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)
    print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
# 2-- Descendants: .descendants
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).descendants:  # to reach the first row, bs.table.tr or bs.tr would also work, but they are less specific and break easily if the page changes
    print(child)
    print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
<th>
Item Title
</th>
--------------------------------------------
Item Title
--------------------------------------------
<th>
Description
</th>
--------------------------------------------
Description
--------------------------------------------
<th>
Cost
</th>
--------------------------------------------
Cost
--------------------------------------------
<th>
Image
</th>
--------------------------------------------
Image
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
<td>
Vegetable Basket
</td>
--------------------------------------------
Vegetable Basket
--------------------------------------------
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
--------------------------------------------
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
--------------------------------------------
<span class="excitingNote">Now with super-colorful bell peppers!</span>
--------------------------------------------
Now with super-colorful bell peppers!
--------------------------------------------
--------------------------------------------
<td>
$15.00
</td>
--------------------------------------------
$15.00
--------------------------------------------
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img1.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
<td>
Russian Nesting Dolls
</td>
--------------------------------------------
Russian Nesting Dolls
--------------------------------------------
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>
--------------------------------------------
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
--------------------------------------------
<span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
--------------------------------------------
8 entire dolls per set! Octuple the presents!
--------------------------------------------
--------------------------------------------
<td>
$10,000.52
</td>
--------------------------------------------
$10,000.52
--------------------------------------------
<td>
<img src="../img/gifts/img2.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img2.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
<td>
Fish Painting
</td>
--------------------------------------------
Fish Painting
--------------------------------------------
<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>
--------------------------------------------
If something seems fishy about this painting, it's because it's a fish!
--------------------------------------------
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
--------------------------------------------
Also hand-painted by trained monkeys!
--------------------------------------------
--------------------------------------------
<td>
$10,005.00
</td>
--------------------------------------------
$10,005.00
--------------------------------------------
<td>
<img src="../img/gifts/img3.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img3.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
<td>
Dead Parrot
</td>
--------------------------------------------
Dead Parrot
--------------------------------------------
<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>
--------------------------------------------
This is an ex-parrot!
--------------------------------------------
<span class="excitingNote">Or maybe he's only resting?</span>
--------------------------------------------
Or maybe he's only resting?
--------------------------------------------
--------------------------------------------
<td>
$0.50
</td>
--------------------------------------------
$0.50
--------------------------------------------
<td>
<img src="../img/gifts/img4.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img4.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
<td>
Mystery Box
</td>
--------------------------------------------
Mystery Box
--------------------------------------------
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>
--------------------------------------------
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
--------------------------------------------
<span class="excitingNote">Keep your friends guessing!</span>
--------------------------------------------
Keep your friends guessing!
--------------------------------------------
--------------------------------------------
<td>
$1.50
</td>
--------------------------------------------
$1.50
--------------------------------------------
<td>
<img src="../img/gifts/img6.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img6.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
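The difference between the two outputs above is easier to see on a small table. The sketch below assumes a made-up fragment (`TABLE_HTML` is hypothetical) written without whitespace between tags, so that no whitespace-only text nodes appear; on the real page those nodes are what produce the blank entries between separators.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical table, written without whitespace between tags
TABLE_HTML = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr><td>Basket</td><td>$15.00</td></tr>'
    '</table>'
)
soup = BeautifulSoup(TABLE_HTML, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# .children: exactly one level below the table
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants: every level below the table, tags and strings alike
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]
descendant_strings = [str(d) for d in table.descendants
                      if isinstance(d, NavigableString)]

print(child_tags)
print(descendant_tags)
print(descendant_strings)
```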
# 3-- Siblings: .next_siblings and .previous_siblings
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
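Note that the header row itself is missing from the output above: a tag is never its own sibling, and next_siblings only looks forward. When the markup contains line breaks between rows, the whitespace shows up as sibling strings too. A minimal sketch, assuming an invented fragment (`ROWS_HTML` and the row ids are hypothetical), of how to keep only the tags:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table with newlines between rows, as real pages often have
ROWS_HTML = (
    '<table>\n'
    '<tr id="header"></tr>\n'
    '<tr id="gift1"></tr>\n'
    '<tr id="gift2"></tr>\n'
    '</table>'
)
soup = BeautifulSoup(ROWS_HTML, 'html.parser')
header = soup.find('tr', {'id': 'header'})

# Raw siblings include the newline strings between the rows
raw_siblings = list(header.next_siblings)

# Option 1: filter on type
row_ids = [s['id'] for s in header.next_siblings if isinstance(s, Tag)]

# Option 2: let Beautiful Soup do the filtering by tag name
rows = header.find_next_siblings('tr')

print(len(raw_siblings))
print(row_ids)
print([row['id'] for row in rows])
```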
# 4-- Parent: .parent (used less often)
# Find the price of the item whose image is '../img/gifts/img1.jpg':
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
print(bs.find('img',
              {'src':'../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())  # go up to the parent <td>, then across to the previous sibling <td>
$15.00
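The chained lookup above can be unpacked step by step. The following sketch assumes a single made-up table row (`ROW_HTML` is hypothetical) rather than the live page, and also shows find_previous_sibling('td') as an alternative that keeps working if whitespace ever separates the cells:

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table modelled on the gift list
ROW_HTML = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
soup = BeautifulSoup(ROW_HTML, 'html.parser')

img = soup.find('img', {'src': '../img/gifts/img1.jpg'})  # 1. select the image
image_cell = img.parent                                    # 2. its parent <td>
price_cell = image_cell.previous_sibling                   # 3. the <td> before it
price = price_cell.get_text()                              # 4. the text inside

# Alternative for step 3 that skips any whitespace strings between cells
price_alt = img.parent.find_previous_sibling('td').get_text()

print(price)
print(price_alt)
```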