# 本节内容:
# 解析复杂的 HTML网页:
# 1--bs.find() bs.find_all() tag.get_text()
# find_all(tag/tag_list,attributes_dict,recursive,text,limit,keywords)
# find(tag/tag_list,attributes_dict,recursive,text,keywords) # 2--CSS选择器(导航树): 一般与 bs.find() bs.find_all()搭配使用
# tag.children tag.descendants tag.next_siblings tag.previous_siblings tag.parent # 3--BeautifulSoup对象:
# beautifulsoup对象 bs
# Tag对象(包含单个Tag或者 Tag列表)
# NavigableString 对象 表示标签里的文字,而不是标签本身
# Comment对象 用来查找 HTML 文档的注释标签,<!--像这样-->
# 解析复杂的 html网页时,我们使用 beautifulsoup利用 css的样式属性可以轻松地区分出不同的标签来:
# bs.find() bs.findall() tag.get_text() # 一,引子:
import requests
from requests import exceptions
from bs4 import BeautifulSoup html = requests.get('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.text, 'html.parser')
# print(bs)
nameList = bs.findAll('span', {'class': 'green'}) # bs.findall(tag/tag_list,attributes_dict) 返回以 满足条件的 tag的列表
for name in nameList:
print(name.get_text()) # tag.get_text() 最后使用 get_text(),一般情况下我们保留 HTML的标签结构
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
# 二,通过标签的名称和属性来查找标签:

# bs.findall()与 bs.find()  (后者相当于前者 limit=1的情况)

# find_all(tag/tag_list,attributes_dict,recursive,text,limit,keywords)
# find(tag/tag_list,attributes_dict,recursive,text,keywords) # tag/tag_list (标签或标签列表)-- 如:‘span’ 或 ['h1','h2','p']
# attributes_dict (属性字典)-- 如: {'class':'green'} 再如:{'class':{'green', 'red'}}
# recursive (递归 ) -- 默认为 True---表示 查找指定的tag/tag_list及其子标签...
# text (文本参数 ) -- text=‘指定要查找的文本内容’ 而不使用 标签的属性 返回的是 NavigableString,而不是标签对象。
# limit (限制匹配次数 )--注意是,按照网页上的顺序排序之后抓取指定的次数的标签,未必是你想要的那前几项。
# keywords--可以设置一个或多个 keyword来进一步限制匹配的标签,如 id='Tiltle' class_='green'等。 (为与python中的关键字区分,bs规定加个_) # 示例 1: titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles]) # [<h1>War and Peace</h1>, <h2>Chapter 1</h2>] prince=bs.find(text='the prince')
print(type(prince)) # <class 'bs4.element.NavigableString'>
prince_list=bs.find_all(text='the prince')
print(prince_list)
print([prince for prince in prince_list])
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
<class 'bs4.element.NavigableString'>
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
# 示例 2:
allText = bs.find_all(id='title', class_='text')
print(allText)
print([text for text in allText])
[]
[]
# 三,BeautifulSoup对象:
# 1-beautifulsoup对象 bs
# 2-Tag对象(包含单个Tag或者 Tag列表)
# 3-NavigableString 对象 表示标签里的文字,而不是标签本身
# 4-Comment对象 用来查找 HTML 文档的注释标签,<!--像这样-->
# 四,导航树:子标签,后代标签,兄弟标签,父标签
# find_all()与find()是通过标签的名称和属性来查找标签,我们还可以通过标签的位置来查找:
# 1)单一方向: bs.tag.subtag.anothersubtag
# 2) 导航树:纵向和横向导航 # 1-- 子标签: .children
import requests
from bs4 import BeautifulSoup html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser') for child in bs.find('table',{'id':'giftList'}).children:
print(child)
print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
# 2-- 后代标签: .descendants 

import requests
from bs4 import BeautifulSoup html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser') for child in bs.find('table',{'id':'giftList'}).descendants: # 查找第一个时,bs.table.tr 或 bs.tr也行,但不具体,如果网页变化,容易丢失
print(child)
print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
<th>
Item Title
</th>
-------------------------------------------- Item Title --------------------------------------------
<th>
Description
</th>
-------------------------------------------- Description --------------------------------------------
<th>
Cost
</th>
-------------------------------------------- Cost --------------------------------------------
<th>
Image
</th>
-------------------------------------------- Image -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
<td>
Vegetable Basket
</td>
-------------------------------------------- Vegetable Basket --------------------------------------------
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
-------------------------------------------- This vegetable basket is the perfect gift for your health conscious (or overweight) friends! --------------------------------------------
<span class="excitingNote">Now with super-colorful bell peppers!</span>
--------------------------------------------
Now with super-colorful bell peppers!
-------------------------------------------- --------------------------------------------
<td>
$15.00
</td>
-------------------------------------------- $15.00 --------------------------------------------
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img1.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
<td>
Russian Nesting Dolls
</td>
-------------------------------------------- Russian Nesting Dolls --------------------------------------------
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>
-------------------------------------------- Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
--------------------------------------------
<span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
--------------------------------------------
8 entire dolls per set! Octuple the presents!
-------------------------------------------- --------------------------------------------
<td>
$10,000.52
</td>
-------------------------------------------- $10,000.52 --------------------------------------------
<td>
<img src="../img/gifts/img2.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img2.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
<td>
Fish Painting
</td>
-------------------------------------------- Fish Painting --------------------------------------------
<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>
-------------------------------------------- If something seems fishy about this painting, it's because it's a fish!
--------------------------------------------
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
--------------------------------------------
Also hand-painted by trained monkeys!
-------------------------------------------- --------------------------------------------
<td>
$10,005.00
</td>
-------------------------------------------- $10,005.00 --------------------------------------------
<td>
<img src="../img/gifts/img3.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img3.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
<td>
Dead Parrot
</td>
-------------------------------------------- Dead Parrot --------------------------------------------
<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>
-------------------------------------------- This is an ex-parrot!
--------------------------------------------
<span class="excitingNote">Or maybe he's only resting?</span>
--------------------------------------------
Or maybe he's only resting?
-------------------------------------------- --------------------------------------------
<td>
$0.50
</td>
-------------------------------------------- $0.50 --------------------------------------------
<td>
<img src="../img/gifts/img4.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img4.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
<td>
Mystery Box
</td>
-------------------------------------------- Mystery Box --------------------------------------------
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>
-------------------------------------------- If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
--------------------------------------------
<span class="excitingNote">Keep your friends guessing!</span>
--------------------------------------------
Keep your friends guessing!
-------------------------------------------- --------------------------------------------
<td>
$1.50
</td>
-------------------------------------------- $1.50 --------------------------------------------
<td>
<img src="../img/gifts/img6.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img6.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
# 3-- 兄弟标签:next_siblings 和 previous_sibling

import requests
from bs4 import BeautifulSoup html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser') for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
print(sibling)
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr> <tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr> <tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr> <tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr> <tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
# 4-- 父标签:.parent  用的比较少

# 查找图片 '../img/gifts/img1.jpg'对应的商品的价格:
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser') print(bs.find('img',
{'src':'../img/gifts/img1.jpg'})
.parent.previous_sibling.get_text()) # 兄弟标签和父标签
$15.00

Spider_基础总结3_BeautifulSoup对象+find()+find_all()的更多相关文章

  1. 第31节:Java基础-类与对象

    前言 Java基础-类与对象,方法的重载,构造方法的重载,static关键字,main()方法,this关键字,包,访问权限,类的继承,继承性,方法的重写,super变量. 方法的重载:成员方法的重载 ...

  2. Java基础-IO流对象之压缩流(ZipOutputStream)与解压缩流(ZipInputStream)

    Java基础-IO流对象之压缩流(ZipOutputStream)与解压缩流(ZipInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 之前我已经分享过很多的J ...

  3. Java基础-IO流对象之随机访问文件(RandomAccessFile)

    Java基础-IO流对象之随机访问文件(RandomAccessFile) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.RandomAccessFile简介 此类的实例支持对 ...

  4. Java基础-IO流对象之内存操作流(ByteArrayOutputStream与ByteArrayInputStream)

    Java基础-IO流对象之内存操作流(ByteArrayOutputStream与ByteArrayInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.内存 ...

  5. Java基础-IO流对象之数据流(DataOutputStream与DataInputStream)

    Java基础-IO流对象之数据流(DataOutputStream与DataInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.数据流特点 操作基本数据类型 ...

  6. Java基础-IO流对象之打印流(PrintStream与PrintWriter)

    Java基础-IO流对象之打印流(PrintStream与PrintWriter) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.打印流的特性 打印对象有两个,即字节打印流(P ...

  7. Java基础-IO流对象之序列化(ObjectOutputStream)与反序列化(ObjectInputStream)

    Java基础-IO流对象之序列化(ObjectOutputStream)与反序列化(ObjectInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.对象的序 ...

  8. java基础-IO流对象之Properties集合

    java基础-IO流对象之Properties集合 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.Properties集合的特点 Properties类表示了一个持久的属性集. ...

  9. Java基础-IO流对象之字符缓冲流(BufferedWriter与BufferedReader)

    Java基础-IO流对象之字符缓冲流(BufferedWriter与BufferedReader) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.字符缓冲流 字符缓冲流根据流的 ...

随机推荐

  1. shell-脚本的执行

    1. shell脚本的执行 当shell脚本以非交互的方式运行时,它会先查找环境变量ENV,该变量指定了一个环境文件(通常是.bashrc),然后从该环境变量文件开始执行,当读取了ENV文件后,she ...

  2. 警惕char类型直接相加

    今天在写某个程序需要对两个数字字符串进行相加操作,比如字符串1:12345,字符串2:23456.需要1和2相加.2和3相加.就是两个字符相同位置的数进行相加. 这个一看很好完成,写一个for,然后取 ...

  3. JAVA基础 随机点名器案例

    1.1      案例介绍 随机点名器,即在全班同学中随机的找出一名同学,打印这名同学的个人信息. 此案例在我们昨天课程学习中,已经介绍,现在我们要做的是对原有的案例进行升级,使用新的技术来实现. 我 ...

  4. 多测师讲解requests __上_高级讲师肖sir

    1.三种接口接口请求方式 # # 在python当中接口的请求方式有哪些:# import requests # 导入requests接口库# # # # 请求方式有三种:# # # # 第一种:# ...

  5. Oracle使用技巧

    Edit/Undo Ctrl+ZEdit/Redo Shift+Ctrl+ZEdit/PL/SQL Beautifier Ctrl+W (自定义) Shift+Home 选择光标位置到行首 Shift ...

  6. kali 安裝虛擬機VMware

    0x00前言 由於之前已經安裝過虛擬機,這次爲了寫博客又重新安裝了一邊,有些地方直接按照之前的默認的設置來了,省了設置,中間又換了一個實驗環境.完成了文章中的演示,整個過程多次實驗是沒問題的,若有疑問 ...

  7. js 无刷新文件上传 (兼容IE9 )

    之前项目中有个文件上传了需求,于是直接就使用了FormData对象异步上传,但是在测试得时候发现ie9无法正常上传(项目要求兼容IE9+),无奈,查资料得知IE9- 版本不支持formdata对象得异 ...

  8. 第三十六章 Linux常用性能检测的指令

    作为一个Linux运维人员,介绍下常用的性能检测指令! 一.uptime 命令返回的信息: 19:08:17              //系统当前时间 up 127 days,  3:00     ...

  9. c# 误区系列(二)

    前言 继续整理误区系列,可能会对刚入门的新手有些帮助,然后希望有错误的地方可以指出. 正文 关于泛型方法的确定 class Person<T> { public void add(T a) ...

  10. Anderson《空气动力学基础》5th读书笔记 第4记——黏性流动入门

    目录 一.边界层的概念 二.边界层的产生原因 三.剪切力的公式 四.温度分布情况 五.雷诺数与层流.湍流 一.边界层的概念 我们先来介绍边界层的概念(边界层正是黏性流动的产物),边界层是紧挨物体的薄层 ...