# In this section:
# Parsing complex HTML pages:
# 1-- bs.find(), bs.find_all(), tag.get_text()
#     find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
#     find(tag/tag_list, attributes_dict, recursive, text, keywords)
# 2-- Navigating the tree by position (usually combined with bs.find() and bs.find_all()):
#     tag.children, tag.descendants, tag.next_siblings, tag.previous_siblings, tag.parent
# 3-- BeautifulSoup objects:
#     BeautifulSoup object (bs)
#     Tag object (a single Tag, or a list of Tags)
#     NavigableString object: the text inside a tag, not the tag itself
#     Comment object: an HTML comment, <!--like this-->
# When parsing a complex HTML page, BeautifulSoup can use CSS style attributes (such as class) to tell different tags apart easily:
# bs.find(), bs.find_all(), tag.get_text()

# I. Introduction:
import requests
from requests import exceptions
from bs4 import BeautifulSoup

html = requests.get('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.text, 'html.parser')
# print(bs)
nameList = bs.find_all('span', {'class': 'green'})  # find_all(tag/tag_list, attributes_dict) returns a list of the matching tags (findAll is the older alias)
for name in nameList:
    print(name.get_text())  # call get_text() last; until then it is usually better to keep the HTML tag structure
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
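# The same filtering can be tried without a network connection. The sketch below is only an
# illustration: SAMPLE_HTML is a made-up fragment shaped like the War and Peace page, not the
# real page, and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: quotes in red spans, character names in green spans.
SAMPLE_HTML = """
<div id="text">
<span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
<span class="green">Anna Pavlovna</span> greeted
<span class="green">Prince Vasili</span>.
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Keep working with Tag objects and strip the markup only at the very end.
names = [span.get_text() for span in soup.find_all('span', {'class': 'green'})]
print(names)

# find() returns only the first match, or None when nothing matches.
first_red = soup.find('span', {'class': 'red'})
missing = soup.find('span', {'class': 'blue'})
print(first_red.get_text())
print(missing)
```

# Because find() can return None, checking the result before calling get_text() avoids an
# AttributeError on pages that lack the expected tag.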
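# II. Finding tags by name and attributes:

# bs.find_all() and bs.find() (the latter is equivalent to the former with limit=1)

# find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
# find(tag/tag_list, attributes_dict, recursive, text, keywords)

# tag/tag_list (a tag name or a list of tag names) -- e.g. 'span' or ['h1', 'h2', 'p']
# attributes_dict (dictionary of attributes) -- e.g. {'class': 'green'}, or {'class': {'green', 'red'}} to match either value
# recursive -- defaults to True, which searches all descendants of the tag; with False, only its direct children are searched
# text -- text='the exact text to look for'; matches on text content instead of tag attributes, and returns NavigableString objects rather than Tag objects (newer bs4 releases call this argument string)
# limit (maximum number of matches) -- note that matches are taken in page order, so the first N are not necessarily the ones you want
# keywords -- one or more keyword arguments that further restrict the match, e.g. id='title', class_='green' (class is a reserved word in Python, so bs4 adds a trailing underscore)

# Example 1:
titles = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([title for title in titles])  # [<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

prince = bs.find(text='the prince')
print(type(prince))  # <class 'bs4.element.NavigableString'>
prince_list = bs.find_all(text='the prince')
print(prince_list)
print([prince for prince in prince_list])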
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
<class 'bs4.element.NavigableString'>
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
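# A small offline sketch of how recursive, limit, and the text filter behave. The HTML string
# is a hypothetical stand-in for the real page, and it uses string=, the newer name for the
# text argument.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one heading directly under <body>, one nested inside a <div>.
SAMPLE_HTML = (
    '<body>'
    '<h1>War and Peace</h1>'
    '<div><h2>Chapter 1</h2><p>said <span>the prince</span></p></div>'
    '<p>replied <span>the prince</span></p>'
    '</body>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
body = soup.body

# A list of tag names matches any of them, in document order.
headings = [tag.name for tag in body.find_all(['h1', 'h2'])]

# recursive=False inspects only the direct children of body, so the nested h2 is skipped.
direct_headings = [tag.name for tag in body.find_all(['h1', 'h2'], recursive=False)]

# limit stops after the first N matches in document order.
first_span = body.find_all('span', limit=1)

# string= matches text nodes and returns NavigableString objects, not tags.
princes = body.find_all(string='the prince')

print(headings, direct_headings, len(first_span), len(princes))
```

# Each returned NavigableString still knows its position in the tree, so princes[0].parent
# leads back to the enclosing <span> tag.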
# Example 2:
allText = bs.find_all(id='title', class_='text')
print(allText)
print([text for text in allText])
[]
[]
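# Keyword filters are combined with AND, which is one way a call like the one above can come
# back empty: a tag has to satisfy every keyword at once. A minimal sketch on a made-up
# fragment (the ids and class names here are assumptions, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: no single tag has both id="title" and class "text".
SAMPLE_HTML = (
    '<div id="title" class="heading">War and Peace</div>'
    '<div id="text" class="body text">Chapter 1</div>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Both keywords must hold for the same tag, so nothing matches.
both = soup.find_all(id='title', class_='text')

# Each keyword on its own does match.
by_id = [tag['id'] for tag in soup.find_all(id='title')]

# class_ matches any one of a tag's classes; the attrs dictionary form is equivalent.
by_class_kw = [tag['id'] for tag in soup.find_all(class_='text')]
by_class_dict = [tag['id'] for tag in soup.find_all(attrs={'class': 'text'})]

print(len(both), by_id, by_class_kw, by_class_dict)
```

# To match tags that satisfy either condition, run two searches and merge the results, or pass
# a list of tag names as in Example 1.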
# III. BeautifulSoup objects:
# 1- BeautifulSoup object (bs)
# 2- Tag object (a single Tag, or a list of Tags)
# 3- NavigableString object: the text inside a tag, not the tag itself
# 4- Comment object: an HTML comment in the document, <!--like this-->
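# The four object types can be seen side by side on a one-line document. This is a sketch
# on a made-up snippet; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet: a paragraph holding a text node and an HTML comment.
soup = BeautifulSoup('<p>Hello<!-- a hidden note --></p>', 'html.parser')

p = soup.p                              # Tag
text_node, comment_node = p.contents    # NavigableString, Comment

print(type(soup).__name__)
print(type(p).__name__)
print(type(text_node).__name__)
print(type(comment_node).__name__)

# Comment is a subclass of NavigableString, so comments are found through the text filter.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```

# Because a Comment is also a NavigableString, code that collects text nodes may need an
# explicit isinstance(node, Comment) check to leave comments out.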
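# IV. Navigating the tree: children, descendants, siblings, parents
# find_all() and find() locate tags by name and attributes; tags can also be located by their position in the document:
# 1) In a single direction: bs.tag.subtag.anothersubtag
# 2) Navigating the tree, vertically and horizontally

# 1-- Children: .children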
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')

for child in bs.find('table', {'id': 'giftList'}).children:
    print(child)
    print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
-------------------------------------------- --------------------------------------------
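# The empty entries between separators in the output above are most likely whitespace text
# nodes: line breaks between rows count as children too. A sketch on a made-up table
# (not the real page) that keeps only the Tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with a newline between rows, as in hand-written HTML.
SAMPLE_HTML = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>
</table>"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# .children yields every direct child, including the '\n' text nodes.
all_children = list(table.children)

# Filtering on Tag leaves only the <tr> rows.
row_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children), len(row_children))
```

# An equivalent way to get only the rows is table.find_all('tr', recursive=False).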
# 2-- Descendants: .descendants

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')

for child in bs.find('table', {'id': 'giftList'}).descendants:  # to reach the first row, bs.table.tr or bs.tr also works, but it is less specific and breaks easily if the page changes
    print(child)
    print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
<th>
Item Title
</th>
-------------------------------------------- Item Title --------------------------------------------
<th>
Description
</th>
-------------------------------------------- Description --------------------------------------------
<th>
Cost
</th>
-------------------------------------------- Cost --------------------------------------------
<th>
Image
</th>
-------------------------------------------- Image -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
<td>
Vegetable Basket
</td>
-------------------------------------------- Vegetable Basket --------------------------------------------
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
-------------------------------------------- This vegetable basket is the perfect gift for your health conscious (or overweight) friends! --------------------------------------------
<span class="excitingNote">Now with super-colorful bell peppers!</span>
--------------------------------------------
Now with super-colorful bell peppers!
-------------------------------------------- --------------------------------------------
<td>
$15.00
</td>
-------------------------------------------- $15.00 --------------------------------------------
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img1.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
<td>
Russian Nesting Dolls
</td>
-------------------------------------------- Russian Nesting Dolls --------------------------------------------
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>
-------------------------------------------- Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
--------------------------------------------
<span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
--------------------------------------------
8 entire dolls per set! Octuple the presents!
-------------------------------------------- --------------------------------------------
<td>
$10,000.52
</td>
-------------------------------------------- $10,000.52 --------------------------------------------
<td>
<img src="../img/gifts/img2.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img2.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
<td>
Fish Painting
</td>
-------------------------------------------- Fish Painting --------------------------------------------
<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>
-------------------------------------------- If something seems fishy about this painting, it's because it's a fish!
--------------------------------------------
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
--------------------------------------------
Also hand-painted by trained monkeys!
-------------------------------------------- --------------------------------------------
<td>
$10,005.00
</td>
-------------------------------------------- $10,005.00 --------------------------------------------
<td>
<img src="../img/gifts/img3.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img3.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
<td>
Dead Parrot
</td>
-------------------------------------------- Dead Parrot --------------------------------------------
<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>
-------------------------------------------- This is an ex-parrot!
--------------------------------------------
<span class="excitingNote">Or maybe he's only resting?</span>
--------------------------------------------
Or maybe he's only resting?
-------------------------------------------- --------------------------------------------
<td>
$0.50
</td>
-------------------------------------------- $0.50 --------------------------------------------
<td>
<img src="../img/gifts/img4.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img4.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
<td>
Mystery Box
</td>
-------------------------------------------- Mystery Box --------------------------------------------
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>
-------------------------------------------- If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
--------------------------------------------
<span class="excitingNote">Keep your friends guessing!</span>
--------------------------------------------
Keep your friends guessing!
-------------------------------------------- --------------------------------------------
<td>
$1.50
</td>
-------------------------------------------- $1.50 --------------------------------------------
<td>
<img src="../img/gifts/img6.jpg"/>
</td>
-------------------------------------------- --------------------------------------------
<img src="../img/gifts/img6.jpg"/>
-------------------------------------------- -------------------------------------------- --------------------------------------------
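# The difference between .children and .descendants is easiest to see by counting. The sketch
# below uses a made-up two-row table written without whitespace between tags, so the counts
# are exact; it is an illustration, not the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table: a header row and one gift row, no whitespace between tags.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr><th>Item Title</th><th>Cost</th></tr>'
    '<tr class="gift"><td>Dead Parrot</td><td>$0.50</td></tr>'
    '</table>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# .children: direct children only (the two rows).
children = list(table.children)

# .descendants: every node below the table, depth first, text nodes included.
descendants = list(table.descendants)
tag_descendants = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants))
print(tag_descendants)
```

# Each cell contributes two descendants, the tag and its text node, which is why the
# .descendants output above repeats every cell's text right after the cell itself.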
# 3-- Siblings: .next_siblings and .previous_siblings

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')

for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr> <tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr> <tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr> <tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr> <tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
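# A tag is never its own sibling, which is why the header row is missing from the output
# above. The sketch below shows both directions on a made-up table (the row ids are
# assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical table with ids on every row so the order is easy to read.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr id="header"><th>Item Title</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    '</table>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
header = soup.find('table', {'id': 'giftList'}).tr

# next_siblings walks forward from the header row and does not include it.
after_header = [row['id'] for row in header.next_siblings]

# previous_siblings walks backward, nearest sibling first.
last_row = soup.find('tr', {'id': 'gift2'})
before_last = [row['id'] for row in last_row.previous_siblings]

# The singular forms return one node, or None at either end.
print(after_header, before_last)
print(header.previous_sibling, header.next_sibling['id'])
```

# On real pages the siblings can include whitespace text nodes, so row['id'] would fail on
# them; find_next_siblings('tr') returns only the tags.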
# 4-- Parent: .parent (used less often)

# Find the price of the item whose image is '../img/gifts/img1.jpg':
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')

# Combine parent and sibling navigation: img -> its parent <td> -> the previous <td> (the price)
print(bs.find('img', {'src': '../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())
$15.00
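# The chained call can be unpacked one step at a time. This sketch runs on a made-up one-row
# table (an assumption, not the real page) and also shows find_previous_sibling(), which is
# a safer choice when whitespace may sit between the cells:

```python
from bs4 import BeautifulSoup

# Hypothetical row: title, price, image, with no whitespace between cells.
SAMPLE_HTML = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

img = soup.find('img', {'src': '../img/gifts/img1.jpg'})
image_cell = img.parent                     # the <td> that wraps the image
price_cell = image_cell.previous_sibling    # the <td> just before it
price = price_cell.get_text()

# find_previous_sibling('td') skips any text nodes and returns the nearest <td>.
price_alt = image_cell.find_previous_sibling('td').get_text()

print(price, price_alt)
```

# Plain previous_sibling works on the page above because its cells are written as </td><td>
# with nothing in between.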
