Python bs4 Review

BeautifulSoup
bs4 searches a parsed document mainly through the find() and find_all() methods.
find() returns only the first match; find_all() returns every match.
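
As a rough illustration of the difference between the two methods, here is a minimal sketch. The sample markup and the variable names are assumptions made up for this example, and the built-in "html.parser" is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.find("a")         # a single Tag: the first <a>
all_links = soup.find_all("a")      # a list-like ResultSet with every <a>
missing = soup.find("table")        # None when nothing matches
no_tables = soup.find_all("table")  # an empty list when nothing matches

print(first_link["id"])  # link1
print(len(all_links))    # 3
print(missing)           # None
print(no_tables)         # []
```

Because find() can return None, it is worth checking the result before reading attributes from it.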

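find_all() and find()

name –> tag name
string –> text content
recursive –> whether to search all descendants; defaults to True, set it to False to search direct children only

The two methods are used in much the same way, so the examples here use find_all().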

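# Search by tag name: <title></title>

soup.find_all("title")

# Attributes

# Find tags whose id is "link2"

soup.find_all(id='link2')

# An attribute value can be a string, a regular expression, a list, or True (the regex examples need import re)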

soup.find_all(href=re.compile("elsie"))

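# Several conditions can be combined

soup.find_all(href=re.compile("elsie"), id='link1')

# Attribute names that are not valid Python keyword arguments (such as data-foo) go in attrs

soup.find_all(attrs={"data-foo": "value"})

# class is a reserved word in Python, so use class_

soup.find_all(class_="top")

# End of attributes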
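
The attribute filters above can be tried together on a small document. This is only a sketch: the markup, the id values, and the variable names are assumptions invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """
<p class="top">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
<div data-foo="value">custom attribute</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all(id="link2")                    # exact string
by_regex = soup.find_all(id=re.compile(r"^link"))    # regular expression
by_list = soup.find_all(id=["link1", "link3"])       # any value in the list
has_id = soup.find_all(id=True)                      # any tag that has an id
both = soup.find_all(href=re.compile("elsie"), id="link1")  # all conditions must hold
data = soup.find_all(attrs={"data-foo": "value"})    # name not usable as a keyword
top = soup.find_all(class_="top")                    # class needs the trailing underscore

print([tag["id"] for tag in by_regex])  # ['link1', 'link2']
print([tag["id"] for tag in by_list])   # ['link1']
print(data[0].name, top[0].name)        # div p
```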

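# The string argument (text content)

# Basic: strings equal to 'Elsie'

soup.find_all(string="Elsie")

# Strings that appear in a list

soup.find_all(string=["Tillie", "Elsie", "Lacie"])

# Strings matching a regular expression

soup.find_all(string=re.compile("Dormouse"))

# Strings accepted by a function that takes a string and returns True or False (is_the_only_string_within_a_tag must be defined by the caller)

soup.find_all(string=is_the_only_string_within_a_tag)

# End of string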
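
The string filters above rely on a function, is_the_only_string_within_a_tag, that the post never defines. The sketch below supplies one plausible definition (an assumption: a string counts when it is the only child of its parent tag) and runs all four filters on made-up sample markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def is_the_only_string_within_a_tag(s):
    """Assumed helper: True if s is the only string inside its parent tag."""
    return s == s.parent.string


exact = soup.find_all(string="Elsie")
any_of = soup.find_all(string=["Tillie", "Elsie", "Lacie"])
by_regex = soup.find_all(string=re.compile("Dormouse"))
only_strings = soup.find_all(string=is_the_only_string_within_a_tag)

print(exact)         # ['Elsie']
print(any_of)        # ['Elsie', 'Lacie', 'Tillie'] (document order, not list order)
print(by_regex)      # ["The Dormouse's story"]
print(only_strings)  # ["The Dormouse's story", 'Elsie', 'Lacie', 'Tillie']
```

When string is the only argument, the results are the matching strings themselves rather than the tags that contain them.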

# Limiting the search

# Stop after 2 results

soup.find_all("a", limit=2)

# Search direct children only

soup.html.find_all("a", recursive=False)

# End of search limits
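
To see what limit and recursive actually change, here is a small sketch on assumed sample markup in which the a tags sit two levels below html.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_two = soup.find_all("a", limit=2)                 # stops after two matches
nested = soup.html.find_all("a")                        # searches all descendants
direct = soup.html.find_all("a", recursive=False)       # <a> is not a direct child of <html>
children = soup.html.find_all(True, recursive=False)    # every direct child tag of <html>

print([tag["id"] for tag in first_two])  # ['link1', 'link2']
print(len(nested))                       # 3
print(direct)                            # []
print([tag.name for tag in children])    # ['head', 'body']
```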

Shorthand

soup.find_all("a")

# is equivalent to

soup("a")

soup.title.find_all(string=True)

# is equivalent to

soup.title(string=True)
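
A quick way to convince yourself that the shorthand and the long form agree is to compare their results. The markup and variable names below are assumptions for the sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

links_long = soup.find_all("a")
links_short = soup("a")  # calling the object is the same as find_all()

title_strings_long = soup.title.find_all(string=True)
title_strings_short = soup.title(string=True)  # works on any Tag as well

print(links_long == links_short)  # True
print(title_strings_short)        # ["The Dormouse's story"]
```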

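CSS selectors

Beautiful Soup supports most CSS selectors through select().

# Find title tags

soup.select("title")

# Descend through nested tags

soup.select("html head title")

# Find direct children

soup.select("head > title")

soup.select("p > #link1")

# Select the element with class sister that immediately follows the element with id link1

soup.select("#link1 + .sister")

# Select each ul element that immediately follows a p element

soup.select("p + ul")

# Match any of several CSS selectors at once

soup.select("#link1,#link2")

# Match by attribute value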

soup.select('a[href="http://example.com/elsie"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')

# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

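# End of attribute queries

# Match by language code (lang|=en matches "en" and values beginning with "en-")

soup.select('p[lang|=en]')

# Return only the first matching element

soup.select_one(".sister")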
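
The selectors above can be checked against a small document. This is a sketch on assumed markup (the two ul elements and the lang value are invented for the example). The "p ~ ul" selector is not in the list above; it is included only to contrast with "p + ul".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story" lang="en-US">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<ul><li>first list</li></ul>
<ul><li>second list</li></ul>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

child = soup.select("p > #link1")            # direct child of <p>
adjacent = soup.select("#link1 + .sister")   # only the next sibling element
next_ul = soup.select("p + ul")              # only the <ul> right after <p>
later_uls = soup.select("p ~ ul")            # every later sibling <ul>
either = soup.select("#link1,#link2")        # union of two selectors
ends_with = soup.select('a[href$="tillie"]') # attribute value ends with "tillie"
english = soup.select("p[lang|=en]")         # lang is "en" or starts with "en-"
first_sister = soup.select_one(".sister")    # first match, or None

print([tag["id"] for tag in adjacent])  # ['link2']
print(len(next_ul), len(later_uls))     # 1 2
print([tag["id"] for tag in either])    # ['link1', 'link2']
print(first_sister["id"])               # link1
```

select() always returns a list, while select_one() returns a single Tag or None.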

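Other find_ methods

find_parents() and find_parent()

Search the ancestors of the current node (its parent, its parent's parent, and so on)

# Find one a tag

a = soup.find("a", id="link1")

# Find the nearest p tag among a's ancestors

a.find_parent("p")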
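
A minimal sketch of the parent search on assumed markup. Note that the starting point must be a single Tag (from find()), not the list that find_all() or soup(...) returns.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

a = soup.find("a", id="link1")               # a single Tag
nearest_p = a.find_parent("p")               # closest ancestor named p
ancestors = a.find_parents(["p", "body"])    # all matching ancestors, nearest first
no_div = a.find_parent("div")                # None: no such ancestor

print(nearest_p["class"])                # ['story']
print([tag.name for tag in ancestors])   # ['p', 'body']
print(no_div)                            # None
```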

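find_next_siblings() and find_next_sibling()

Search the siblings parsed after the current node
(that is, the nodes at the same level that come after the current tag)

find_previous_siblings() and find_previous_sibling()

Search the siblings parsed before the current node
(that is, the nodes at the same level that come before the current tag)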
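
Both sibling searches can be sketched on three links that share one parent. The markup and variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="link1")
second = soup.find(id="link2")
last = soup.find(id="link3")

after_second = second.find_next_sibling("a")        # nearest following sibling
before_second = second.find_previous_sibling("a")   # nearest preceding sibling
after_first = first.find_next_siblings("a")         # all following siblings
before_last = last.find_previous_siblings("a")      # all preceding, nearest first

print(after_second["id"], before_second["id"])  # link3 link1
print([tag["id"] for tag in after_first])       # ['link2', 'link3']
print([tag["id"] for tag in before_last])       # ['link2', 'link1']
```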

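find_all_next() and find_next()

Iterate over the nodes that come after the current node in the document

find_all_previous() and find_previous()

Iterate over the nodes that come before the current node in the document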
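
Unlike the sibling methods, these four walk the whole document in parse order, so they can cross into other branches of the tree. A small sketch on assumed markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original post.
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="link1")

next_link = first.find_next("a")                 # first <a> parsed after link1
later_strings = first.find_all_next(string=True) # includes link1's own text
earlier_title = first.find_previous("title")     # found in <head>, not a sibling
earlier_ps = first.find_all_previous("p")        # nearest first, including the enclosing <p>

print(next_link["id"])                       # link2
print(later_strings)                         # ['Elsie', 'Lacie']
print(earlier_title.string)                  # The Dormouse's story
print([p["class"][0] for p in earlier_ps])   # ['story', 'title']
```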
