HTML content search methods in the bs4 library

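I. An information extraction example

Extract all URL links from an HTML page.

Approach: 1) find all <a> tags;

2) parse each <a> tag and extract the link stored in its href attribute.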

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

>>> for link in soup.find_all('a'):
...     print(link.get("href"))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
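The two steps above can also be packaged as a small script. The following is a minimal sketch, not the author's code: `SAMPLE_HTML` is a trimmed stand-in for the demo page (inlined so the example runs without network access), and `extract_links` is a hypothetical helper name. It also adds one `<a>` tag without an href to show how such tags can be skipped.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the demo page (assumption: inlined to avoid a network call).
SAMPLE_HTML = """
<html><body>
<p class="course">Courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
<a name="top">anchor without href</a></p>
</body></html>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find the <a> tags; href=True keeps only those carrying an href.
    # Step 2: read the href attribute of each tag.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
for url in links:
    print(url)
```

Filtering with `href=True` avoids the `None` values that `link.get("href")` would return for anchors that have no href attribute.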

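II. HTML content search methods in the bs4 library

<>.find_all(name, attrs, recursive, string, **kwargs) searches a soup object (or any tag inside it) for matching content.

It returns a list holding the search results.

1. name: the search string for tag names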

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):  # passing True as the tag name matches every tag in the soup
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):  # matches tag names containing "b" (the regex is applied with search(); use '^b' for names starting with "b")
...     print(tag.name)
...
body
b
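In the demo page, "contains b" and "starts with b" happen to select the same tags. The sketch below uses a made-up snippet (the `<table>` markup is an assumption for illustration, not part of the demo page) to show where an unanchored and an anchored pattern differ.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet: "table" contains the letter b but does not start with it.
html = "<html><body><b>bold</b><table><tr><td>cell</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any tag whose name contains "b".
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchored pattern: only tags whose name starts with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_b)
print(starts_with_b)
```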

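2. attrs: the search string for tag attribute values; a specific attribute can also be named as a keyword argument. A bare string in this position is matched against the tag's class attribute.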

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
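The attribute filter can be written in several equivalent ways. The sketch below is an illustration on a shortened, made-up fragment (the relative hrefs are assumptions); it compares the positional form used above with the `class_` keyword and an `attrs` dictionary, and contrasts exact and regex matching on id.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page body (hrefs are placeholders).
html = """<p class="title"><b>Intro</b></p>
<p class="course"><a href="/basic" class="py1" id="link1">Basic Python</a>
<a href="/advanced" class="py2" id="link2">Advanced Python</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# Three spellings of "p tags whose class is course".
by_positional = soup.find_all("p", "course")
by_keyword = soup.find_all("p", class_="course")  # class is a Python keyword, hence class_
by_dict = soup.find_all("p", attrs={"class": "course"})

# A plain string must match the attribute value exactly; a regex can match part of it.
exact_id = soup.find_all(id="link")
regex_id = soup.find_all(id=re.compile("link"))

print(len(by_positional), len(exact_id), [a["id"] for a in regex_id])
```

This is why `soup.find_all(id='link')` returns an empty list above: no tag has an id that is exactly "link".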

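3. recursive: whether to search all descendants; defaults to True

With recursive=False, only direct children are searched. Starting from the soup root, none of its direct children is an <a> tag (the <a> tags sit deeper, among the descendants), so soup.find_all('a', recursive=False) returns an empty list.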
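The effect of recursive can be checked directly. This is a minimal sketch on a one-link fragment (an assumed stand-in for the demo page), comparing a full-depth search with a children-only search from two different starting points.

```python
from bs4 import BeautifulSoup

# Minimal stand-in: the <a> tag is nested three levels below the soup root.
html = '<html><body><p><a href="/basic" id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                        # searches every descendant
direct_links = soup.find_all("a", recursive=False)    # only direct children of the root
direct_of_p = soup.p.find_all("a", recursive=False)   # <a> is a direct child of <p>

# The only tag directly under the soup root is <html>.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

print(len(all_links), len(direct_links), len(direct_of_p), top_level)
```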

4. string: the search string for the text between <>...</>

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']
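The regex search above is case-sensitive, which is why strings containing only "Python" are missing from the result. The sketch below, on a shortened copy of the page (placeholder hrefs are assumptions), adds a case-insensitive search and shows that combining a tag name with string returns tags rather than strings.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (hrefs are placeholders).
html = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p><b>The demo python introduces several python courses.</b></p>"
    '<p><a href="/basic">Basic Python</a> and <a href="/advanced">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Basic Python")               # whole string must match
lowercase = soup.find_all(string=re.compile("python"))     # case-sensitive substring
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# With a tag name as well, the result holds the matching tags, not their strings.
link_tags = soup.find_all("a", string=re.compile("Python"))

print(exact, len(lowercase), len(any_case), [t.name for t in link_tags])
```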

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
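A quick way to convince yourself of the equivalence is to compare both spellings on the same input. The fragment below is a made-up example with placeholder hrefs.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two links inside one paragraph.
html = '<body><p><a href="/basic">Basic</a> <a href="/adv">Advanced</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

# Calling the soup (or a tag) like a function is shorthand for find_all().
soup_call_matches = soup("a") == soup.find_all("a")
tag_call_matches = soup.p("a") == soup.p.find_all("a")

print(soup_call_matches, tag_call_matches)
```

The explicit `find_all` spelling is usually easier to read in shared code; the call form is mainly a convenience in interactive sessions.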

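Seven extension methods

These accept the same name and attribute filters as find_all(); the singular forms return one result (or None), the plural forms return a list.

<>.find(): search the descendants and return only the first result

<>.find_parents(): search the ancestors and return a list

<>.find_parent(): search the ancestors and return the nearest match

<>.find_next_siblings(): search the following siblings and return a list

<>.find_next_sibling(): search the following siblings and return the first match

<>.find_previous_siblings(): search the preceding siblings and return a list

<>.find_previous_sibling(): search the preceding siblings and return the nearest match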
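The seven methods can be tried side by side on a small fragment. This is a sketch on an assumed stand-in for the demo page's course paragraph (placeholder hrefs), not output taken from the original post.

```python
from bs4 import BeautifulSoup

# Stand-in for the course paragraph: two sibling <a> tags inside one <p>.
html = (
    '<html><body><p class="course">'
    '<a href="/basic" id="link1">Basic Python</a> and '
    '<a href="/advanced" id="link2">Advanced Python</a>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")            # first match only
second = soup.find(id="link2")
missing = soup.find("table")      # no match: find() returns None, not an empty list

parent = first.find_parent("p")                                         # nearest matching ancestor
ancestors = [t.name for t in first.find_parents(["p", "body"])]         # nearest first

next_one = first.find_next_sibling("a")
next_all = first.find_next_siblings("a")
prev_one = second.find_previous_sibling("a")
prev_all = second.find_previous_siblings("a")

print(first["id"], parent.name, ancestors, next_one["id"], prev_one["id"])
```

Because `find()` and the other singular forms return None when nothing matches, it is worth checking the result before accessing attributes on it.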
