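An Introduction to the BeautifulSoup Parsing Library and Its Usage

### An introduction to the BeautifulSoup parsing library and its usage

### Three kinds of selectors: node selectors, method selectors, and CSS selectors

### Recommended order of preference: method selectors > CSS selectors > node selectors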
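To make the three selector styles concrete, here is a minimal sketch that locates the same link in each style. The two-link snippet is a made-up example, and `html.parser` is used only so the sketch needs nothing beyond bs4 itself; neither is taken from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only.
html = '<p class="money"><a id="l1" href="/1">one</a> <a id="l2" href="/2">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Node selector: attribute access walks to the first <a> under the first <p>.
by_node = soup.p.a
# Method selector: find() returns the first element matching the filters.
by_method = soup.find("a", attrs={"id": "l1"})
# CSS selector: select_one() returns the first element matching the selector.
by_css = soup.select_one("p.money > a#l1")

print(by_node["href"], by_method["href"], by_css["href"])  # /1 /1 /1
```

Node selectors are the shortest to write but can only reach the first tag of a given name, which is one reason the other two styles tend to be preferred.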

## Test text

text = '''

<html><head><title>there is money</title></head>

<body>

<p class="title" name="dmr"><b>there is money</b></p>

<p class="money">good good study, day day up

<a href="https://www.baidu.com/1" class="error" id="l1"><span><!-- 1 --></span></a>,

<a href="https://www.baidu.com/2" class="error" id="l2"><span>2</span></a> and

<a href="https://www.baidu.com/3" class="error" id="l3">3</a>;

66666666666

</p>

<p class='body'>...</p>

'''

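1. Basic usage

## Basic usage

from bs4 import BeautifulSoup

# Initialize a BeautifulSoup object, using the lxml parser

soup = BeautifulSoup(text, 'lxml')

# Print the document with standard indentation

print(soup.prettify())

# Extract the text content of the title node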

print(soup.title.string)

'''

Output:

<html>

 <head>

  <title>

   there is money

  </title>

 </head>

 <body>

  <p class="title" name="dmr">

   <b>

    there is money

   </b>

  </p>

  <p class="money">

   good good study, day day up

   <a class="error" href="https://www.baidu.com/1" id="l1">

    <!-- 1 -->

   </a>

   ,

   <a class="error" href="https://www.baidu.com/2" id="l2">

    2

   </a>

   and

   <a class="error" href="https://www.baidu.com/3" id="l3">

    3

   </a>

   ;

66666666666

  </p>

  <p class="body">

   ...

  </p>

 </body>

</html>

there is money

'''
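The test text never closes its `<body>` and `<html>` tags, yet the prettified output shows them closed: the parser repairs the markup while building the tree. A minimal sketch of the same repair on a smaller fragment follows. It assumes the built-in `html.parser` rather than `lxml`, and the fragment is made up for illustration; unlike `lxml`, `html.parser` does not wrap a fragment in `<html>` and `<body>`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two tags left open.
broken = "<p>plain <b>bold"
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags the parser supplied.
repaired = str(soup)
print(repaired)        # <p>plain <b>bold</b></p>
print(soup.b.string)   # bold
```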

2. Node selectors

### Node selectors

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

print(type(soup))

print(soup.title)

print(type(soup.title))

print(soup.p)

print(soup.head)

'''

Output:

<class 'bs4.BeautifulSoup'>

<title>there is money</title>

<class 'bs4.element.Tag'>

<p class="title" name="dmr"><b>there is money</b></p>

<head><title>there is money</title></head>

'''

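## Extracting information

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

# Extract the text content of the title tag

print(soup.title.string)

# Name of the p tag

print(soup.p.name)

# Attributes of the p tag, as a dict

print(soup.p.attrs)

print(soup.p.attrs.get('name'))

# attrs can be omitted: index the tag directly, as you would a dict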

print(soup.p['class'])

print(soup.p.get('class'))

print(soup.p.string)

'''

Output:

there is money

p

{'class': ['title'], 'name': 'dmr'}

dmr

['title']

['title']

there is money

'''
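`soup.p.string` works above only because the first `<p>` holds a single chain of children. The sketch below, on a made-up tag that mixes a child tag with loose text, shows where `.string` gives up and `get_text()` still works, and how `get()` differs from indexing for a missing attribute. The snippet and the use of `html.parser` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and mixed content.
html = '<p class="title big" name="dmr"><b>there is</b> money</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

single = p.b.string       # <b> has exactly one string child
mixed = p.string          # <p> has two children, so .string is None
full_text = p.get_text()  # concatenates every string under <p>
classes = p["class"]      # class is multi-valued, so this is a list
missing = p.get("id")     # get() returns None for an absent attribute

print(single, mixed, full_text, classes, missing)

try:
    p["id"]               # indexing an absent attribute raises KeyError
except KeyError:
    print("no id attribute")
```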

## Nested selection: selecting within a selection

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

print(soup.body.p.string)

'''

Output:

there is money

'''
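Chained attribute access fails with `AttributeError` as soon as one step does not exist, because a missing tag comes back as `None`. One way to guard against that is sketched below; `nested_string` is a hypothetical helper written for this example, not part of bs4, and the snippet and `html.parser` are assumptions as well.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>there is money</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

def nested_string(root, *names):
    """Follow tag names one step at a time; return None if a step is missing."""
    node = root
    for name in names:
        # find(name) does the same lookup as attribute access (node.name).
        node = node.find(name)
        if node is None:
            return None
    return node.string

found = nested_string(soup, "body", "p")
absent = nested_string(soup, "body", "table", "tr")
print(found)   # there is money
print(absent)  # None
```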

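## Selecting related nodes

## Children and descendants

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

# Direct children, including text nodes such as newlines; contents returns a list, while children returns an iterator (recommended)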

print(soup.body.contents)

print(len(soup.body.contents))

print(soup.body.children)

for i, child in enumerate(soup.body.children):

    print(i, child)

print(soup.body.descendants)

for j, item in enumerate(soup.body.descendants):

    print(j, item)

'''

Output:

['\n', <p class="title" name="dmr"><b>there is money</b></p>, '\n', <p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>, '\n', <p class="body">...</p>, '\n']

7

<list_iterator object at 0x0000000002DAD320>

0 

1 <p class="title" name="dmr"><b>there is money</b></p>

2 

3 <p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

4 

5 <p class="body">...</p>

6 

<generator object Tag.descendants at 0x0000000002D67E58>

0 

1 <p class="title" name="dmr"><b>there is money</b></p>

2 <b>there is money</b>

3 there is money

4 

5 <p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

6 good good study, day day up

7 <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>

8 <span><!-- 1 --></span>

9  1

10 ,

11 <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>

12 <span>2</span>

13 2

14  and 

15 <a class="error" href="https://www.baidu.com/3" id="l3">3</a>

16 3

17 ;

66666666666

18 

19 <p class="body">...</p>

20 ...

21

'''
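As the output shows, newline strings appear among the children alongside the tags. When only tags matter, they can be filtered by type. This is a minimal sketch on a made-up snippet, with `html.parser` assumed for convenience.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<p>one</p>\n<p>two <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Every direct child, including the '\n' text nodes.
all_children = list(soup.body.children)
# Keep only Tag objects among the direct children.
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]
# The same idea applied to all descendants, collecting tag names.
tag_descendants = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(all_children), len(tag_children))  # 5 2
print(tag_descendants)                       # ['p', 'p', 'a']
```

`soup.body.find_all(True, recursive=False)` returns the same direct child tags without the explicit type check.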

## Parent and ancestors

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

print(soup.a.parent)

print(soup.a.parents)

for i, parent in enumerate(soup.a.parents):

    print(i, parent)

'''

Output:

<p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

<generator object PageElement.parents at 0x0000000002D68E58>

0 <p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

1 <body>

<p class="title" name="dmr"><b>there is money</b></p>

<p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

<p class="body">...</p>

</body>

2 <html><head><title>there is money</title></head>

<body>

<p class="title" name="dmr"><b>there is money</b></p>

<p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

<p class="body">...</p>

</body></html>

3 <html><head><title>there is money</title></head>

<body>

<p class="title" name="dmr"><b>there is money</b></p>

<p class="money">good good study, day day up

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>,

<a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and

<a class="error" href="https://www.baidu.com/3" id="l3">3</a>;

66666666666

</p>

<p class="body">...</p>

</body></html>

'''
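Items 2 and 3 in the output print identically because the last ancestor is the `BeautifulSoup` object itself, which serializes the same way as the `<html>` tag it contains. Printing names instead of whole nodes makes the chain easier to read. A minimal sketch, assuming a made-up snippet and `html.parser`:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='/1'>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the first <a> up to the top of the tree.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# find_parent() jumps straight to the nearest ancestor with a given name.
body = soup.a.find_parent("body")
print(body.name)       # body
```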

## Siblings

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

print('Next sibling: ', soup.a.next_sibling)

print('Previous sibling: ', soup.a.previous_sibling)

print('Next siblings: ', soup.a.next_siblings)

print('Previous siblings: ', soup.a.previous_siblings)

'''

Output:

Next sibling:  ,

Previous sibling:  good good study, day day up

Next siblings:  <generator object PageElement.next_siblings at 0x0000000002D67E58>

Previous siblings:  <generator object PageElement.previous_siblings at 0x...>

'''
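The plural sibling attributes are generators, so printing them shows only the generator object. The sketch below consumes them into lists and uses `find_next_siblings` to keep only tags. The snippet is a shortened, made-up variant of the links paragraph, and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

html = '<p>start <a id="l1">1</a>, <a id="l2">2</a> and <a id="l3">3</a>; end</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Everything after the first <a>, text nodes included, in document order.
following = list(first.next_siblings)
# Only the <a> tags after it.
next_ids = [a["id"] for a in first.find_next_siblings("a")]
# previous_siblings yields the nearest sibling first, moving backwards.
preceding = list(soup.find(id="l3").previous_siblings)

print(len(following))      # 5
print(next_ids)            # ['l2', 'l3']
print(repr(preceding[0]))  # ' and '
```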

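3. Method selectors

### Method selectors: the more flexible option

## find_all: finds every element that matches and returns a list (a ResultSet) whose items are Tag objects

## find: finds the first element that matches and returns a single Tag object

## The same pattern applies to find_parents and find_parent

## find_next_siblings and find_next_sibling

## find_previous_siblings and find_previous_sibling

## find_all_next and find_next

## find_all_previous and find_previous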

from bs4 import BeautifulSoup

import re

soup = BeautifulSoup(text, 'lxml')

# Find every node named a; the result is a list

print(soup.find_all(name='a'))

print(soup.find_all(name='a')[0])

# Find the nodes whose id attribute is l1, and the nodes whose class attribute is error

print(soup.find_all(attrs={'id': 'l1'}))

print(soup.find_all(class_='error'))

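# Match text content against a keyword (string= replaces the older text= argument, which recent bs4 releases deprecate)

print(soup.find_all(string=re.compile('money')))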

'''

Output:

[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]

<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>

[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>]

[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]

['there is money', 'there is money']

'''
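`find_all` accepts a few more filter forms than the ones used above. The sketch below shows a result limit, a list of tag names, a regular expression on an attribute, and what `find` returns when nothing matches. The snippet is a condensed, made-up version of the test text, and `html.parser` is an assumption.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>there is money</b></p>'
    '<p class="money"><a href="https://www.baidu.com/1" id="l1">1</a>'
    '<a href="https://www.baidu.com/2" id="l2">2</a>'
    '<a href="https://www.baidu.com/3" id="l3">3</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)
# A list of names matches any of them.
b_and_a = soup.find_all(["b", "a"])
# A compiled regex is searched against the attribute value.
by_href = [a["id"] for a in soup.find_all("a", href=re.compile(r"/[23]$"))]
# find() returns None rather than an empty list when nothing matches.
no_table = soup.find("table")

print(len(first_two), len(b_and_a), by_href, no_table)
```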

4. CSS selectors

### CSS selectors: the select method returns a list

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

print(soup.select('p a'))

print(soup.select('.error'))

print(soup.select('#l1 span'))

print(soup.select('a'))

print(type(soup.select('a')))

'''

Output:

[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]

[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]

[<span><!-- 1 --></span>]

[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]

<class 'bs4.element.ResultSet'>

'''
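Beyond tag, class and id selectors, `select` also understands child combinators and attribute selectors, and `select_one` returns a single element instead of a list. A minimal sketch on a condensed, made-up version of the test text, with `html.parser` assumed:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="money"><a href="https://www.baidu.com/1" id="l1">1</a>'
    '<a href="https://www.baidu.com/2" id="l2">2</a>'
    '<a href="https://www.baidu.com/3" id="l3">3</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when there is none.
second = soup.select_one("#l2")
nothing = soup.select_one("table")
# '>' restricts the match to direct children of p.money.
direct_links = soup.select("p.money > a")
# [href$="..."] matches attribute values ending with the given text.
ends_with_3 = [a["id"] for a in soup.select('a[href$="/3"]')]

print(second["href"], nothing, len(direct_links), ends_with_3)
```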

## Nested selection, getting attributes, getting text

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')

# Nested selection

for i in soup.select('a'):

    print(i.select('span'))

# Getting attributes

print(soup.select('a')[0].attrs)

print(soup.select('a')[0].get('class'))

# Getting text

print(soup.select('a')[1].string)

print(soup.select('a')[2].get_text())

'''

Output:

[<span><!-- 1 --></span>]

[<span>2</span>]

[]

{'href': 'https://www.baidu.com/1', 'class': ['error'], 'id': 'l1'}

['error']

2

3

'''
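The first link in the test text contains only an HTML comment, and `.string` on such a tag returns a `Comment` object rather than visible text. When collecting link text, it may be worth filtering those out. In the sketch below, `visible_string` is a hypothetical helper written for this example, and the two-link snippet and `html.parser` are assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = (
    '<a href="https://www.baidu.com/1"><span><!-- 1 --></span></a>'
    '<a href="https://www.baidu.com/2"><span>2</span></a>'
)
soup = BeautifulSoup(html, "html.parser")

def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

# Pair each link's href with its visible text, if any.
links = [(a["href"], visible_string(a)) for a in soup.select("a")]
print(links)
# [('https://www.baidu.com/1', None), ('https://www.baidu.com/2', '2')]
```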
