# Python Web Scraping Series: BeautifulSoup in Detail
## Installation
pip3 install beautifulsoup4
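## Parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup,'html.parser') | Built into Python, decent speed, lenient with malformed documents | Less lenient in Python versions before 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,'lxml') | Very fast, lenient with malformed documents | Requires an external C library |
lxml XML parser | BeautifulSoup(markup,'xml') | Very fast, the only XML parser supported | Requires an external C library |
html5lib | BeautifulSoup(markup,'html5lib') | Most lenient, parses pages the way a browser does, produces valid HTML5 | Very slow, requires an external Python package |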
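The parser can also be chosen at runtime. The sketch below assumes lxml may or may not be installed; `make_soup` is a hypothetical helper name used only for illustration, not part of bs4. It tries lxml first and falls back to the built-in html.parser when bs4 reports that the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser.

    make_soup is a hypothetical helper for illustration, not a bs4 API.
    """
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p class="title"><b>The Dormouse\'s story</b></p>')
print(soup.b.string)
```

Different parsers can build slightly different trees from malformed markup, so it is worth naming the parser explicitly rather than relying on whatever happens to be installed.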
## Basic usage
html = """
<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
The parser completes the missing closing tags automatically, and prettify() prints the indented result:
<html dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dormouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well
</p>
<p class="story">
...story go on...
</p>
</body>
</html>
print(soup.title.string)
This prints the title of the HTML document:
The Dormouse's story
## Tag selectors
### Selecting elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
The output is:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned
### Getting the tag name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
title
### Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Both forms return the value of the attribute:
dormouse
dormouse
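Indexing a tag with a missing attribute raises KeyError, so `get()` is often safer. A minimal sketch, assuming a small made-up snippet and the built-in html.parser (chosen here only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; html.parser avoids the lxml dependency.
snippet = '<p class="title big" name="dormouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(snippet, 'html.parser')

p = soup.p
name_value = p.get('name')   # same value as p['name']
missing = p.get('id')        # None, whereas p['id'] would raise KeyError
classes = p['class']         # class is multi-valued, so a list is returned

print(name_value, missing, classes)
```

Note that `class` comes back as a list of class names rather than a single string.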
### Getting the text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)
The Dormouse's story
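`.string` only works when a tag has a single child; for a tag with mixed content it is None. The sketch below uses a made-up snippet and html.parser to show the difference from `get_text()`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
soup = BeautifulSoup('<p>Once upon a time <b>three</b> sisters</p>', 'html.parser')

single_child = soup.b.string     # <b> has exactly one text child
mixed_children = soup.p.string   # <p> has several children, so this is None
full_text = soup.p.get_text()    # concatenates all text inside <p>

print(single_child)
print(mixed_children)
print(full_text)
```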
### Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)
The Dormouse's story
### Child and descendant nodes
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)
['Once upon a time there were three little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well\n ']
.children is an iterator:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)
Grandchild nodes are printed as well:
<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
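Both `.children` and `.descendants` yield text nodes as well as tags. When only tags are wanted, they can be filtered by type. A minimal sketch, assuming a shortened made-up version of the paragraph above and html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the sample paragraph.
html = '<p>names: <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only Tag objects, skipping the NavigableString text nodes.
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

The span appears only in the descendants list because it is a grandchild of the p tag, not a direct child.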
### Parent and ancestor nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
Output:
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))
Output:
[(0, 'Once upon a time there were three little sisters;and their names were\n '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.parents)))
This lists every ancestor; the last entry is the root node of the document:
[(0, <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>)]
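Printing whole ancestors is verbose. To see just the chain of ancestors, their names can be collected instead. A minimal sketch on a reduced, made-up document, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical document for illustration.
html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Walk from the <a> tag up to the root of the parse tree.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The final name, `[document]`, is the BeautifulSoup object itself, which is why the html element seems to appear twice in the full output above.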
### Sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
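Output:
[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.previous_siblings)))
[(0, 'Once upon a time there were three little sisters;and their names were\n ')]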
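Because siblings include text nodes, `find_next_sibling()` and `find_next_siblings()` are convenient when only sibling tags matter. A minimal sketch on a shortened, made-up paragraph, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample paragraph.
html = ('<p>were <a id="link1">Elsie</a><a id="link2">Lacie</a>and'
        '<a id="link3">Tillie</a>; the end</p>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
next_tag = first.find_next_sibling('a')                        # skips text nodes
later_ids = [a['id'] for a in first.find_next_siblings('a')]   # all later <a> siblings
prev = first.previous_sibling                                  # the text node before it

print(next_tag['id'], later_ids, repr(prev))
```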
## Standard selectors
### find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.
#### name
```py
html = """
<div class="panel">
<div class="panel-heading">
<h4>Helllo</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
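```
The output is:
```html
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
```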
```py
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```
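The output is:
```html
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
```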
#### attrs
html = '''
<div class="panel">\n <div class="panel-heading">\n <h4>Helllo</h4>\n </div>\n <div class="panel-body">\n <ul class="list" id="list-1" name=elements>\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n <li class="element">Jay</li>\n </ul>\n <ul class="list list-small" id="list-2">\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n </ul>\n </div>\n</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
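If you know the id or class, you can also pass it directly as a keyword argument (class_ has a trailing underscore because class is a Python keyword):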
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
#### text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
['Foo', 'Foo']
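The filters above can be combined with regular expressions and a result limit. The sketch below assumes Beautiful Soup 4.4.0 or later, where the text filter is spelled `string` (the call above uses the older name `text`), and uses html.parser on a trimmed, made-up copy of the list:

```python
import re
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the sample list.
html = '''
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Match text nodes against a regular expression instead of an exact string.
starts_with_ba = soup.find_all(string=re.compile('^Ba'))
# Stop after the first two matching tags.
first_two = [li.get_text() for li in soup.find_all('li', limit=2)]
# Combine a tag name with an attribute filter.
elements = soup.find_all('li', class_='element')

print(starts_with_ba, first_two, len(elements))
```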
### find(name,attrs,recursive,text,**kwargs)
find returns a single element (the first match), while find_all returns all matching elements.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
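<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>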
print(type(soup.find('ul')))
<class 'bs4.element.Tag'>
print(type(soup.find('page')))
If no element matches, find returns None:
<class 'NoneType'>
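Since find() can return None, chaining an attribute access onto it raises AttributeError when nothing matches. A minimal sketch of a guard, where `first_text` is a hypothetical helper name used for illustration and not part of bs4:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', 'html.parser')


def first_text(soup, name, default=''):
    """Return the text of the first <name> tag, or default if there is none.

    first_text is a hypothetical helper for illustration, not a bs4 API.
    """
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get_text()


found = first_text(soup, 'li')
absent = first_text(soup, 'page', default='n/a')
print(found, absent)
```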
## CSS selectors
Pass a CSS selector directly to select() to make a selection.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])
Output:
[<div class="panel-heading">
<h4>Helllo</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
Iterating over the results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
### Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
### Getting the text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar
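When only the first match is needed, `select_one()` is the CSS counterpart of find(). The sketch below uses a made-up list and html.parser, and assumes a Beautiful Soup version recent enough to provide `select_one()` and the child combinator `>`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '''
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element"> Baz </li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

first_small = soup.select_one('ul.list-small li')    # first match only
stripped = first_small.get_text(strip=True)          # trims surrounding whitespace
direct_items = [li.get_text() for li in soup.select('#list-1 > li')]
no_match = soup.select_one('ul#list-3')              # None when nothing matches

print(stripped, direct_items, no_match)
```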
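## Summary
- Prefer the lxml parser; fall back to html.parser when necessary.
- Tag selectors (such as soup.title) are fast but offer weak filtering.
- Use find() to match a single result and find_all() to match multiple results.
- If you are familiar with CSS selectors, use select().
- Remember the common ways of getting attribute values and text.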