Beautiful Soup 4.4.0 基本使用方法

Beautiful Soup 4.4.0 基本使用方法
Beautiful Soup 安装 pip install beautifulsoup4 标准库有html.parser解析器但速度不是很快一般还需安装第三方的解析器：
pip install lxml pip install html5lib
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story

Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

...
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'html.parser')
soup.title #title标签 <title>The Dormouse's story</title>
soup.title.string #title标签内容 The Dormouse's story
soup.title.name#title标签名称 title
soup.title.parent #head标签 <head><title>The Dormouse's story</title></head> children
soup.head.contents#取head标签里的所有子标签 <title>The Dormouse's story</title>
soup.head.contents[0]#取head标签的第一个子标签
soup.p #第一个p标签 The Dormouse's story
soup.p['class']#第一个p标签class名称 html解析器返回结果是一个列表[title] xml返回是一个字符串“title”
soup.prettify() #按照标准的缩进格式的结构完整输出（自动补结尾的</body></html>
soup.find_all('a')#找所有的a标签
soup.find(id='link3')#找id为link3的标签
soup.find_all(["a", "b"])#查找a b标签
soup.find_all("p", "title")#查找p标签并且属性为title
soup.find_all("a", class_="sister")
soup.find_all("a", limit=2)
soup.get_text()#从文档中获取所有文字内容:
for a in soup.find_all('a'):
print a.get('href') #取得所有a标签的href值
调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False
css选择器：

通过tag标签逐层查找:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
找到某个tag标签下的直接子标签 [6] :

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
找到兄弟节点标签:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
通过CSS的类名查找:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
通过tag的id查找:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
同时用多种CSS选择器查询元素:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
通过是否存在某个属性来查找:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
通过属性的值来查找:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Beautiful Soup 4.4.0 基本使用方法的更多相关文章

Beautiful Soup 4.2.0 文档
Beautiful Soup 4.2.0 文档 Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方 ...
吴裕雄--天生自然python学习笔记：Beautiful Soup 4.2.0模块
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时 ...
Beautiful Soup 4.2.0
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式快速开始 pip install beaut ...
Beautiful Soup 4.2.0 doc_tag、Name、Attributes、多值属性
找到了bs4的中文文档,对昨天爬虫程序里所涉及的bs4库进行学习.这篇代码涉及到tag.Name.Attributes以及多值属性. ''' 对象的种类 Beautiful Soup将复杂HTML文档 ...
Beautiful Soup 4.2.0 文档（一）
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时 ...
转：Beautiful Soup
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时 ...
Beautiful Soup 定位指南
Reference: http://blog.csdn.net/abclixu123/article/details/38502993 网页中有用的信息通常存在于网页中的文本或各种不同标签的属性值,为 ...
Beautiful Soup 学习手册
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式快速开始下面的一段HTML代码将作为例 ...
爬虫-Beautiful Soup模块
阅读目录一介绍二基本使用三遍历文档树四搜索文档树五修改文档树六总结一介绍 Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通 ...

随机推荐

laravel的elixir和gulp用来对前端施工
使用laravel elixer npm install --global gulp ok 然后在安装好的laravel 下 npm install 以安装 laravel-elixir subli ...
KMP字符串匹配算法翔解❤
看了Angel_Kitty学姐的博客,我豁然开朗,写下此文: 那么首先我们知道,kmp算法是一种字符串匹配算法,那么我们来看一个例子. 比方说,现在我有两段像这样子的字符串: 分别是T和P,很明显,P ...
用vs2008和vs2005创建win32 console application
http://blog.sina.com.cn/s/blog_4900be890100s735.html 对于经常使用vc6.0的人来说,创建一个win32 console application很简 ...
/proc/sys/shm/drop_caches
author:skate time:2012/02/22 手工释放linux内存--/proc/sys/vm/drop_cache 转载一篇文章 linux的内存查看: [root@localhost ...
7.添加OpenStack计算服务
添加计算服务安装和配置控制器节点创建数据库 mysql -uroot -ptoyo123 CREATE DATABASE nova; GRANT ALL PRIVILEGES ON nova.* ...
以最简单的方式了解--Github
大概是从寒假的时候开始正式的赚取github,从github上面学习一些开源的文档,我记得我注册github账号到现在已经9个月了,但只有最近的2个月才发现github这个新世界,写这篇文章是为了刚入 ...
洛谷——P1067 多项式输出
P1067 多项式输出题目描述一元 n 次多项式可用如下的表达式表示: 其中,aixi称为 i 次项,ai 称为 i 次项的系数.给出一个一元多项式各项的次数和系数,请按照如下规定的格式要求输出该 ...
oracle数据迁移之Exp和Expdp导出数据的性能对比与优化
https://wangbinbin0326.github.io/2017/03/31/oracle%E6%95%B0%E6%8D%AE%E8%BF%81%E7%A7%BB%E4%B9%8BExp%E ...
DQL数据查询语言
--查询全表select * from t_hq_ryxx; --查询字段select xingm as 姓名 ,gongz as 工资 from t_hq_ryxx; --链接字段查询select ...
Could not find com.android.tools.build:gradle:3.0.0-alpha3
最近使用Android Studio 3.0 canary 3 时新建项目遇到标题所示错误,后网上找到解决办法.记录如下: 在项目的build.gradle文件中添加如下内容即可解决. reposit ...

Beautiful Soup 4.4.0 基本使用方法

Beautiful Soup 4.4.0 基本使用方法的更多相关文章

随机推荐

热门专题