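Learning Python Modules: bs4

1. Installing bs4

I'm on Ubuntu 14.04, so apt-get is all it takes:

sudo apt-get install python-bs4

2. Installing a parser

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml.

sudo apt-get install python-lxml

3. How to use it

Pass a document to the BeautifulSoup constructor to get a document object. The input can be either a string or an open file handle.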

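from bs4 import BeautifulSoup

# Parse from an open file handle; naming the parser keeps the result predictable
soup = BeautifulSoup(open("index.html"), "lxml")

# Or parse directly from a string
soup = BeautifulSoup("<html>data</html>", "lxml")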

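4. Kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is one of four kinds: Tag, NavigableString, BeautifulSoup, or Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")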

tag = soup.b

type(tag)

# <class 'bs4.element.Tag'>

Every tag has a name, available through .name:

tag.name

# u'b'

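A tag can have any number of attributes. They are read like dictionary keys, and .attrs returns all of them at once. Because class can hold several values, Beautiful Soup returns it as a list:

tag['class']

# [u'boldest']

tag.attrs

# {u'class': [u'boldest']}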
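Since attributes are read like dictionary keys, a missing attribute can be handled the same way as a missing key. The following is a small sketch rather than part of the original walkthrough: the markup, the extra id attribute and the variable names are made up for illustration, it is written for Python 3, and it uses the built-in "html.parser" so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the article); "html.parser" ships with Python.
soup = BeautifulSoup('<b class="boldest big" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

# class can hold several values, so it comes back as a list.
classes = tag["class"]

# A single-valued attribute such as id comes back as a plain string.
tag_id = tag["id"]

# .get() returns None for a missing attribute, whereas tag["href"] would raise KeyError.
missing = tag.get("href")

print(classes, tag_id, missing)
```

Using .get() is usually the safer choice when scraping pages where an attribute may or may not be present.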

NavigableString

Text usually sits inside a tag. Beautiful Soup wraps it in a NavigableString:

tag.string

# u'Extremely bold'

type(tag.string)

# <class 'bs4.element.NavigableString'>

BeautifulSoup

The BeautifulSoup object represents the document as a whole.

soup

# <html><body><b class="boldest">Extremely bold</b></body></html>

type(soup)

# <class 'bs4.BeautifulSoup'>

Comment

A Comment usually represents a comment in the document. It is a special type of NavigableString.
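A short sketch of what a Comment looks like in practice. The markup here is invented for illustration, the code targets Python 3, and "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative markup: a <b> tag whose only child is an HTML comment.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string

# The node is a Comment, which is itself a kind of NavigableString.
is_comment = isinstance(comment, Comment)
is_string = isinstance(comment, NavigableString)

# Converting to str gives the comment text without the <!-- --> delimiters.
text = str(comment)

print(type(comment).__name__, repr(text))
```

Checking isinstance(node, Comment) is one way to skip comments when collecting the visible text of a page.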

5. Navigating the tree

Using tag names

A tag can be reached by using its name as an attribute, and these lookups can be chained.
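The examples from here on query a document that this article never shows. The sketch below is a reconstruction that is consistent with the outputs printed in the following sections; the exact markup is an assumption, as is the use of "html.parser" (chosen so the snippet runs without lxml). It is written for Python 3.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Chained attribute lookups: each step returns the first matching tag.
first_link = soup.body.a
title_text = soup.head.title.string

print(first_link["id"], title_text)
```

With soup built this way, the lookups that follow can be tried out directly.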

soup.head

# <head><title>The Dormouse's story</title></head>

soup.title

# <title>The Dormouse's story</title>

Attribute-style access returns only the first tag with that name:

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get every <a> tag, use find_all():

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

6. Searching the tree

The two most important search methods in Beautiful Soup are find() and find_all().
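The two methods differ mainly in what they return. A minimal sketch, assuming a made-up fragment, Python 3 and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Illustrative fragment with two links and no images.
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")    # a list of every match
first_link = soup.find("a")       # only the first match

no_images = soup.find_all("img")  # nothing found: an empty list
no_image = soup.find("img")       # nothing found: None

print(len(all_links), first_link["id"], no_images, no_image)
```

Because find() can return None, it is worth checking the result before reading attributes from it.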

Filters

The simplest filter is a string:

soup.find_all('b')

# [<b>The Dormouse's story</b>]

A regular expression can be passed as the argument; it is matched against tag names:

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b

A list matches any of its items:

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,

#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

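If none of these filters fits, you can write your own function. It takes a tag as its argument and returns True when the tag matches.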
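A custom filter could look like the sketch below. The function name has_class_but_no_id and the markup are illustrative choices, not something defined earlier in this article; the code assumes Python 3 and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Illustrative fragment: two <p> tags with a class, one <a> with class and id.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all() calls the function once per tag and keeps those returning True.
matches = soup.find_all(has_class_but_no_id)
names = [tag.name for tag in matches]

print(names)
```

Only the two <p> tags match: <b> has no class, and <a> has an id.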

find_all()

find_all(name, attrs, recursive, text, limit, **kwargs)

The name argument

The name argument finds every tag with the given name, such as title, head, body, or p.
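The sketch below tries the name argument together with two other find_all() parameters: limit, which caps the number of results, and recursive, which restricts the search to direct children when set to False. The markup is made up for illustration; Python 3 and the built-in "html.parser" are assumed.

```python
from bs4 import BeautifulSoup

# Illustrative document: one <title> inside <head>, three links inside <body>.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# name: every tag called "title".
titles = soup.find_all("title")

# limit: stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# recursive=False: look only at direct children of <html>.
# <title> sits inside <head>, so nothing is found.
direct_titles = soup.html.find_all("title", recursive=False)

print(titles, [a["id"] for a in first_two], direct_titles)
```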

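Keyword arguments

Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. If you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute: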

soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass href, Beautiful Soup filters on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

When filtering on an attribute, the value can be a string, a regular expression, a list, or True.

The following example finds every tag that has an id attribute, whatever its value:

soup.find_all(id=True)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing several keyword arguments filters on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

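Searching by CSS class

Because class is a reserved word in Python, Beautiful Soup uses the keyword argument class_ instead.

Like the other arguments, class_ accepts several kinds of filter: a string, a regular expression, a function, or True.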
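A brief sketch of class_ with three of these filter types. The markup, including the second class on one link, is invented for illustration; Python 3 and the built-in "html.parser" are assumed.

```python
import re

from bs4 import BeautifulSoup

# Illustrative fragment; the second link carries two classes.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a>
<a class="sister big" id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# String: matches any tag that has "sister" among its classes.
sisters = soup.find_all("a", class_="sister")

# Regular expression: classes starting with "t" (only "title" here).
starts_with_t = soup.find_all(class_=re.compile("^t"))

# True: any tag that has a class attribute at all.
any_class = soup.find_all(class_=True)

print([a["id"] for a in sisters], [t.name for t in starts_with_t], len(any_class))
```

Note that a tag with several classes matches when any one of them satisfies the filter.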

The text argument

The text argument searches the strings in the document rather than the tags. Like name, it accepts a string, a regular expression, a list, or True.
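A small sketch of the text argument with a string, a list and a regular expression. The markup is illustrative; Python 3 and the built-in "html.parser" are assumed. Newer Beautiful Soup releases document this argument under the name string and may emit a deprecation warning for text, which is kept here to match the article.

```python
import re

from bs4 import BeautifulSoup

# Illustrative document containing the same phrase in <title> and <b>.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p><b>The Dormouse's story</b></p>
<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match: returns the matching strings, not the tags around them.
exact = soup.find_all(text="Elsie")

# List: any string equal to one of the items.
listed = soup.find_all(text=["Elsie", "Lacie", "Tillie"])

# Regular expression: any string containing "Dormouse".
by_regex = soup.find_all(text=re.compile("Dormouse"))

print(exact, listed, by_regex)
```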

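Calling a tag is like calling `find_all()`

find_all() is probably the most commonly used search method in Beautiful Soup, so it has a shorthand. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object. These two lines are equivalent: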

soup.find_all("a")

soup("a")

These two lines are also equivalent:

soup.title.find_all(text=True)

soup.title(text=True)
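The equivalence can be checked directly. This is a sketch on a made-up document, assuming Python 3 and the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Illustrative document with a title and two links.
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><a id='link1'>Elsie</a><a id='link2'>Lacie</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object gives the same result as find_all().
same_links = soup("a") == soup.find_all("a")

# The same shorthand works on a Tag, here with the text argument.
title_strings = soup.title(text=True)

print(same_links, title_strings)
```

The shorthand saves typing, although spelling out find_all() tends to be easier to read in longer scripts.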

CSS selectors

Beautiful Soup supports most CSS selectors [6]. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find the matching tags:

soup.select("title")

# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")

# [<p class="story">...</p>]

Find tags beneath other tags, at any depth:

soup.select("body a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")

# [<title>The Dormouse's story</title>]

Find tags that are direct children of another tag [6]:

soup.select("head > title")

# [<title>The Dormouse's story</title>]

soup.select("p > a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")

# []

Find sibling tags:

soup.select("#link1 ~ .sister")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

soup.select("#link1 + .sister")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

soup.select(".sister")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup.select("#link1")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by whether an attribute exists:

soup.select('a[href]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')

# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
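Since select() returns a list of ordinary Tag objects, the results can be processed like any other tag. The sketch below collects link text and URLs into a dictionary. The fragment mirrors the three links used above, the variable names are illustrative, and Python 3 with the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Illustrative fragment with the three "sister" links.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Select <a class="sister"> tags that are direct children of a <p>,
# then map each link's text to its href attribute.
links = {a.get_text(): a["href"] for a in soup.select("p > a.sister")}

for name, url in links.items():
    print(name, url)
```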
