BeautifulSoup (reposted)

BeautifulSoup basic usage and the find selectors

　　Summarized from the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all

Sample code

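from bs4 import BeautifulSoup

html_doc = """
<html>
    <head><title>The Dormouse's story <!--Hey, buddy. Want to buy a used parser?-->
    <a><!--Hey, buddy. Want to buy a used parser?--></a></title>
    </head>
<body>
    <p class="title">
        <b>The Dormouse's story</b>
        <a><!--Hey, buddy. Want to buy a used parser?--></a>
    </p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Note: the outputs quoted in the examples that follow appear to come from the official
# documentation's simpler version of this sample (no comments or extra <a> tags inside
# <title> and <p class="title">, and id="link1"), so a few of them differ from what this
# exact string yields.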

　　1. Quick operations:

soup.title  == soup.find('title')

# <title>The Dormouse's story</title>

soup.title.name

# 'title'

soup.title.string  == soup.title.text  == soup.title.get_text()  # equal only when the tag holds a single text node

# "The Dormouse's story"

soup.title.parent.name

# 'head'

soup.p   == soup.find('p')  # dot access returns only the first matching tag under the current tag

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# ['title']  (class is a multi-valued attribute, so a list is returned)

soup.a  == soup.find('a')

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(['a','b'])  # find all <a> tags and all <b> tags

soup.find_all(id=["link1","link2"])  # find all tags whose id is link1 or link2

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

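　　2. Beautiful Soup works with four kinds of objects:

　　　　1. BeautifulSoup: the parsed document as a whole

　　　　2. Tag: an HTML or XML tag

　　　　3. NavigableString: the text inside a tag; its subclass Comment covers comment content

　　　　4. Comment: a comment inside a tag; it holds only the comment, no regular text

　　Tag attributes are read and modified exactly like dictionary entries.

　　Multi-valued HTML attributes (this does not apply to XML):

　　　　Some attribute names can hold several values at once. For these, the value is always returned as a list, even when only one value is present.

　　　　Examples: class, rel, rev, accept-charset, headers, accesskey

　　　　All other attributes are single-valued: even if the value contains several space-separated parts, it is returned as one string.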

soup.a['class']  # ['sister']

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')

id_soup.p['id']  # 'my id'
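The difference between multi-valued and single-valued attributes, and the dictionary-style access described above, can be sketched as follows. The one-line `snippet` document is made up for this illustration and is not part of the original sample.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for this illustration.
snippet = '<p class="body strikeout" id="my id" data-foo="x y">text</p>'
tag = BeautifulSoup(snippet, 'html.parser').p

class_value = tag['class']    # multi-valued attribute: returned as a list
id_value = tag['id']          # single-valued attribute: returned as one string
data_value = tag['data-foo']  # attributes not on the multi-valued list stay strings

# Attributes behave like dictionary entries: assign, delete, and look up safely.
tag['id'] = 'new-id'
del tag['data-foo']
missing = tag.get('href')     # None instead of a KeyError

print(class_value, id_value, data_value, missing)
print(tag.attrs)
```

Using `tag.get(...)` rather than `tag[...]` is the usual choice when an attribute may be absent.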

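　　3. Getting the content of a tag:

　　　　string: returns the tag's only child when that child is a text node or a comment (or a single child tag that itself has a .string); if the tag has more than one child, for example text plus a comment, the result is None

　　　　strings: a generator over the text of all descendants (comments excluded)

　　　　stripped_strings: like strings, but whitespace-only strings are skipped and leading and trailing whitespace is stripped from the rest (comments excluded)

　　　　text: the text only, concatenated across all descendant tags

　　　　get_text(): same as text, concatenated across all descendant tags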
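A compact side-by-side comparison of these accessors may help. This is a minimal sketch on a made-up `snippet` (not the sample document), chosen so that one tag has a single text child and another has several children plus a comment.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two paragraphs and a trailing comment inside one <div>.
snippet = '<div><p>One <b>two</b></p><p> three </p><!--note--></div>'
div = BeautifulSoup(snippet, 'html.parser').div

single = div.b.string                    # <b> has exactly one text child
multiple = div.string                    # <div> has several children, so None
all_strings = list(div.strings)          # every text node, whitespace preserved
stripped = list(div.stripped_strings)    # same, with surrounding whitespace removed
joined = div.get_text('|', strip=True)   # text joined with a separator

print(repr(single), multiple)
print(all_strings)
print(stripped)
print(joined)
```

The comment never appears in `strings`, `stripped_strings`, or `get_text()`; it is only reachable through `.string` on a tag whose sole child is that comment.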

　　string:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup, 'html.parser')

comm = soup.b.string

print(comm)  # Hey, buddy. Want to buy a used parser?

print(type(comm))  #<class 'bs4.element.Comment'>

　　strings:

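soup = BeautifulSoup(html_doc, 'html.parser')  # re-parse the sample document; soup currently holds the one-tag markup above

head_tag = soup.body  # despite the variable name, this is the <body> tag

for s in head_tag.strings:

    print(repr(s))

Result: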

'\n'

"The Dormouse's story"

'\n'

'Once upon a time there were three little sisters; and their names were\n        '

'Elsie'

',\n        '

'Lacie'

' and\n        '

'Tillie'

';\n        and they lived at the bottom of a well.\n    '

'\n'

'...'

'\n'

　　stripped_strings:

head_tag = soup.body

for s in head_tag.stripped_strings:

    print(repr(s))

Result:

"The Dormouse's story"

'Once upon a time there were three little sisters; and their names were'

'Elsie'

','

'Lacie'

'and'

'Tillie'

';\n        and they lived at the bottom of a well.'

'...'

　　text:

soup = BeautifulSoup(html_doc, 'html.parser')

head_tag = soup.body

print(head_tag.text)

Result:

The Dormouse's story

Once upon a time there were three little sisters; and their names were

        Elsie,

        Lacie and

        Tillie;

        and they lived at the bottom of a well.

...

soup = BeautifulSoup(html_doc, 'html.parser')

head_tag = soup.body

print(repr(head_tag.text))

Result:

"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n        Elsie,\n        Lacie and\n        Tillie;\n        and they lived at the bottom of a well.\n    \n...\n"

　　4. Getting the child nodes:

　　　　.contents: returns the direct children of a node as a list

　　　　.children: returns the direct children of a node as a generator

soup = BeautifulSoup(html_doc, 'html.parser')

head_tag = soup.head

print(head_tag)

print(head_tag.contents)

print(head_tag.contents[0])

print(head_tag.contents[0].contents)

for ch in head_tag.children:

    print(ch)

Result:

<head><title>The Dormouse's story</title></head>

[<title>The Dormouse's story</title>]

<title>The Dormouse's story</title>

["The Dormouse's story"]

<title>The Dormouse's story</title>

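　　5. Getting the descendant nodes as a generator:

　　　　.descendants: returns all descendants of a tag (children, grandchildren, and so on) as a generator, not a list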

for ch in head_tag.descendants:

    print(ch)

Result:

<title>The Dormouse's story</title>

The Dormouse's story
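To make the difference between direct children and descendants concrete, here is a small sketch on a made-up nested list (the `snippet` string is an assumption for illustration only).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two levels of nesting.
snippet = '<ul><li>a</li><li><b>b</b></li></ul>'
ul = BeautifulSoup(snippet, 'html.parser').ul

direct = ul.contents                                  # a real list: can be indexed
child_names = [c.name for c in ul.children]           # direct children only
descendant_reprs = [str(d) for d in ul.descendants]   # tags and text, in document order

print(direct[0])
print(child_names)
print(descendant_reprs)
```

`.contents` is the one to use when indexing or `len()` is needed; `.children` and `.descendants` are meant to be iterated.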

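　　6. Parent tag (.parent): every bs4 object, whether a Tag or a NavigableString, can reach its parent; a plain Python str (such as the value of .text) has no .parent

soup = BeautifulSoup(html_doc, 'html.parser')

tag_title = soup.b  # the <b> tag

print(tag_title.parent)  # parent of the <b> tag: <p>

print(type(tag_title.string))  # type of the text inside <b>: <class 'bs4.element.NavigableString'> (.string would be None if <b> also contained a comment)

print(tag_title.string.parent)  # parent of the text inside <b>: the <b> tag

print(type(tag_title.text))  # .text is a plain str, which has no bs4 attributes such as .parent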

　　7. Ancestors (.parents): iterates over all of an element's ancestors, from the nearest one up to the document

soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.a

for parent in link.parents:

    print(parent.name)
Result:

p body html [document]

　　8. Previous and next nodes (nodes, not just tags; text counts as a node too): previous_sibling, next_sibling
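Because text nodes count as siblings, `next_sibling` often returns punctuation or whitespace rather than the next tag. A minimal sketch, using a made-up `snippet` rather than the sample document:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two links separated by a comma and a space.
snippet = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')
first = soup.find(id='link1')

next_node = first.next_sibling            # the text node ', ', not the second link
next_tag = first.find_next_sibling('a')   # skips text nodes and returns a tag
before_first = first.previous_sibling     # nothing precedes the first link: None

print(repr(next_node))
print(next_tag)
print(before_first)
```

When only tags are wanted, `find_next_sibling()` and `find_previous_sibling()` are usually more convenient than the raw sibling attributes.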

　　9. Iterating over all siblings as a generator: next_siblings (and previous_siblings)

for sib in soup.a.next_siblings:

    print(sib)

    print("---------")

Result:

-------------

,

---------

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

---------

---------

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

---------

;

        and they lived at the bottom of a well.

---------

　　10. Searching the document tree

　　　　Filters:

　　　　　　1. A string

　　　　　　2. A regular expression

　　　　　　3. A list

　　　　　　4. True

　　　　　　5. A function

html_doc = """<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were</p>

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

<p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

import re

soup = BeautifulSoup(html_doc, 'html.parser')

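soup.find_all("a")  # string filter: tags named exactly "a"

soup.find_all(re.compile("^b"))  # regex filter: tag names starting with "b" (<body>, <b>)

soup.find_all(re.compile("a"))  # regex filter: tag names containing "a"

soup.find_all(re.compile("l$"))  # regex filter: tag names ending in "l"

soup.find_all(["a", "b"])  # list filter: <a> tags and <b> tags

soup.find_all(True)  # True: every tag, but no text nodes

def has_class_no_id(tag):

    return tag.has_attr("class") and not tag.has_attr("id")

soup.find_all(has_class_no_id)  # function filter: tags for which the function returns True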
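The five filter kinds can be compared by looking at which tag names each one returns. This is a sketch on a shortened, made-up document; the `names` helper is introduced here only for readability.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical shortened document, used only for this illustration.
snippet = """<html><head><title>T</title></head>
<body><p class="title"><b>B</b></p>
<a class="sister" id="link1">Elsie</a></body></html>"""
soup = BeautifulSoup(snippet, 'html.parser')

def names(tags):
    """Helper for this example: reduce a result list to tag names."""
    return [t.name for t in tags]

by_string = names(soup.find_all('a'))
by_regex = names(soup.find_all(re.compile('^b')))
by_list = names(soup.find_all(['a', 'b']))
by_true = names(soup.find_all(True))
by_func = names(soup.find_all(lambda t: t.has_attr('class') and not t.has_attr('id')))

print(by_string, by_regex, by_list, by_true, by_func, sep='\n')
```

Results always come back in document order, which is why the list filter `['a', 'b']` yields the `<b>` tag before the `<a>` tag here.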

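　　11. The find selectors:

　　　　Syntax:

　　　　# find_all( name , attrs , recursive , text , **kwargs )

　　　　#  name: the tag name to look for

　　　　#  attrs: the tag's attributes

　　　　#  recursive: whether to search all descendants (True, the default) or only direct children (False)

　　　　#  text: the text to look for

　　　　# **kwargs: any other keyword arguments, treated as attribute filters

　　Special cases:

　　　　data-foo="value": an attribute name containing a hyphen is not a valid Python keyword argument, so it must be written as attrs={"data-foo":"value"}

　　　　class="value": class is a reserved word in Python, so write class_="value" or attrs={"class":"value"}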
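The two special cases can be tried out on a small made-up document (the `snippet` string below is an assumption for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a hyphenated attribute and a multi-valued class.
snippet = '<div data-foo="value" class="box">x</div><div class="box big">y</div>'
soup = BeautifulSoup(snippet, 'html.parser')

# soup.find_all(data-foo="value") would be a SyntaxError, so use attrs instead.
by_data = soup.find_all(attrs={'data-foo': 'value'})

# class is a Python keyword, so the keyword argument is spelled class_.
by_class_kw = soup.find_all(class_='box')
by_class_attrs = soup.find_all(attrs={'class': 'box'})

print([t.get_text() for t in by_data])
print([t.get_text() for t in by_class_kw])
print([t.get_text() for t in by_class_attrs])
```

Both spellings of the class filter match any tag that has `box` among its classes, so the second `<div>` (class `box big`) is included.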

from bs4 import BeautifulSoup

import re

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

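# find_all( name , attrs , recursive , text , **kwargs )

#  name: the tag name to look for (string, regular expression, list, function, or True)

#  attrs: the tag's attributes

#  recursive: whether to search all descendants or only direct children

#  text: the text to look for

# **kwargs: any other keyword arguments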

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('p', 'title')) # <p> tags with class="title"

soup.find_all('title')  # list of all <title> tags

soup.find_all(attrs={"class":"sister"})  # list of all tags whose class is sister

soup.find_all(id='link2')  # all tags whose id is link2

soup.find_all(href=re.compile("elsie")) # all tags whose href contains "elsie"

soup.find_all(id=True)  # all tags that have an id attribute

soup.find_all(id="link1", href=re.compile('elsie'))  # id is link1 and href contains "elsie"

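Searching by class

soup = BeautifulSoup(html_doc, 'html.parser')

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

css_soup.find_all("p", class_="body")  # multi-valued class: matching any one of the values is enough

css_soup.find_all("p", class_="strikeout")

css_soup.find_all("p", class_="body strikeout")  # exact match against the full attribute string, so the order must be the same

# the text argument accepts a string, a regular expression, a list, a function, or True

soup.find_all("a", text="Elsie")  # <a> tags whose text is "Elsie"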
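Searching by text behaves differently depending on whether a tag name is also given. The sketch below assumes Beautiful Soup 4.4.0 or later, where this argument is named `string`; the `text` spelling used above is the older name for the same filter. The `snippet` document is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with three links.
snippet = '<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')

# With only a string filter, the matching text nodes themselves are returned.
exact = soup.find_all(string='Elsie')
from_list = soup.find_all(string=['Tillie', 'Elsie'])
by_regex = soup.find_all(string=re.compile('ie$'))

# Combined with a tag name, the tags whose .string matches are returned.
tags = soup.find_all('a', string='Elsie')

# limit stops the search after the given number of results.
limited = soup.find_all('a', limit=2)

print(exact, from_list, by_regex, tags, limited, sep='\n')
```

Note that the list filter returns matches in document order, not in the order the list was written.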

　　12. Searching parent nodes:

　　　　find_parents( name , attrs , text , limit , **kwargs )

　　　　find_parent( name , attrs , text , **kwargs )

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were</p>

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

    <p>

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    </p>

    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

a_string = soup.find(text="Lacie")  # the text node whose content is "Lacie"

type(a_string), a_string  # <class 'bs4.element.NavigableString'> Lacie

a_parent = a_string.find_parent()  # the nearest ancestor of a_string (the <a> tag)

a_parent = a_string.find_parent("p")  # the nearest <p> ancestor of a_string

a_parents = a_string.find_parents()  # all ancestors of a_string

a_parents = a_string.find_parents("a")  # all <a> ancestors of a_string
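What these calls return is easier to see on a smaller tree. A minimal sketch on a made-up `snippet`, assuming Beautiful Soup 4.4.0 or later for the `string` argument name:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a link nested inside a paragraph.
snippet = '<html><body><p><a id="link2">Lacie</a></p></body></html>'
soup = BeautifulSoup(snippet, 'html.parser')
a_string = soup.find(string='Lacie')   # start from the text node

nearest = a_string.find_parent().name            # closest enclosing tag
nearest_p = a_string.find_parent('p').name       # closest enclosing <p>
all_names = [t.name for t in a_string.find_parents()]   # nearest ancestor first
no_match = a_string.find_parent('div')           # no such ancestor: None

print(nearest, nearest_p, all_names, no_match)
```

The ancestors are listed from the innermost outward, the same order in which `.parents` iterates.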

　　13. Following siblings:

　　　　find_next_siblings( name , attrs , text , limit , **kwargs )

　　　　find_next_sibling( name , attrs , text , **kwargs )

html_doc = """<html><head><title>The Dormouse's story</title></head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were</p>

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

    <b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>,

    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

first_link = soup.a  # the first <a> tag

a_sibling = first_link.find_next_sibling()  # the first following sibling tag

a_sibling = first_link.find_next_sibling("a")  # the first following sibling that is an <a> tag

a_siblings = first_link.find_next_siblings()  # all following sibling tags

a_siblings = first_link.find_next_siblings("a")  # all following siblings that are <a> tags

　　14. Preceding siblings:

　　　　find_previous_siblings( name , attrs , text , limit , **kwargs )

　　　　find_previous_sibling( name , attrs , text , **kwargs )
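The preceding-sibling methods mirror the following-sibling ones. A minimal sketch, starting from the last link of a made-up `snippet`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three links and a <b> tag sharing one parent.
snippet = ('<p><a id="link1">Elsie</a>, <b>bold</b> '
           '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(snippet, 'html.parser')
last_link = soup.find(id='link3')

prev_any = last_link.find_previous_sibling()       # nearest preceding sibling tag
prev_b = last_link.find_previous_sibling('b')      # nearest preceding <b> sibling
# All preceding sibling tags, nearest first (id where present, else the tag name).
prev_all = [t.get('id', t.name) for t in last_link.find_previous_siblings()]
prev_links = [t['id'] for t in last_link.find_previous_siblings('a')]

print(prev_any, prev_b, prev_all, prev_links, sep='\n')
```

Results run backwards through the document, so the sibling closest to the starting tag comes first.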

　　15. Following nodes:

　　　　find_all_next( name , attrs , text , limit , **kwargs )

　　　　find_next( name , attrs , text , **kwargs )

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were</p>

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

    <p>

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    </p>

    <p>

        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    </p>

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

a_string = soup.find(text="Lacie")

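a_next = a_string.find_next()  # the first tag after a_string in document order (not limited to descendants)

a_next = a_string.find_next('a')  # the first <a> tag after a_string

a_nexts = a_string.find_all_next()  # all tags after a_string

a_nexts = a_string.find_all_next('a')  # all <a> tags after a_string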
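Unlike the sibling methods, `find_next` and `find_all_next` walk forward through everything that comes later in the document, whatever its nesting level. A small sketch on a made-up `snippet`, assuming Beautiful Soup 4.4.0 or later for the `string` argument name:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two paragraphs, each wrapping one link.
snippet = ('<div><p><a id="link2">Lacie</a> and</p>'
           '<p><a id="link3">Tillie</a></p></div>')
soup = BeautifulSoup(snippet, 'html.parser')
a_string = soup.find(string='Lacie')

first_after = a_string.find_next()               # the second <p>, not a sibling of the text
next_link = a_string.find_next('a')              # the link inside that <p>
names_after = [t.name for t in a_string.find_all_next()]

print(first_after)
print(next_link)
print(names_after)
```

The second `<p>` is neither a sibling nor a descendant of the starting text node, yet it is found because it appears later in the document.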

　　16. Preceding nodes:

　　　　find_all_previous( name , attrs , text , limit , **kwargs )

　　　　find_previous( name , attrs , text , **kwargs )
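One detail of the backward search is worth seeing in code: a tag's own parent counts as "previous", because its opening tag appears earlier in the document. A minimal sketch on a made-up `snippet`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a title paragraph followed by a story paragraph with a link.
snippet = ('<html><head><title>T</title></head><body>'
           '<p class="title"><b>B</b></p>'
           '<p class="story"><a id="link1">Elsie</a></p></body></html>')
soup = BeautifulSoup(snippet, 'html.parser')
link = soup.find(id='link1')

prev_p = link.find_previous('p')           # the <p> that contains the link itself
prev_title = link.find_previous('title')   # reaches back into <head>
all_prev_p = link.find_all_previous('p')   # nearest first

print(prev_p['class'], prev_title.string)
print([p['class'] for p in all_prev_p])
```

To exclude enclosing tags and look only at the same level, `find_previous_sibling()` is the method to reach for instead.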

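　　17. Parsing only part of a document:

　　　　Parsing an entire document just to find its <a> tags wastes memory and time. It is faster to ignore everything except the <a> tags from the start. The SoupStrainer class describes which parts of a document are of interest, so that only the parts matching the SoupStrainer are parsed instead of the whole document. Create a SoupStrainer object and pass it to the BeautifulSoup constructor as the parse_only argument.

　　SoupStrainer parameters: name , attrs , text , **kwargs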

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were

        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    </p>

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import SoupStrainer

a_tags = SoupStrainer('a')  # only <a> tags

id_tags = SoupStrainer(id="link2")  # only tags whose id is link2

def is_short_string(string):

    return len(string) < 10  # True when the string is shorter than 10 characters

short_string = SoupStrainer(text=is_short_string)  # only text nodes that satisfy the function

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()

soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()

soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()
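To check what a strainer actually keeps, the resulting soup can be searched like any other. A minimal sketch with the html.parser backend and a made-up `snippet`:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical snippet: one paragraph and two links.
snippet = ('<html><body><p>intro</p>'
           '<a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>')

# Keep only <a> tags while parsing; everything else is discarded.
only_a = SoupStrainer('a')
a_soup = BeautifulSoup(snippet, 'html.parser', parse_only=only_a)
kept_names = [t.name for t in a_soup.find_all(True)]
kept_ids = [t['id'] for t in a_soup.find_all('a')]
paragraph = a_soup.find('p')    # the <p> was never parsed: None

# Keep only the tag whose id is link2.
id_soup = BeautifulSoup(snippet, 'html.parser', parse_only=SoupStrainer(id='link2'))
kept_by_id = [t['id'] for t in id_soup.find_all(True)]

print(kept_names, kept_ids, paragraph, kept_by_id)
```

Keeping the BeautifulSoup object itself, rather than the string returned by `.prettify()`, leaves the filtered tree available for further searching.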
