Python Beautiful Soup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Before using Beautiful Soup, install the package and the parsers you plan to use:

pip install beautifulsoup4

pip install lxml

pip install html5lib

Once installation is complete, import the module:

from bs4 import BeautifulSoup

Choose a parser and create the object:

import urllib.request

html = urllib.request.urlopen(url).read()  # url: address of the page to fetch

bs = BeautifulSoup(html,'lxml')

Pretty-print the parsed document:

print(bs.prettify())
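The steps above can be tried without a network connection. The following is a minimal sketch, assuming a hard-coded HTML string (`sample_html` is a made-up example) and Python's built-in "html.parser" instead of lxml, so that no extra parser has to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; 'lxml' or 'html5lib' could be passed instead.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as an indented string, one tag per line.
pretty = soup.prettify()
print(pretty)
```

Different parsers can build slightly different trees from invalid markup, so it is usually worth naming the parser explicitly rather than relying on the default.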

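Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1. Tag: A Tag has many methods and attributes. The most important are name and attributes.

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')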

>>> tag = soup.b

>>> type(tag)

<class 'bs4.element.Tag'>

name: every tag has a name, available as .name:

>>> tag.name

'b'

# Changing a tag's name is reflected in any HTML generated from the current Beautiful Soup object

>>> tag.name = "blockquote"

>>> tag

<blockquote class="boldest">Extremely bold</blockquote>

attributes: the attributes defined on the tag

# They can be read directly through .attrs

>>> tag.attrs

{'class': ['boldest']}

# A tag can also be indexed like a dictionary

>>> tag['class']

['boldest']

# Tag attributes can be added, removed, or modified

>>> tag['class'] = 'verybold'   # modify

>>> tag['id'] = 1               # add

>>> tag

<blockquote class="verybold" id="1">Extremely bold</blockquote>

>>> del tag['class']            # delete

>>> del tag['id']               # delete

>>> tag

<blockquote>Extremely bold</blockquote>
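Dictionary-style access raises KeyError when the attribute is missing, so .get() is often more convenient for optional attributes. The sketch below uses made-up markup and assumes the built-in "html.parser". It also shows that class is treated as a multi-valued attribute (a list), while id comes back as a plain string:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
soup = BeautifulSoup('<p class="lead intro" id="first">Hi</p>', "html.parser")
tag = soup.p

classes = tag["class"]             # multi-valued attribute: list of strings
tag_id = tag["id"]                 # single-valued attribute: plain string
missing = tag.get("title")         # .get() returns None when the attribute is absent
has_title = tag.has_attr("title")  # explicit membership test

try:
    tag["title"]                   # indexing a missing attribute raises KeyError
except KeyError:
    raised = True
else:
    raised = False

print(classes, tag_id, missing, has_title, raised)
```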

2. NavigableString: the text contained in a tag

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')

>>> tag = soup.b

>>> tag.string

'Extremely bold'

>>> type(tag.string)

<class 'bs4.element.NavigableString'>

A string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

>>> tag.string.replace_with("No longer bold")

'Extremely bold'

>>> tag

<b class="boldest">No longer bold</b>
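One detail worth knowing is that .string is only defined when a tag has a single child. A short sketch, using made-up markup and assuming the built-in "html.parser", shows the difference between .string and get_text():

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

only_child = soup.b.string   # <b> has exactly one child, so .string is that text
ambiguous = p.string         # <p> has two children (text and <b>), so .string is None
full_text = p.get_text()     # concatenates all text below <p>
plain = str(only_child)      # convert the NavigableString to an ordinary str

print(repr(only_child), ambiguous, repr(full_text))
```

Converting to str is useful when the text is kept after parsing, because a NavigableString holds a reference back to the whole parse tree.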

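3. BeautifulSoup:

The BeautifulSoup object represents the document as a whole. For most purposes it can be treated as a Tag object: it supports most of the methods described for navigating and searching the tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name or attributes of its own. It is sometimes convenient to look at its .name, though, so it has been given the special .name "[document]":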

>>> soup.name

'[document]'

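4. Comment: comments

Tag, NavigableString, and BeautifulSoup cover almost everything found in an HTML or XML document, but a few special objects remain. The one most likely to need attention is the comment:

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

>>> soup = BeautifulSoup(markup, 'lxml')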

>>> comment = soup.b.string

>>> type(comment)

<class 'bs4.element.Comment'>

>>> comment

'Hey, buddy. Want to buy a used parser?'
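Because a Comment is a special kind of NavigableString, comments can be picked out with an isinstance check. The following is a sketch rather than a definitive recipe: the markup is made up, the parser is the built-in "html.parser", and the `string=` keyword assumes Beautiful Soup 4.4.0 or later (older releases call it `text=`):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup for illustration only.
markup = "<div><!--internal note--><p>Visible text</p></div>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Remove the comments from the tree.
for c in comments:
    c.extract()

remaining = str(soup)
print(comments, remaining)
```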

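find_all method: searches the descendants of the current tag and returns every one that matches the filter. find() is similar to find_all(limit=1), except that it returns the first match directly (or None) instead of a list.

find_all(name, attrs, recursive, text, limit, **kwargs)

name parameter:

Pass a value for name to find all tags with that name. Text strings are ignored automatically. The simplest usage is: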
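The outputs shown in this section appear to come from the "three sisters" sample used in the official documentation. A sketch of that document, assuming the built-in "html.parser", makes the examples reproducible and shows how find() differs from find_all():

```python
from bs4 import BeautifulSoup

# Sample document in the style of the official documentation (assumed here).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")         # list of every <a> tag
first_link = soup.find("a")            # a single Tag
limited = soup.find_all("a", limit=1)  # a list holding one Tag
nothing = soup.find("table")           # None when nothing matches
empty = soup.find_all("table")         # empty list when nothing matches

print(len(all_links), first_link["id"], nothing, empty)
```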

soup.find_all("title")

# [<title>The Dormouse's story</title>]

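keyword parameters:

Any keyword argument that is not one of find_all()'s own parameter names is treated as a filter on the tag attribute of that name. If you pass an argument called id, Beautiful Soup filters on each tag's "id" attribute.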

soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

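class is a reserved word in Python, so it cannot be used as a keyword argument. To search by CSS class, use class_ with a trailing underscore: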

soup.find_all('div',class_="p12" )
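Since class is multi-valued, class_ matches a tag if any one of its classes equals the given value. A small sketch with made-up markup, assuming the built-in "html.parser", illustrates this along with the equivalent attrs form:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
markup = (
    '<p class="body strikeout">a</p>'
    '<p class="body">b</p>'
    '<div class="p12">c</div>'
)
soup = BeautifulSoup(markup, "html.parser")

by_one_class = soup.find_all("p", class_="strikeout")     # matches the first <p> only
by_shared = soup.find_all(class_="body")                  # matches both <p> tags
by_attrs = soup.find_all("div", attrs={"class": "p12"})   # same search via attrs

print(len(by_one_class), len(by_shared), len(by_attrs))
```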

If you pass href, Beautiful Soup filters on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The following example finds every tag that has an id attribute, whatever its value:

soup.find_all(id=True)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing several keyword arguments filters on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes have names that cannot be used as keyword arguments, such as the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')

data_soup.find_all(data-foo="value")

# SyntaxError: keyword can't be an expression

You can still search on such attributes by passing them in a dictionary as the attrs argument of find_all():

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]

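text parameter:

The text parameter searches the strings in the document instead of the tags. Like name, it accepts a string, a regular expression, a list, a function, or True. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.) Some examples: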

soup.find_all(text="Elsie")

# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))

# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return s == s.parent.string

soup.find_all(text=is_the_only_string_within_a_tag)

# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']
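A string search on its own returns NavigableString objects, not tags. To get the enclosing tags, either follow .parent or combine the string filter with a tag name. The sketch below uses made-up markup and the built-in "html.parser", and it assumes Beautiful Soup 4.4.0 or later, where the argument is spelled `string=`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
markup = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# String search alone: the results are the text nodes themselves.
strings = soup.find_all(string="Elsie")
parent_ids = [s.parent["id"] for s in strings]

# Tag name plus string: the results are tags whose .string matches.
tags = soup.find_all("a", string="Elsie")

print(strings, parent_ids, tags)
```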

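limit parameter:

find_all() returns every match, which can be slow on a large document tree. If you do not need all the results, use the limit parameter to cap the number returned. It works like the LIMIT keyword in SQL: as soon as limit results have been found, the search stops and returns them.

The document tree has three tags that match the search below, but only two are returned because of the limit: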

soup.find_all("a", limit=2)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find all <b> tags in the document:

soup.find_all('b')

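If you pass a regular expression, Beautiful Soup filters tag names against it using its search() method. The example below finds every tag whose name starts with b, so both <body> and <b> are found: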

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

If you pass a list, Beautiful Soup returns anything that matches any item in the list. The code below finds all <a> tags and all <b> tags:

soup.find_all(["a", "b"])

True matches every tag. The code below finds all the tags in the document but returns none of the string nodes:

for tag in soup.find_all(True):

    print(tag.name)

# html

# head

# title

# body

# p

# b

# p

# a

# a

# a

# p
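The filter types described above can be compared side by side. This is a minimal sketch with made-up markup, assuming the built-in "html.parser". The helper `names` and the predicate `has_href_but_no_id` are hypothetical names introduced only for this illustration. The last search shows that a function taking a tag and returning True or False can also be used as a filter:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    "<html><head><title>T</title></head>"
    "<body><p id='x'><b>bold</b> <a href='#'>link</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def names(results):
    """Return the tag names of a result list, in document order."""
    return [t.name for t in results]


def has_href_but_no_id(tag):
    """Custom filter: keep tags that define href but not id."""
    return tag.has_attr("href") and not tag.has_attr("id")


by_regex = names(soup.find_all(re.compile("^b")))      # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))             # <a> or <b>
every_tag = names(soup.find_all(True))                 # every tag, no strings
by_function = names(soup.find_all(has_href_but_no_id)) # custom predicate

print(by_regex, by_list, every_tag, by_function)
```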
