BeautifulSoup: CSS Selectors

BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a tag or of the soup object; the matches are returned as a list.

　　tag.select("string")

　　soup.select("string")
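As a minimal sketch of both call forms: the markup and variable names below are made up for illustration, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
markup = '<div id="box"><p class="x">one</p><p>two</p></div><p>three</p>'
soup = BeautifulSoup(markup, "html.parser")

# Called on the soup object, select() searches the whole document.
all_p = soup.select("p")
print(len(all_p))                         # 3

# Called on a tag, select() searches only that tag's descendants.
box = soup.select_one("#box")             # select_one() returns the first match or None
inner_p = box.select("p")
print([p.get_text() for p in inner_p])    # ['one', 'two']

# A selector that matches nothing gives an empty list, not an exception.
print(soup.select("table"))               # []
```

Because the result is always a list, index it (or use select_one()) before calling tag methods such as get_text().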

Sample source:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title" name="dromouse">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="mysis" href="http://example.com/elsie" id="link1">
                <b>the first b tag</b>
                Elsie
            </a>,
            <a class="mysis" href="http://example.com/lacie" id="link2" myname="kong">
                Lacie
            </a>and
            <a class="mysis" href="http://example.com/tillie" id="link3">
                Tillie
            </a>;and they lived at the bottom of a well.
        </p>
        <p class="story">
            myStory
            <a>the end a tag</a>
        </p>
        <a>the p tag sibling</a>
    </body>
</html>
"""

from bs4 import BeautifulSoup

# the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')

　　1. Selecting by tag

# Select all title tags

soup.select("title")

# Select every p tag that is the third p among its siblings

soup.select("p:nth-of-type(3)")  # in this document, the same tag as soup.select("p")[2]
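In the sample document all three p tags are children of body, so the third p among its siblings is also the third p overall. That is a coincidence of this document, not a general rule. A small sketch with made-up markup where the two differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the p elements are spread over two parents.
markup = """
<div><p>a1</p><p>a2</p></div>
<div><p>b1</p><p>b2</p><p>b3</p></div>
"""
soup = BeautifulSoup(markup, "html.parser")

# Indexing the result list: the third p in document order, whatever its parent.
by_index = soup.select("p")[2]
print(by_index.get_text())                 # b1

# :nth-of-type(3): every p that is the third p among its own siblings.
by_nth = soup.select("p:nth-of-type(3)")
print([p.get_text() for p in by_nth])      # ['b3']
```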

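# Select all a tags anywhere under body (descendants)

soup.select("body a")

# Select a tags that are direct children of body

soup.select("body > a")

# Select all later siblings of id=link1 that have class mysis

soup.select("#link1 ~ .mysis")

# Select the sibling immediately after id=link1, if it has class mysis

soup.select("#link1 + .mysis")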
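The four combinators used in this section (space, `>`, `~`, `+`) are easy to mix up. The sketch below runs them against a small made-up fragment; the `ids` helper is hypothetical and exists only to keep the output short.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the sample document.
markup = """
<body>
  <p>
    <a id="link1" class="mysis">Elsie</a>
    <a id="link2" class="mysis">Lacie</a>
    <span>gap</span>
    <a id="link3" class="mysis">Tillie</a>
  </p>
  <a id="outer">outside p</a>
</body>
"""
soup = BeautifulSoup(markup, "html.parser")

def ids(selector):
    """Return the id of every element matched by selector."""
    return [tag.get("id") for tag in soup.select(selector)]

print(ids("body a"))           # ['link1', 'link2', 'link3', 'outer']  descendants
print(ids("body > a"))         # ['outer']                             direct children
print(ids("#link1 ~ .mysis"))  # ['link2', 'link3']                    all later siblings
print(ids("#link1 + .mysis"))  # ['link2']                             next sibling only
print(ids("#link2 + .mysis"))  # []  the next sibling element is the span
```

The sibling combinators look at element siblings only; text between the tags does not count.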

　　2. Selecting by class name

# Select a tags whose class attribute includes mysis

soup.select("a.mysis")

　　3. Selecting by id

# Select the a tag whose id attribute is link1

soup.select("a#link1")
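A short sketch of how the tag name narrows a class or id selector, and how class selectors behave when a tag carries several classes. The markup, the `favorite` class, and the `ids` helper are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = """
<a class="mysis" id="link1">Elsie</a>
<a class="mysis favorite" id="link2">Lacie</a>
<span class="mysis" id="note">not a link</span>
"""
soup = BeautifulSoup(markup, "html.parser")

def ids(selector):
    """Return the id of every element matched by selector."""
    return [tag["id"] for tag in soup.select(selector)]

print(ids(".mysis"))            # ['link1', 'link2', 'note']  any tag with the class
print(ids("a.mysis"))           # ['link1', 'link2']          only a tags
print(ids("a.mysis.favorite"))  # ['link2']                   both classes required
print(ids("a#link1"))           # ['link1']
print(ids("span#link1"))        # []  the id exists, but not on a span
```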

　　4. Selecting by [attribute] (this also works for class)

# Select all a tags that have a myname attribute

soup.select("a[myname]")

# Select all a tags whose href is exactly http://example.com/lacie

soup.select("a[href='http://example.com/lacie']")

# Select a tags whose href starts with http

soup.select('a[href^="http"]')

# Select a tags whose href ends with lacie

soup.select('a[href$="lacie"]')

# Select a tags whose href contains .com

soup.select('a[href*=".com"]')
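To see the attribute operators side by side, here is a sketch on a made-up set of links. The `example.org` URL and the `ids` helper are assumptions added so that each operator filters something out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2" myname="kong">Lacie</a>
<a href="https://example.org/tillie" id="link3">Tillie</a>
<a id="link4">no href</a>
"""
soup = BeautifulSoup(markup, "html.parser")

def ids(selector):
    """Return the id of every element matched by selector."""
    return [tag["id"] for tag in soup.select(selector)]

print(ids("a[myname]"))                           # ['link2']  attribute exists
print(ids("a[href]"))                             # ['link1', 'link2', 'link3']
print(ids('a[href="http://example.com/lacie"]'))  # ['link2']  exact value
print(ids('a[href^="http:"]'))                    # ['link1', 'link2']  prefix
print(ids('a[href$="lacie"]'))                    # ['link2']  suffix
print(ids('a[href*=".com"]'))                     # ['link1', 'link2']  substring
```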

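# Remove a tag from the document; afterwards soup no longer contains any script tags

[s.extract() for s in soup('script')]

# To remove several kinds of tag at once, pass a list of names

[s.extract() for s in soup(['script', 'frame'])]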
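The list comprehensions above are run only for their side effect, so a plain loop does the same job and reads more clearly. A minimal sketch, assuming made-up markup and that script and style are the tags to drop:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = """
<html><head><script>var a = 1;</script><style>p {color: red}</style></head>
<body><p>visible text</p><script>var b = 2;</script></body></html>
"""
soup = BeautifulSoup(markup, "html.parser")

# soup(...) is shorthand for soup.find_all(...); a list matches any of the names.
removed = []
for tag in soup(["script", "style"]):
    removed.append(tag.extract())      # extract() detaches the tag and returns it

print(len(removed))                    # 3
print(soup.select("script, style"))    # []
print(soup.get_text(strip=True))       # visible text
```

If the removed tags are not needed afterwards, tag.decompose() can be used in place of tag.extract(); it destroys the tag instead of returning it.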

　　5. Getting text and attributes

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were

        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    </p>

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

# select() returns its results as a list

soup = BeautifulSoup(html_doc, 'html.parser')

s = soup.select('p.story')

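s[0].get_text()  # text of the p node and all of its descendants

s[0].get_text("|")  # join the pieces of text with the given separator

s[0].get_text("|", strip=True)  # strip whitespace from each piece of text and drop empty pieces

print(s[0].get("class"))  # class values of the p node, as a list (class is multi-valued; most other attributes come back as a string)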
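The effect of the separator and strip arguments is easiest to see on a one-line fragment. This is a sketch on made-up markup; the `intro` class and the `id` value are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = ('<p class="story intro" id="p1"> Once <a href="/elsie">Elsie</a>'
          ' and <a href="/lacie">Lacie</a> </p>')
soup = BeautifulSoup(markup, "html.parser")
p = soup.select("p.story")[0]

print(repr(p.get_text()))            # ' Once Elsie and Lacie '
print(repr(p.get_text("|")))         # ' Once |Elsie| and |Lacie| '
print(p.get_text("|", strip=True))   # Once|Elsie|and|Lacie

print(p.get("class"))                # ['story', 'intro']  multi-valued, so a list
print(p.get("id"))                   # p1                  single-valued, so a string
print(p.get("title"))                # None                missing attribute
```

Using get() rather than p["title"] avoids a KeyError when the attribute is absent.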

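　　6. UnicodeDammit.detwingle() only knows how to handle Windows-1252 content embedded in a UTF-8 document; it converts such a document to pure UTF-8.

from bs4 import UnicodeDammit

# doc is a bytes object that mixes UTF-8 and Windows-1252 content
new_doc = UnicodeDammit.detwingle(doc)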

print(new_doc.decode("utf8"))

# ☃☃☃“I like snowmen!”

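Be sure to call UnicodeDammit.detwingle() on the document before creating a BeautifulSoup or UnicodeDammit object, so that the document has a single, consistent encoding. If you parse a UTF-8 document that contains Windows-1252 content without doing so, you get mojibake such as: â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.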
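A self-contained sketch of the snowmen case, building the mixed-encoding bytes by hand. The variable names `snowmen` and `quote` are chosen for illustration, and passing `from_encoding` to the constructor is one option, not a requirement.

```python
from bs4 import BeautifulSoup, UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = ("\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
         "\N{RIGHT DOUBLE QUOTATION MARK}")

# One document, two encodings: UTF-8 snowmen followed by Windows-1252 quotes.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# doc.decode("utf8") would raise UnicodeDecodeError on the Windows-1252 bytes.
new_doc = UnicodeDammit.detwingle(doc)   # still bytes, now pure UTF-8
text = new_doc.decode("utf8")
print(text)                              # ☃☃☃“I like snowmen!”

# Only after detwingling is the document handed to BeautifulSoup.
soup = BeautifulSoup(new_doc, "html.parser", from_encoding="utf-8")
```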

　　7. Other examples:

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were

        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    </p>

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

# select() returns its results as a list

soup = BeautifulSoup(html_doc, 'html.parser')

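soup.select('title')  # the title tag

soup.select("p:nth-of-type(3)")  # every p that is the third p among its siblings

soup.select('body a')  # all a nodes anywhere under body

soup.select('p > a')  # a nodes that are direct children of a p node

soup.select('p > #link1')  # the node with id=link1, if it is a direct child of a p node

soup.select('#link1 ~ .sister')  # all later siblings of id=link1 that have class sister

soup.select('#link1 + .sister')  # the sibling immediately after id=link1, if it has class sister

soup.select('.sister')  # all nodes that have the class sister

soup.select('[class="sister"]')  # all nodes whose class attribute is exactly "sister"

soup.select("#link1")  # the node with id=link1

soup.select("a#link1")  # the a node with id=link1

soup.select('a[href]')  # all a nodes that have an href attribute

soup.select('a[href="http://example.com/elsie"]')  # all a nodes with exactly this href

soup.select('a[href^="http://example.com/"]')  # all a nodes whose href starts with the given value

soup.select('a[href$="tillie"]')  # all a nodes whose href ends with the given value

soup.select('a[href*=".com/el"]')  # all a nodes whose href contains the given substring (plain substring match, not a regular expression)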
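The `*=` operator tests for a literal substring; CSS attribute selectors do not interpret regular expressions. When a real pattern is needed, find_all() accepts a compiled regex as an attribute filter. A sketch on made-up links (the pattern itself is only an example):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(markup, "html.parser")

# *= is a literal substring test: ".com/el" must appear verbatim in href.
print([a["id"] for a in soup.select('a[href*=".com/el"]')])   # ['link1']

# Regex metacharacters inside the value are treated as ordinary characters.
print(soup.select('a[href*="(elsie|tillie)$"]'))              # []

# For an actual regular expression, pass a compiled pattern to find_all().
pattern = re.compile(r"/(elsie|tillie)$")
print([a["id"] for a in soup.find_all("a", href=pattern)])    # ['link1', 'link3']
```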
