转python爬虫：BeautifulSoup 使用select方法详解

 1 html = """

 2 <html><head><title>The Dormouse's story</title></head>

 3 <body>

 4 <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

 5 <p class="story">Once upon a time there were three little sisters; and their names were

 6 <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

 7 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

 8 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 9 and they lived at the bottom of a well.</p>

10 <p class="story">...</p>

11 """

我们在写 CSS 时，标签名不加任何修饰，类名前加点，id名前加 #，在这里我们也可以利用类似的方法来筛选元素，用到的方法是 soup.select()，返回类型是 list
（1）通过标签名查找

print soup.select('title')

#[<title>The Dormouse's story</title>]

print soup.select('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('b')

#[<b>The Dormouse's story</b>]

（2）通过类名查找

print soup.select('.sister')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

（3）通过 id 名查找

print soup.select('#link1')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

（4）组合查找

组合查找即和写 class 文件时，标签名与类名、id名进行的组合原理是一样的，例如查找 p 标签中，id 等于 link1的内容，二者需要用空格分开

print soup.select('p #link1')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

直接子标签查找

print soup.select("head > title")

#[<title>The Dormouse's story</title>]

（5）属性查找

查找时还可以加入属性元素，属性需要用中括号括起来，注意属性和标签属于同一节点，所以中间不能加空格，否则会无法匹配到。

print soup.select("head > title")

#[<title>The Dormouse's story</title>]

print soup.select('a[href="http://example.com/elsie"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

同样，属性仍然可以与上述查找方式组合，不在同一节点的空格隔开，同一节点的不加空格

print soup.select('p a[href="http://example.com/elsie"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

转python爬虫：BeautifulSoup 使用select方法详解的更多相关文章

BeautifulSoup 使用select方法详解（通过标签名，类名， id，组合，属性查找）
import requestsfrom bs4 import BeautifulSoup blslib="html5lib"user_agent="Mozilla/5.0 ...
python中requests库使用方法详解
目录 python中requests库使用方法详解官方文档什么是Requests 安装Requests库基本的GET请求带参数的GET请求解析json 添加headers 基本POST请求 ...
Python爬虫之selenium库使用详解
Python爬虫之selenium库使用详解本章内容如下: 什么是Selenium selenium基本使用声明浏览器对象访问页面查找元素多个元素查找元素交互操作交互动作执行JavaS ...
Python操作SQLite数据库的方法详解
Python操作SQLite数据库的方法详解本文实例讲述了Python操作SQLite数据库的方法.分享给大家供大家参考,具体如下: SQLite简单介绍 SQLite数据库是一款非常小巧的嵌入式开 ...
Python学习笔记：魔术方法详解
准备工作为了确保类是新型类,应该把 _metaclass_=type 入到你的模块的最开始. class NewType(Object): mor_code_here class OldType: ...
一文秒懂！Python字符串格式化之format方法详解
format是字符串内嵌的一个方法,用于格式化字符串.以大括号{}来标明被替换的字符串,一定程度上与%目的一致.但在某些方面更加的方便 1.基本用法 1.按照{}的顺序依次匹配括号中的值 s = &q ...
python的内置模块re模块方法详解以及使用
正则表达式一.普通字符 . 通配符一个.只匹配一个字符匹配任意除换行符"\n"外的字符(在DOTALL模式中也能匹配换行符 >>> import re ...
python中os模块函数方法详解最全最新
os.getcwd() 获取当前工作目录,即当前python脚本工作的目录路径 import os print(os.getcwd()) os.chdir("dirname") 改 ...
python MethodType方法详解和使用
python 中MethodType方法详解和使用废话不多说,直接上代码 #!/usr/bin/python # -*-coding:utf-8-*- from types import Metho ...

随机推荐

java.util.Arrays类
前言:java.util.Arrays类的技术文档请查看Oracle官网 1.Arrays类常见方法: 本文参考资料:百度文库:Oracle官网:第三方中文技术文档
FTP publisher plugin插件
说明:这个插件可以将构建的产物(例如:Jar)发布到FTP中去. 官方说明:FTP publisher plugin 安装步骤: 系统管理→管理插件→可选插件→Artifact Uploaders→F ...
浅谈lvs和nginx的一些优点和缺点
借鉴一些网上资料整理了简单的比较: LVS的负载能力强,因为其工作方式逻辑非常简单,仅进行请求分发,而且工作在网络的第4层,没有流量,所以其效率不需要有过多的忧虑. LVS基本能支持所有应用,因为工作 ...
JSP制作简单登陆
JSP制作简单登陆界面运行环境 eclipse+tomcat+MySQL 不知道的可以参考Jsp运行环境--Tomcat 项目列表这里我先把jsp文件先放在Web-INF外面访问需要建立的几个文 ...
异常解决：Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
异常描述这个异常通常有如下信息: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failu ...
[硬件]_ELVE_VS2015下opencv3.3的配置问题
0x00 引言最近想搞一下摄像头,但是我的Windows版本是64位的,opencv3.3貌似也只支持64位系统了,所以就配置一下win10+vs2015+opencv3.3的环境变量,具体下载和 ...
HAproxy部署配置
HAproxy部署配置拓扑图说明: haproxy服务器IP:172.16.253.200/16 (外网).192.168.29.140/24(内网) 博客服务器组IP:192.168.29.13 ...
git远程仓库之从远程库克隆
上次我们讲了先有本地库,后有远程库的时候,如何关联远程库. 现在,假设我们从零开发,那么最好的方式是先创建远程库,然后,从远程库克隆. 首先,登陆GitHub,创建一个新的仓库,名字叫gitskill ...
移动端https抓包那些事--进阶篇
上一次和大家介绍了手机端https抓包的初级篇,即在手机未root或者未越狱的情况下如何抓取https流量,但是当时分析应用时会发现,好多应用的https的流量还是无法抓取到,这是为什么呢? 主要原因 ...
POJ 2593 Max Sequence
Max Sequence Time Limit: 3000MS Memory Limit: 65536K Total Submissions: 17678 Accepted: 7401 Des ...

转python爬虫：BeautifulSoup 使用select方法详解

转python爬虫：BeautifulSoup 使用select方法详解的更多相关文章

随机推荐

热门专题