爬虫库之BeautifulSoup学习（五）

css选择器：

我们在写 CSS 时，标签名不加任何修饰，类名前加点，id名前加 #，在这里我们也可以利用类似的方法来筛选元素，用到的方法是 soup.select()，返回类型是 list

1）通过标签名查找

print soup.select('title')

#[<title>The Dormouse's story</title>]

print soup.select('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2)通过类名查找

print soup.select('.sister')

3)通过id名查找

print soup.select('#link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

4)组合查找

组合查找即和写 class 文件时，标签名与类名、id名进行的组合原理是一样的，例如查找 p 标签中，id 等于 link1的内容，二者需要用空格分开

print soup.select('p #link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

直接子标签查找

print soup.select("head>title")

#[<title>The Dormouse's story</title>]

5、属性查找

查找时还可以加入属性元素，属性需要用中括号括起来，注意属性和标签属于同一节点，所以中间不能加空格，否则会无法匹配到。

print soup.select('a[class="sister"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('a[href="http://example.com/elsie"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

print soup.select('a[href="http://example.com/elsie"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]
同样，属性仍然可以与上述查找方式组合，不在同一节点的空格隔开，同一节点的不加空格

print soup.select('p a[href="http://example.com/elsie"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

以上的 select 方法返回的结果都是列表形式，可以遍历形式输出，然后用 get_text() 方法来获取它的内容。

soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()

for title in soup.select('title'):
print title.get_text()

soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()

for title in soup.select('title'):
print title.get_text()

好，这就是另一种与 find_all 方法有异曲同工之妙的查找方法，是不是感觉很方便？

【实战练习】：

爬取kugou top 500排名、歌手、歌曲、时间

#-*-coding:utf-8-*-

import requests

from bs4 import BeautifulSoup

import time

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) \

Chrome/66.0.3359.181 Safari/537.36 XiaoBai/10.0.1708.542 (XBCEF)'}

def get_info(url):

　　wb_data = requests.get(url,headers=headers)

　　soup = BeautifulSoup(wb_data.text,'lxml')

　　ranks = soup.select('span.pc_temp_num') #获取排名情况

　　titles = soup.select('div.pc_temp_songlist > ul > li > a') #获取歌曲标题

　　times = soup.select('span.pc_temp_tips_r > span') #获取歌曲时间

　　for rank,title,time in zip(ranks,titles,times):

　　　　data = {

　　　　'rank':rank.get_text().strip(),

　　　　'singer':title.get_text().split('-')[0],

　　　　'song':title.get_text().split('-')[-1],

　　　　'time':time.get_text().strip()

　　　　}

　　　　print (data)

if __name__ == '__main__':

　　urls = ['http://www.kugou.com/yy/rank/home/{}-8888.html'.format(number) for number in range(1,24)] #构造多页url

　　for url in urls:

　　　　get_info(url) #循环调用get_info函数

　　time.sleep(1)

爬虫库之BeautifulSoup学习（五）的更多相关文章

爬虫库之BeautifulSoup学习（一）
Beautiful Soup的简介简单来说,Beautiful Soup是python的一个库,最主要的功能是从网页抓取数据. 官方解释如下: Beautiful Soup提供一些简单的.pytho ...
爬虫库之BeautifulSoup学习（四）
探索文档树: find_all(name,attrs,recursive,text,**kwargs) 方法搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件 1.name参数,可以查找所有 ...
爬虫库之BeautifulSoup学习（二）
BeautifulSoup官方介绍文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html 四大对象种类: Beau ...
爬虫库之BeautifulSoup学习（三）
遍历文档树: 1.查找子节点 .contents tag的.content属性可以将tag的子节点以列表的方式输出. print soup.body.contents print type(soup. ...
【Python】在Pycharm中安装爬虫库requests , BeautifulSoup , lxml 的解决方法
BeautifulSoup在学习Python过程中可能需要用到一些爬虫库例如:requests BeautifulSoup和lxml库前面的两个库,用Pychram都可以通过 File--> ...
使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解（新手必学）
为大家介绍下Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的详细方法与函数下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最 ...
python爬虫解析库之Beautifulsoup模块
一介绍 Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会 ...
java_web学习(五) JSTL标准标签库
1.什么是JSTL JSP标准标签库(JSTL)是一个JSP标签集合,它封装了JSP应用的通用核心功能. JSTL支持通用的.结构化的任务,比如迭代,条件判断,XML文档操作,国际化标签,SQL标签. ...
【爬虫入门手记03】爬虫解析利器beautifulSoup模块的基本应用
[爬虫入门手记03]爬虫解析利器beautifulSoup模块的基本应用 1.引言网络爬虫最终的目的就是过滤选取网络信息,因此最重要的就是解析器了,其性能的优劣直接决定这网络爬虫的速度和效率.Bea ...

随机推荐

如何重建一个损坏的调用堆栈（callstack）
原文作者:Aaron Ballman原文时间:2011年07月04日原文地址:http://blog.aaronballman.com/2011/07/reconstructing-a-corrupt ...
MOS简介
功率半导体器件机能 MOS管(击穿原因),它采用“超级结”(Super-Junction)结构,故又称超结功率MOSFET.全数字控制是发展趋势,已经在很多功率变换设备中得到应用.既管理了对电网的谐波 ...
关于 ++x 和 x++ 比较难的一个例子
public class testMain { static{ int x = 5;//如果后面有static int x, 前面的定义就没有用x会被重新定义为0 } static int y; st ...
laravel 将数组转化成字符串再把字符串转化成数组
这是在给阮少翔改代码的时候用的方法, 开始的数据用explored转化成数组不是想要的结果, 我就自己写了一个方法把有用的信息提取出来拼接成一个字符串, 再用explored将字符串转化成数组. ...
caffe学习--使用caffe中的imagenet对自己的图片进行分类训练(超级详细版) -----linux
http://blog.csdn.net/u011244794/article/details/51565786 标签: caffeimagenet 2016-06-02 12:57 9385人阅读 ...
加载jsp页面报#{} is not allowed in template text
问题是在引进jQueryUI时遇到解决方法: 在page指令添加: deferredSyntaxAllowedAsLiteral="true" 例如:&l ...
Python学习笔记14：标准库之信号量（signal包）
signal包负责在Python程序内部处理信号.典型的操作包含预设信号处理函数,暂停并等待信号,以及定时发出SIGALRM等. 要注意,signal包主要是针对UNIX平台(比方Linux, MAC ...
封装EF code first用存储过程的分页方法
一年半没有做过MVC的项目了,还是很怀念(因为现在项目还是原来的ASPX),个人还是喜欢mvc,最近又开始重拾MVC,感觉既熟悉又陌生. 记录一下封装好的分页代码首先先说下我使用EF codefi ...
Android自己定义控件--下拉刷新的实现
我们在使用ListView的时候.非常多情况下须要用到下拉刷新的功能.为了了解下拉刷新的底层实现原理,我採用自己定义ListView控件的方式来实现效果. 实现的基本原理是:自己定义ListView, ...
zoj 2711 - Regular Words
题目:求由A.B.C构成的有序传中长度为n.且每一个B前面的A的个数不少于当前B,每一个C前面的B的个数不少于当前C的个数. 分析:dp,求排列组合数. 考虑二维的状况: 假设 A>=B 则在 ...

爬虫库之BeautifulSoup学习（五）

爬虫库之BeautifulSoup学习（五）的更多相关文章

随机推荐

热门专题