Python Web Scraping Basics: BeautifulSoup

I. Basic Usage of BeautifulSoup

 from bs4 import BeautifulSoup

 from bs4 import SoupStrainer

 import re

 html_doc = """

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters; and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

     Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ; and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 """

 soup = BeautifulSoup(html_doc, "html.parser")

 # print(soup.prettify())  # print the whole document as normalized, indented HTML

 print('-----------------------------')

 print(soup.title)

 print('----------------------------')

 print(soup.title.name)

 print('----------------------------')

 print(soup.title.string)

 print('----------------------------')

 print(soup.title.parent.name)

 print('----------------------------')

 print(soup.p)

 # item_b = soup.p.

 print('----------------------------')

 print(soup.p['class'])

 print('----------------------------')

 print(soup.find_all('a'))

 print('----------------------------')

 print(soup.find(id='link3'))

 print(soup.find(id='link3')['class'])

 print(soup.find(id='link3')['href'])  # print the value of the given attribute

 print(soup.find(id='link3')['id'])

 print(soup.find(id='link3').get_text())  # print the tag's text

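 # find_all(name, attrs, recursive, text, limit, **kwargs)

 # (the text argument is called string from Beautiful Soup 4.4.0 on)

 # The name argument

 soup.find_all('title')

 # Keyword arguments

 soup.find_all(id='link2')

 soup.find_all(href=re.compile("elsie"))

 soup.find_all(id=True)  # find every tag that has an id attribute, whatever its value

 soup.find_all(href=re.compile("elsie"), id='link1')  # several keyword arguments filter on several attributes at once

 soup.find_all(attrs={"data-foo": "value"})  # attributes that cannot be written as keyword arguments (such as data-*) can be searched by passing a dict as attrs

 soup.find_all('a', limit=2)  # stop searching once limit results have been found

 # Searching by CSS class

 soup.find_all("a", class_="sister")

 soup.find_all(class_=re.compile("itl"))  # class_ accepts the same kinds of filters: a string, a regular expression, and so on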

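 # CSS selectors

 title_list = soup.select('head > title')  # find all elements that match the selector

 title_list_one = soup.select_one('head > title')  # find the first element that matches the selector

 print(title_list)  # prints a one-element list holding the <title> tag

 print(title_list[0].string)  # prints The Dormouse's story (with the surrounding whitespace from html_doc)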

 # Find the links of all <a> tags in the document:

 for link in soup.find_all('a'):

     print(link.get('href'))

 # http://example.com/elsie

 # http://example.com/lacie

 # http://example.com/tillie

 # find() returns the first <p> tag whose class is story

 p_story = soup.find('p',class_='story')

 # print(p_story.a)

 # Using a regular expression to match tag names

 p_re_all = soup.find_all(re.compile('p'))

 print(p_re_all)

 # find_all() with class_=True matches every <p> tag that has a class attribute, whatever its value

 p_all = soup.find_all('p', class_=True)

 # print(p_all)  # prints a list

 # [<p class="title">

 # <b>

 #     The Dormouse's story

 #    </b>

 # </p>, <p class="story">

 #    Once upon a time there were three little sisters; and their names were

 #    <a class="sister" href="http://example.com/elsie" id="link1">

 #     Elsie

 #    </a>

 #    ,

 #    <a class="sister" href="http://example.com/lacie" id="link2">

 #     Lacie

 #    </a>

 #    and

 #    <a class="sister" href="http://example.com/tillie" id="link3">

 #     Tillie

 #    </a>

 #    ; and they lived at the bottom of a well.

 #   </p>, <p class="story">

 #    ...

 #   </p>]
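The listing above imports SoupStrainer but never uses it. As a minimal sketch of what it is for, the block below passes a SoupStrainer through the parse_only argument so that only the <a> tags end up in the tree. The shortened html_doc and the names only_a_tags and links are assumptions made for this illustration; they are not part of the original listing.

```python
from bs4 import BeautifulSoup, SoupStrainer

# A shortened copy of the sample document, so the block runs on its own.
html_doc = """
<p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
  Elsie
 </a>
 ,
 <a class="sister" href="http://example.com/lacie" id="link2">
  Lacie
 </a>
 and
 <a class="sister" href="http://example.com/tillie" id="link3">
  Tillie
 </a>
</p>
"""

# Keep only <a> tags while parsing; everything else is discarded.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

# get_text(strip=True) removes the whitespace that surrounds each link text.
links = [(a["href"], a.get_text(strip=True)) for a in soup.find_all("a")]
print(links)

# The <p> tag was never added to the tree, so this lookup finds nothing.
print(soup.find("p"))
```

Restricting the parse this way can help when only one kind of tag is needed from a large page. The strip=True argument is also a convenient way to deal with the whitespace that html_doc places around every string.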

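II. BeautifulSoup in Practice

1. Parsing the HTML source of NetEase Cloud Music

This is the category link for Chinese-language (华语) playlists on NetEase Cloud Music: http://music.163.com/#/discover/playlist/?order=hot&cat=华语&limit=35&offset=0. Opening the Elements panel in Chrome DevTools (F12) shows the page source: the playlists on each page sit inside an iframe, and each playlist's information forms one li tag containing the playlist image, the playlist link, the playlist name, and so on.

First, extract a snippet of the HTML source: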

  <ul class="m-cvrlst f-cb" id="m-pl-container">

    <li>

     <div class="u-cover u-cover-1">

      <img class="j-flag" src="http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140" />

      <a title="【说唱】留住你一面，画在我心间" href="/playlist?id=832790627" class="msk"></a>

      <div class="bottom">

       <a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="832790627" data-res-action="play"></a>

       <span class="icon-headset"></span>

       <span class="nb">1615</span>

      </div>

     </div> <p class="dec"> <a title="【说唱】留住你一面，画在我心间" href="/playlist?id=832790627" class="tit f-thide s-fc0">【说唱】留住你一面，画在我心间</a> </p> <p><span class="s-fc4">by</span> <a title="JediMindTricks" href="/user/home?id=17647877" class="nm nm-icn f-thide s-fc3">JediMindTricks</a> <sup class="u-icn u-icn-84 "></sup> </p> </li>

    <li>

     <div class="u-cover u-cover-1">

      <img class="j-flag" src="http://p1.music.126.net/If644P7ZrfPm_qcvtYyfzg==/18936888765458653.jpg?param=140y140" />

      <a title="鞋子好看｜国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="msk"></a>

      <div class="bottom">

       <a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="721462105" data-res-action="play"></a>

       <span class="icon-headset"></span>

       <span class="nb">77652</span>

      </div>

     </div> <p class="dec"> <a title="鞋子好看｜国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="tit f-thide s-fc0">鞋子好看｜国产自赏摇滚噪音流行</a> </p> <p><span class="s-fc4">by</span> <a title="原创君" href="/user/home?id=201586" class="nm nm-icn f-thide s-fc3">原创君</a> <sup class="u-icn u-icn-1 "></sup> </p> </li>

   </ul>

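Parsing the HTML source

First instantiate a BeautifulSoup object with html.parser as the parser. Use the object's CSS selector method select_one() with an ID selector to find the unordered list ul, then call find_all() to get every li tag under it. Finally, iterate over the li tags to get each playlist's image URL, playlist URL, and name.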

 from bs4 import BeautifulSoup

 html = '''the HTML source extracted above'''

 soup = BeautifulSoup(html, 'html.parser')

 ul = soup.select_one('#m-pl-container')

 for li in ul.find_all('li'):

     img_url = li.img['src']

     a_msk = li.find('a', class_='msk')

     musicList_url = 'http://music.163.com%s' % a_msk['href']  # the href is site-relative, so prepend the site root

     musicList_name = a_msk['title']

     print(img_url, musicList_url, musicList_name)  # first line printed: http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140 http://music.163.com/playlist?id=832790627 【说唱】留住你一面，画在我心间
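The same loop can be sketched a little more defensively. The version below is only an illustration under stated assumptions: it joins the relative href to the site root with urllib.parse.urljoin, and it also reads the author and play count from the a.nm and span.nb elements seen in the snippet. The names BASE_URL and parse_playlists are hypothetical, and the HTML is a trimmed copy of the first li from the snippet above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Site root taken from the category URL in the article (assumed to be the base for relative links).
BASE_URL = "http://music.163.com/"

# Trimmed copy of the first <li> from the snippet, so the block runs on its own.
html = '''
<ul class="m-cvrlst f-cb" id="m-pl-container">
 <li>
  <div class="u-cover u-cover-1">
   <img class="j-flag" src="http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140" />
   <a title="【说唱】留住你一面，画在我心间" href="/playlist?id=832790627" class="msk"></a>
   <div class="bottom">
    <span class="nb">1615</span>
   </div>
  </div>
  <p><span class="s-fc4">by</span> <a title="JediMindTricks" href="/user/home?id=17647877" class="nm nm-icn f-thide s-fc3">JediMindTricks</a></p>
 </li>
</ul>
'''


def parse_playlists(markup):
    """Return one dict per playlist <li> found under #m-pl-container."""
    soup = BeautifulSoup(markup, "html.parser")
    playlists = []
    for li in soup.select("#m-pl-container > li"):
        cover = li.select_one("a.msk")   # link that carries the title and href
        author = li.select_one("a.nm")   # may be missing, hence the checks below
        plays = li.select_one("span.nb")
        playlists.append({
            "name": cover["title"],
            "url": urljoin(BASE_URL, cover["href"]),
            "image": li.img["src"],
            "author": author.get_text(strip=True) if author else None,
            "plays": plays.get_text(strip=True) if plays else None,
        })
    return playlists


playlists = parse_playlists(html)
for item in playlists:
    print(item)
```

urljoin resolves the site-relative href against the base, which avoids assembling the URL by hand. The play count is kept as a string because the snippet does not show how the site formats large numbers.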

III. Beautiful Soup 4.4.0

Beautiful Soup is a Python library for extracting data from HTML or XML files. For detailed usage, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
