Installation: pip install beautifulsoup4

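Beautiful Soup supports several parsers, such as Python's built-in html.parser, lxml, and html5lib, each with its own strengths and weaknesses. Pick whichever suits your own habits.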
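As a rough illustration of choosing a parser, the sketch below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The helper name make_soup and its preferred argument are hypothetical and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    make_soup is a hypothetical helper for illustration, not a bs4 API.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p class='title'><b>demo</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from invalid markup, so it is safer to name the parser explicitly than to rely on whatever happens to be installed.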

# Fetch the HTML source
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.python123.io/ws/demo.html')
demo = r.text

soup = BeautifulSoup(demo, 'html.parser')

# prettify() returns the document with standard indentation; the output is shown below
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

A quick look at accessing structured data

# Source of the demo page
html_d = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
"""

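from bs4 import BeautifulSoup

soup = BeautifulSoup(html_d, 'html.parser')

# Get the title tag
print(soup.title)

# Get all the text in the document
print(soup.text)

# Get the tag's name
print(soup.title.name)

# Get the tag's attributes as a dict (empty here, because <title> has none)
print(soup.title.attrs)

# Get the child nodes of the first p tag
print(soup.p.contents)        # a list
print(list(soup.p.children))  # .children is an iterator, so convert it to a list to print it

# Get all the a tags
print(soup.find_all('a'))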
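The title tag in the demo page has no attributes, so soup.title.attrs prints an empty dict. Here is a short sketch of reading attributes from a tag that does have some. The markup is a trimmed copy of the first link in the demo page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first link from the demo page.
snippet = (
    '<a href="http://www.icourse163.org/course/BIT-268001" '
    'class="py1" id="link1">Basic Python</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

link = soup.a
print(link.attrs)         # all attributes as a dict
print(link["href"])       # index like a dict; raises KeyError if the attribute is missing
print(link.get("title"))  # .get() returns None when the attribute is missing
print(link["class"])      # class is multi-valued, so it comes back as a list
print(link.get_text())    # the text inside the tag
```

Prefer .get() when an attribute may be absent, since indexing raises KeyError.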

Common parsing methods

# Source of the demo page
html_d = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
"""

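from bs4 import BeautifulSoup

soup = BeautifulSoup(html_d, "lxml")  # the lxml parser must be installed: pip install lxml

# All the direct child nodes of the first p tag, as a list
print(soup.p.contents)
print(soup.contents[0].name)

# .children yields the same direct children of p, but as an iterator rather than a list
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

# .descendants walks every node nested under p, at any depth
print(soup.p.descendants)
for i, node in enumerate(soup.p.descendants):
    print(i, node)

# .string returns the text only when the tag has exactly one child node;
# with several children it returns None
print(soup.title.string)

# .strings iterates over every string in the document
for string in soup.strings:
    print(repr(string))

# .stripped_strings does the same, but skips whitespace-only strings and trims the rest
for line in soup.stripped_strings:
    print(line)

# Parent node of the first a tag
print(soup.a.parent)

# .parents is an iterator over all ancestors of the a tag
for parent in soup.a.parents:
    print(parent.name)

# Sibling nodes
print(soup.a.previous_sibling)     # previous sibling
print(soup.a.next_sibling)         # next sibling
print(list(soup.a.next_siblings))  # all following siblings at the same level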
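To make the navigation attributes concrete, here is a small sketch that collects their results into plain Python values. It assumes a compact copy of the demo markup with the whitespace between tags removed and a shortened paragraph text, so the counts differ from what the full html_d would give.

```python
from bs4 import BeautifulSoup

# Compact copy of the demo markup (whitespace between tags removed, text shortened).
html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body><p class="course">Learn: '
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Walk upward: every ancestor, ending with the BeautifulSoup object itself.
ancestors = [parent.name for parent in first_link.parents]
print(ancestors)

# Siblings can be text nodes as well as tags.
print(repr(first_link.previous_sibling))
print(repr(first_link.next_sibling))

# .children yields direct children only; .descendants recurses into them.
body = soup.body
child_count = len(list(body.children))
descendant_count = len(list(body.descendants))
print(child_count, descendant_count)
```

Because text between tags counts as a node, next_sibling often returns a string rather than the next tag. Use find_next_sibling() when only tags are wanted.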

Using find_all(name, attrs, recursive, string, **kwargs)

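import re
from bs4 import BeautifulSoup

# html_d is the demo source defined above
soup = BeautifulSoup(html_d, "lxml")

# name: a compiled regex matches every tag whose name contains "b" (here body and b)
for tag in soup.find_all(re.compile('b')):
    print(tag.name)

# attrs: a string passed as the second argument is matched against the CSS class
print(soup.find_all('p', 'course'))

# keyword: keyword arguments filter on attribute values
print(soup.find_all(id='link1'))

# recursive: with recursive=False only direct children are searched,
# so this returns [] because the a tags are nested deeper
print(soup.find_all('a', recursive=False))

# string: match text nodes instead of tags
print(soup.find_all(string=re.compile('python')))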
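A few more filter forms of find_all are sketched below, again on a trimmed copy of the demo markup. The variable names are only illustrative. The class_, attrs, and limit arguments and the find() shortcut are part of Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<body><p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
course_paragraphs = soup.find_all("p", class_="course")

# attrs takes a dict, which also works for attribute names that are not valid identifiers.
first_link = soup.find_all("a", attrs={"id": "link1"})

# limit stops the search after that many matches.
one_link = soup.find_all("a", limit=1)

# A compiled regex can match attribute values; find() returns the first match or None.
advanced = soup.find("a", href=re.compile(r"BIT-1001870001$"))

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

find_all always returns a list, which is empty when nothing matches. find returns a single tag or None, so check for None before using the result.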

A small example

import requests
import bs4
from bs4 import BeautifulSoup

# Download the page and return its text ("" on failure)
def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

# Extract the ranking rows from the page
def fillunivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # the download failed or the page layout changed
        return
    for tr in tbody.children:
        # Skip the whitespace text nodes between rows
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # calling a tag is shorthand for tr.find_all('td')
            if len(tds) >= 4:
                ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])

# Print the first num rows
def printUnivList(ulist, num):
    # Alternative: pad the school name with full-width spaces, chr(12288), so that CJK text lines up:
    # tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    # print(tplt.format('排名', '学校名称', '省份', '总分', chr(12288)))
    # for u in ulist[:num]:
    #     print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
    tplt = "{:^10}\t{:^6}\t{:^10}\t{:^10}"
    # Header: rank, university name, region, total score
    print(tplt.format('排名', '学校名称', '地区', '总分'))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], u[3]))

def main():
    unifo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHtmlText(url)
    fillunivList(unifo, html)
    printUnivList(unifo, 20)  # print the top 20

main()
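The extraction and printing steps can be tried without network access. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is a made-up table (its names and scores are placeholders, not real ranking data), and parse_rows and format_row are hypothetical helper names. It also shows padding with the full-width space, chr(12288), so that columns of Chinese text line up.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample table for illustration; the values are made up.
SAMPLE_HTML = """
<table><tbody>
<tr><td>1</td><td>甲大学</td><td>北京</td><td>95.0</td></tr>
<tr><td>2</td><td>乙大学</td><td>上海</td><td>90.5</td></tr>
<tr><td>3</td><td>丙</td></tr>
</tbody></table>
"""

FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character


def parse_rows(html):
    """Return [rank, name, region, score] for each complete table row."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:  # the page layout changed or the download failed
        return []
    rows = []
    for tr in tbody.children:
        if not isinstance(tr, Tag):  # skip whitespace text nodes between rows
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 4:  # ignore short or malformed rows
            rows.append(cells[:4])
    return rows


def format_row(rank, name, region, score):
    """Center each field; pad the CJK fields with full-width spaces."""
    template = "{0:^10}\t{1:{4}^10}\t{2:{4}^6}\t{3:^10}"
    return template.format(rank, name, region, score, FULL_WIDTH_SPACE)


rows = parse_rows(SAMPLE_HTML)
print(format_row("排名", "学校名称", "地区", "总分"))
for row in rows:
    print(format_row(*row))
```

An ordinary space is narrower than a Chinese character in most fonts, which is why centering with the default fill leaves the columns ragged. Using U+3000 as the fill character keeps the padding the same width as the text.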

