Using BeautifulSoup in Python

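Basic usage of the BS4 library:

1. It works best together with the lxml library. Install it with: pip install lxml

2. It works best together with the Requests library. Install it with: pip install requests

3. Install bs4 itself: pip install bs4 (pip install beautifulsoup4 installs the same library)

4. Typing pip directly does nothing? Fix: Environment Variables -> System variables -> Path -> New: C:\Python27\Scripts (use the Scripts folder of your own Python installation)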
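
After the install steps above, it can help to confirm that the three packages are actually importable from the interpreter you are running. The following is a minimal sketch, assuming Python 3; the helper name missing_packages is made up for this illustration and is not part of any of the libraries.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    # find_spec returns None when a top-level module cannot be located
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["bs4", "lxml", "requests"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("bs4, lxml and requests are all importable")
```

If a package shows up as missing even though pip reported success, pip probably installed it into a different Python installation than the one running the script.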

Example: getting a website's title

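# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = "https://www.baidu.com"
response = requests.get(url)
# Pass the raw bytes so BeautifulSoup can work out the page encoding itself
soup = BeautifulSoup(response.content, 'lxml')
# print() with parentheses runs on both Python 2 and Python 3
print(soup.title.text)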
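
The title example above assumes that the request succeeds, that lxml is installed, and that the page has a title tag. Below is a slightly more defensive sketch of the same idea. The function names make_soup, page_title and fetch_title are hypothetical helpers invented for this illustration, and the 10 second timeout is an arbitrary choice.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not available
        return BeautifulSoup(markup, "html.parser")


def page_title(markup):
    """Return the stripped <title> text, or None if there is no <title>."""
    soup = make_soup(markup)
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


def fetch_title(url, timeout=10):
    """Download `url` and return its title (needs network access)."""
    # Imported here so the parsing helpers work without requests installed
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return page_title(response.content)


sample = b"<html><head><title> Demo page </title></head><body></body></html>"
print(page_title(sample))
print(page_title(b"<p>no title here</p>"))
```

Calling fetch_title("https://www.baidu.com") would reproduce the original example, with a None result instead of an AttributeError when a page has no title.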

Identifying tags

Example 1:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# prettify() is BeautifulSoup's built-in method for formatted (indented) output
print(soup.prettify())
# Text of the title tag
print(soup.title.string)
# Name of the title tag's parent node
print(soup.title.parent.name)
# The first p tag
print(soup.p)
# The class attribute of the first p tag (returned as a list)
print(soup.p["class"])
# The first a tag
print(soup.a)
# Find all a tags
print(soup.find_all('a'))
# Find the tag with id='link3'
print(soup.find(id='link3'))
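
In practice the result of find_all is usually looped over to pull out attributes and text rather than printed whole. Here is a small sketch using a shortened copy of the markup above; it uses html.parser only so that it runs without lxml, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (id, href, text) for every a tag
links = [(a["id"], a.get("href"), a.get_text()) for a in soup.find_all("a")]
for link_id, href, text in links:
    print(link_id, href, text)

# class can hold several values, so it comes back as a list
print(soup.a["class"])
# tag.get() returns None for a missing attribute; tag["title"] would raise KeyError
print(soup.a.get("title"))
```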

Example 2:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# .contents returns the direct children of the p tag as a list (child tags and the text between them)
print(soup.p.contents)
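
Note that .contents holds text nodes as well as tags, so the list is longer than the number of child tags. The sketch below, on a shortened made-up paragraph, shows two ways to keep only the tags: filtering by type, or calling find_all with recursive=False.

```python
from bs4 import BeautifulSoup, Tag

html = ('<p class="story">Once upon a time <a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <b>friends</b>.</p>')
soup = BeautifulSoup(html, "html.parser")

# Every direct child: strings and tags mixed together
children = soup.p.contents
print(len(children))

# Keep only the Tag objects, dropping the text nodes
tag_names = [child.name for child in children if isinstance(child, Tag)]
print(tag_names)

# Same result: True matches any tag, recursive=False stops at direct children
direct_tags = [tag.name for tag in soup.p.find_all(True, recursive=False)]
print(direct_tags)
```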

find_all example:

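# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Sample markup rebuilt from the queries below; the li text other than 'Foo' is a placeholder
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
</ul>
</div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

# Find all ul tags
print(soup.find_all('ul'))
# Call find_all again on each result to get all the li tags
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
# Find the tag with id list-1
print(soup.find_all(attrs={'id': 'list-1'}))
# Find the tags with class element
print(soup.find_all(attrs={'class': 'element'}))
# Find all strings equal to 'Foo' (the argument is named text in bs4 releases before 4.4.0)
print(soup.find_all(string='Foo'))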
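
The attrs dictionary used above has keyword shortcuts that are often easier to read. The following sketch shows a few of them on a small made-up list document (the markup is an assumption for illustration, not the article's original sample): id= and class_= as keyword filters, limit= to cap the number of results, and find for a single result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just large enough to exercise each filter
html = """
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# id= is shorthand for attrs={'id': ...}
by_id = soup.find_all(id="list-1")
# class is a Python keyword, so the filter is spelled class_
texts = [li.get_text() for li in soup.find_all("li", class_="element")]
# class_ matches tags that carry the class among several
small = soup.find_all("ul", class_="list-small")
# limit stops the search after that many matches
first_two = soup.find_all("li", limit=2)
# find returns the first match, or None when nothing matches
missing = soup.find("table")

print(texts, small[0]["id"], len(first_two), missing)
```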

CSS selector example:

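# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Sample markup rebuilt from the selectors below; the li text is a placeholder
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
</ul>
</div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

# Elements with class panel-heading inside an element with class panel
print(soup.select('.panel .panel-heading'))
# li tags nested inside ul tags
print(soup.select('ul li'))
# Elements with class element inside the element with id list-2
print(soup.select('#list-2 .element'))
# Use get_text() to get the text content
for li in soup.select('li'):
    print(li.get_text())
# An attribute can be read with [name] or attrs[name]
for ul in soup.select('ul'):
    print(ul['id'])
    # print(ul.attrs['id'])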
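
A few more selector forms are worth knowing alongside the ones above. This is a sketch on a small made-up document (the markup is an assumption for illustration), assuming a reasonably recent bs4 in which select is backed by the soupsieve package: select_one for a single match, > for direct children, chained classes, and an attribute selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the selectors below
html = """
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches
first = soup.select_one("#list-1 .element")
# '>' restricts the match to direct children
direct_items = soup.select("ul > li")
# Chained classes: ul tags carrying both list and list-small
small_lists = soup.select("ul.list.list-small")
# [id] matches any ul that has an id attribute
ids = [ul["id"] for ul in soup.select("ul[id]")]
nothing = soup.select_one("table")

print(first.get_text(), len(direct_items), len(small_lists), ids, nothing)
```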
