Using BeautifulSoup in Python

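Basic usage of the BS4 library:

1. It works best together with the lxml library. Install: pip install lxml

2. It works best together with the Requests library. Install: pip install requests

3. Install bs4: pip install bs4

4. Typing pip on the command line does nothing? Fix: Environment Variables -> System variables -> Path -> New, then add the Scripts directory of your Python installation, for example C:\Python27\Scripts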

Example: getting a website's title

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = "https://www.baidu.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# print() with parentheses works on both Python 2 and Python 3
print(soup.title.text)
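
The example above assumes the page has a title. When it does not, soup.title is None and reading .text raises AttributeError. The sketch below shows one way to guard against that. The helper name extract_title is hypothetical, and it uses the built-in 'html.parser' so it runs without lxml; pass 'lxml' instead if that library is installed.

```python
from bs4 import BeautifulSoup


def extract_title(markup, parser="html.parser"):
    """Return the stripped <title> text, or None if the page has no title."""
    soup = BeautifulSoup(markup, parser)
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


# Static sample markup, used here instead of a live request.
sample = "<html><head><title> Example Page </title></head><body></body></html>"
print(extract_title(sample))
print(extract_title("<p>no title here</p>"))
```

With Requests, the same helper could be called as extract_title(response.content).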

Tag recognition

Example 1:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# BeautifulSoup has a built-in method for pretty-printed output
print(soup.prettify())

# Text of the title tag
print(soup.title.string)

# Name of the title tag's parent node
print(soup.title.parent.name)

# The first p tag
print(soup.p)

# The class attribute of the first p tag (returned as a list)
print(soup.p["class"])

# The first a tag
print(soup.a)

# Find all a tags
print(soup.find_all('a'))

# Find the tag with id='link3'
print(soup.find(id='link3'))
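
A common next step after locating tags is to pull out their text and attributes. The following is a minimal sketch of that, using a shortened copy of the sample markup and the built-in 'html.parser' (an assumption made so the snippet runs without lxml). The id link9 is deliberately one that does not exist.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect (text, href) pairs from every a tag.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id="link9")
print(missing is None)
```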

Example 2:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# .contents returns all direct children of the p tag as a list
# (both child tags and the text between them)
print(soup.p.contents)
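
Because .contents mixes text nodes with tags, it can help to filter the children by type. The sketch below illustrates this on a shortened, made-up paragraph, again with the built-in 'html.parser' as an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p class="story">Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

# .contents holds three text nodes and two a tags here.
print(len(soup.p.contents))

# Keep only the child tags and drop the text nodes.
child_tags = [node for node in soup.p.children if isinstance(node, Tag)]
child_ids = [tag["id"] for tag in child_tags]
print(child_ids)
```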

find_all example:

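# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Sample markup matching the queries below; the list items other than Foo are illustrative
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<ul id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul id="list-2">
<li class="element">Foo</li>
</ul>
</div>
'''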

soup = BeautifulSoup(html, 'lxml')

# Find all ul tags
print(soup.find_all('ul'))

# Call find_all again on each result to get all li tags
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

# Find the tags whose id is list-1
print(soup.find_all(attrs={'id': 'list-1'}))

# Find the tags whose class is element
print(soup.find_all(attrs={'class': 'element'}))

# Find all text nodes equal to 'Foo' (string= replaces the older text= argument)
print(soup.find_all(string='Foo'))
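
find_all also accepts keyword shortcuts in place of the attrs dictionary. The sketch below shows class_, id and limit on a small made-up list structure, parsed with the built-in 'html.parser' as an assumption.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul id="list-2">
<li class="element">Foo</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ (trailing underscore) avoids clashing with Python's class keyword.
items = soup.find_all("li", class_="element")
print(len(items))

# id can be passed directly as a keyword argument.
first_list = soup.find_all(id="list-1")
print(len(first_list))

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)
print([li.get_text() for li in first_two])
```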

CSS selector example:

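# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Sample markup matching the selectors below; the list items other than Foo are illustrative
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<ul id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul id="list-2">
<li class="element">Foo</li>
</ul>
</div>
'''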

soup = BeautifulSoup(html, 'lxml')

# Elements with class panel-heading inside an element with class panel
print(soup.select('.panel .panel-heading'))

# li tags nested inside ul tags
print(soup.select('ul li'))

# Elements with class element inside the element with id list-2
print(soup.select('#list-2 .element'))

# Use get_text() to get the text content
for li in soup.select('li'):
    print(li.get_text())

# Attributes can be read with [name] or attrs[name]
for ul in soup.select('ul'):
    print(ul['id'])
    # print(ul.attrs['id'])
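
Reading ul['id'] raises KeyError when a tag has no id attribute. The sketch below shows two related calls that may help: select_one for a single match, and .get for attributes that may be absent. The markup is made up for illustration, and 'html.parser' is used as an assumption so lxml is not required.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="list-1"><li class="element">Foo</li></ul>
<ul><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None if nothing matches.
first = soup.select_one("#list-1 .element")
print(first.get_text())

# .get returns None for a missing attribute instead of raising KeyError.
ids = [ul.get("id") for ul in soup.select("ul")]
print(ids)
```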
