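Getting Started with BeautifulSoup in Python

What is BeautifulSoup?

Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. This can save a good deal of programming time.

Let's go straight to an example: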

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)           # the <title> tag

print(soup.title.name)      # the tag's name

print(soup.title.string)    # the text inside the tag

print(soup.p)               # the first <p> tag

print(soup.a)               # the first <a> tag

print(soup.find_all('a'))   # every <a> tag

print(soup.find(id='link3'))  # the first tag whose id is "link3"

print(soup.get_text())      # all the text in the document

The output is:

<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

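As you can see, soup is the BeautifulSoup object produced by parsing the document (it is not a string). soup.title returns the title tag, and soup.p returns only the first p tag in the document; to get every matching tag you need the find_all function. find_all returns a list-like sequence, so you can loop over it and pick out whatever you need.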
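As a small illustration of looping over the result of find_all, the sketch below collects the text and href of each link. The shortened html_doc and the links variable are made up for this example.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document, for illustration only.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like result, so a plain for loop works.
links = []
for a in soup.find_all("a"):
    links.append((a.get_text(), a["href"]))

for text, href in links:
    print(text, href)
```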

get_text() returns the text content. It works on any tag obtained from the BeautifulSoup object, not only on soup itself; try print(soup.p.get_text()).
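To make this concrete, here is a minimal sketch that calls get_text() on the whole document and on individual tags. The one-line HTML snippet is an assumption chosen to keep the whitespace simple.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; not the full sample document.
html_doc = '<p class="story">Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'

soup = BeautifulSoup(html_doc, "html.parser")

whole_text = soup.get_text()    # text of the entire document
p_text = soup.p.get_text()      # text of the first <p>, including nested tags
a_text = soup.a.get_text()      # text of the first <a> only

print(whole_text)
print(p_text)
print(a_text)
```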

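You can also read a tag's other attributes. For example, to get the value of the a tag's href attribute, use print(soup.a['href']). Other attributes such as class can be read the same way (soup.a['class']); because class may hold several values, it comes back as a list.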
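The sketch below shows a few ways of reading attributes from a single tag. The snippet is illustrative; .get() and .attrs are included as the dictionary-style alternatives to square brackets.

```python
from bs4 import BeautifulSoup

html_doc = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.a

href = a["href"]          # square brackets raise KeyError if the attribute is missing
classes = a["class"]      # class is multi-valued, so this is a list
missing = a.get("title")  # .get() returns None instead of raising
all_attrs = a.attrs       # every attribute as a dictionary

print(href, classes, missing)
print(all_attrs)
```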

Structural tags such as head can be reached the same way, as soup.head, just like soup.title above.

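How do you get a tag's contents as an array? Use the contents attribute: print(soup.head.contents) returns all direct children of head as a list. Index into it with [num] to get a single child, and use .name to get that child's tag name.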
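A minimal sketch of contents, indexing, and .name, using a cut-down document assumed for this example:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

head_children = soup.head.contents  # direct children of <head>, as a list
first_child = head_children[0]      # index it like any other list
first_name = first_child.name       # the tag name of that child

print(head_children)
print(first_name)
```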

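You can also get a tag's children with children, but print(soup.head.children) does not show a list: children is an iterator, so what gets printed is an iterator object (<listiterator object at 0x108e6d150> in the author's run). Pass it to list() to turn it into a list, or simply loop over the children with a for statement.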
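The following sketch shows both options, converting children with list() and looping over it directly. The two-element head is an assumption made so there is more than one child to iterate over.

```python
from bs4 import BeautifulSoup

html_doc = "<head><title>The Dormouse's story</title><meta charset='utf-8'/></head>"
soup = BeautifulSoup(html_doc, "html.parser")

# children is an iterator, not a list.
is_list = isinstance(soup.head.children, list)

# Option 1: materialise it with list().
children_list = list(soup.head.children)

# Option 2: iterate over it directly.
names = []
for child in soup.head.children:
    names.append(child.name)

print(is_list, names)
```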

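As for the string attribute: it returns the text only when the tag has a single child. If the tag contains more than one child, it returns None. For example, print(soup.title.string) returns The Dormouse's story. When a tag has more than one child, use strings instead, which iterates over every string inside the tag.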
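To illustrate the difference, the sketch below compares string and strings on a tag with one child and a tag with several. The snippet is illustrative, and stripped_strings is shown as the whitespace-trimmed variant.

```python
from bs4 import BeautifulSoup

html_doc = """<div>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

title_p = soup.find("p", class_="title")
story_p = soup.find("p", class_="story")

single = title_p.string        # one child all the way down, so the text is returned
several = story_p.string       # more than one child, so this is None
pieces = list(story_p.strings)             # every string inside the tag
trimmed = list(story_p.stripped_strings)   # same, with surrounding whitespace removed

print(single, several)
print(pieces)
print(trimmed)
```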

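To move upward, use the parent attribute (it is an attribute, not a function); to walk through all of a tag's ancestors, iterate over parents.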
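A short sketch of moving upward from the title tag, on a cut-down document assumed for this example. The topmost ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title
parent_name = title.parent.name                   # the immediate parent
ancestor_names = [p.name for p in title.parents]  # every ancestor, innermost first

print(parent_name)
print(ancestor_names)
```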

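To get the next sibling use next_sibling, and to get the previous sibling use previous_sibling. To iterate over all of them, add an s to the attribute name (next_siblings, previous_siblings).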
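The sketch below walks between sibling tags. The tags are deliberately written with nothing between them; in a real document the text or whitespace between two tags is itself a sibling node, so next_sibling often returns a string rather than a tag.

```python
from bs4 import BeautifulSoup

# No whitespace between the tags, so every sibling here is a tag.
html_doc = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="link1")
last = soup.find(id="link3")

next_id = first.next_sibling["id"]                 # the sibling right after link1
no_previous = first.previous_sibling               # nothing comes before the first tag
following = [s["id"] for s in first.next_siblings]     # all later siblings
preceding = [s["id"] for s in last.previous_siblings]  # all earlier siblings, nearest first

print(next_id, no_previous)
print(following, preceding)
```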

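How do you search the tree?

Use the find_all function:

find_all(name, attrs, recursive, string, limit, **kwargs)

(The string argument was named text before Beautiful Soup 4.4.0.)

Some examples: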

print(soup.find_all('title'))
print(soup.find_all('p', 'title'))   # a second positional string matches the CSS class
print(soup.find_all('a'))
print(soup.find_all(id="link2"))
print(soup.find_all(id=True))        # every tag that has an id attribute

The return values are:

[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
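The find_all signature also takes a recursive argument, which none of the examples above exercise. As a minimal sketch on a cut-down document (assumed for this example), recursive=False restricts the search to direct children:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default: search all descendants of <html>.
deep = soup.html.find_all("title")

# recursive=False: look only at direct children of <html>.
# <title> sits inside <head>, so it is not found; <head> itself is.
shallow = soup.html.find_all("title", recursive=False)
direct = soup.html.find_all("head", recursive=False)

print(deep)
print(shallow)
print([tag.name for tag in direct])
```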

Searching by CSS class; straight to the examples:

print(soup.find_all("a", class_="sister"))
print(soup.select("p.title"))

Searching by attribute:
print(soup.find_all("a", attrs={"class": "sister"}))

Searching by text (the string argument, named text before Beautiful Soup 4.4.0):
print(soup.find_all(string="Elsie"))
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))

Limiting the number of results:
print(soup.find_all("a", limit=2))

The results are:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
['Elsie']
['Elsie', 'Lacie', 'Tillie']
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
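select accepts other CSS selectors besides tag.class. The sketch below tries a few common ones on an illustrative snippet; select_one is the variant that returns only the first match.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

by_class = [a["id"] for a in soup.select("a.sister")]      # tag plus class
by_id = soup.select_one("#link2").get_text()               # id selector, first match only
by_child = [a["id"] for a in soup.select("p.story > a")]   # direct children of p.story
titles = soup.select("p.title")                            # same selector as in the text

print(by_class, by_id, by_child)
print(titles)
```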

In short, these functions let you find whatever you are looking for in a document.
