Python: Tag Lookup and Information Extraction with BeautifulSoup

I. Finding a tags

(1) Find all a tags

>>> for x in soup.find_all('a'):

    print(x)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
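The session above assumes that a `soup` object already exists. A minimal sketch of how it could be built is shown here; the `html_doc` string is an assumption, reconstructed from the outputs printed in this post, and `"html.parser"` is just one parser choice.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this post.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# "html.parser" ships with Python; lxml or html5lib could be used instead if installed.
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching Tag in document order.
links = soup.find_all("a")
for tag in links:
    print(tag)
```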

(2) Find all a tags whose href attribute value contains the keyword "lacie"

>>> import re

>>> for x in soup.find_all('a', href=re.compile('lacie')):

    print(x)

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
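For a plain substring match like this one, a CSS attribute selector is a possible alternative to the regular expression. The sketch below compares the two; the `html` sample is an assumption modeled on this post's outputs.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the outputs shown in this post.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Regex filter: keeps tags whose href matches the pattern anywhere in the value.
by_regex = soup.find_all("a", href=re.compile("lacie"))

# CSS attribute selector: *= means "attribute value contains this substring".
by_css = soup.select('a[href*="lacie"]')

print([tag["id"] for tag in by_regex])
print([tag["id"] for tag in by_css])
```

The regex form is the more flexible of the two (anchors, alternation, flags), while the selector form is shorter when only a literal substring is needed.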

(3) Find all a tags whose string content contains the keyword "Elsie"

>>> for x in soup.find_all('a',string = re.compile('Elsie')):

    print(x)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
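One detail worth illustrating: when `string=` is combined with a tag name, the result holds Tag objects, but when it is used alone, the result holds the matching text nodes. The following is a small sketch, with an assumed `html` sample modeled on this post's outputs.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumed sample markup, modeled on the outputs shown in this post.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name + string filter: returns the <a> tags whose text matches.
tags = soup.find_all("a", string=re.compile("Elsie"))

# String filter alone: returns the text nodes themselves, not the tags.
strings = soup.find_all(string=re.compile("Elsie"))

print(tags)
print(strings)
```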

(4) Find all child tags of the body tag and print them in a loop

>>> import bs4

>>> for x in soup.find('body').children:

    if isinstance(x, bs4.element.Tag):        # use isinstance to skip non-Tag children such as whitespace-only strings

        print(x)

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>
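The `isinstance` check is needed because `.children` also yields the text nodes (newlines) between tags. As a sketch of an alternative, `find_all(True, recursive=False)` returns only the direct child tags; the `html` sample here is a shortened, assumed version of the document used in this post.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed, shortened sample markup.
html = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# Approach from the post: iterate over all children and keep only Tag objects.
filtered = [child for child in body.children if isinstance(child, Tag)]

# Alternative: True matches any tag; recursive=False limits the search to direct children.
direct = body.find_all(True, recursive=False)

print([tag.get("class") for tag in direct])
```

Both lists contain the two p tags; the nested b tag is excluded because it is a grandchild of body, not a direct child.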

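II. Information extraction (link extraction)

(1) Parse the tag structure, find all a tags, extract the value of each tag's href attribute (that is, the link), and store the values in an initially empty list.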

>>> linklist = []

>>> for x in soup.find_all('a'):

    link = x.get('href')

    if link:

        linklist.append(link)

>>> for x in linklist:        # verify: loop over linklist and print each link

    print(x)

http://example.com/elsie

http://example.com/lacie

http://example.com/tillie

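Summary: extracting a link means extracting an attribute value, which is done with x.get('href'). get returns None when the attribute is missing, hence the "if link" check.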
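The same extraction can be sketched as a single list comprehension. Passing `href=True` asks `find_all` to return only tags that actually have an href attribute, so no separate `None` check is needed. The `html` sample, including the extra anchor without an href, is an assumption added for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the first <a> deliberately has no href attribute.
html = """
<p class="story">
<a id="top">no link here</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that carry an href attribute,
# so indexing with a["href"] cannot raise KeyError here.
linklist = [a["href"] for a in soup.find_all("a", href=True)]

for link in linklist:
    print(link)
```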

(2) Parse the tag structure, find all a tags whose href contains the keyword "elsie", and store the links in an initially empty list.

>>> linklst = []

>>> for x in soup.find_all('a', href = re.compile('elsie')):

    link = x.get('href')

    if link:

        linklst.append(link)

>>> for x in linklst:        # verify: loop over linklst and print each link

    print(x)

http://example.com/elsie

Summary: the a tag search now also applies a regular-expression match to the href attribute value, via href = re.compile('elsie').
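If the keyword changes often, the pattern can be wrapped in a small helper. The function below is a hypothetical sketch (the name `extract_links` and its parameters are not part of BeautifulSoup), and the `html` sample is assumed. It escapes the keyword so that characters such as "." are matched literally.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the outputs shown in this post.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_links(soup, keyword, ignore_case=False):
    """Hypothetical helper: return hrefs of <a> tags whose href contains keyword."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape treats the keyword as literal text rather than as a regex.
    pattern = re.compile(re.escape(keyword), flags)
    return [a["href"] for a in soup.find_all("a", href=pattern)]


print(extract_links(soup, "elsie"))
print(extract_links(soup, "ELSIE", ignore_case=True))
```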

(3) Parse the tag structure, find all a tags, and print the string content of each tag.

>>> for x in soup.find_all('a'):

    string = x.get_text()

    print(string)

Elsie

Lacie

Tillie
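Text and link extraction can also be combined. The sketch below builds a dictionary from each tag's text to its href; `get_text(strip=True)` removes surrounding whitespace, which matters when the markup pads the text. The `html` sample, with padded text in the first tag, is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the first link text is padded with spaces on purpose.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">  Elsie  </a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each link's visible text to its target URL.
name_to_link = {
    a.get_text(strip=True): a.get("href")
    for a in soup.find_all("a")
}

for name, link in name_to_link.items():
    print(name, link)
```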
