Parsing web pages with Python BeautifulSoup4

from bs4 import BeautifulSoup as BS

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>

"""

soup=BS(html,'html.parser')

for i in soup.find_all('a'):

    print('i.text:',i.text)  # str; text inside an HTML comment is not included

    print('i.string:',i.string)  # NavigableString object; the content of an HTML comment is returned as well

print('soup.head.contents:',soup.head.contents,type(soup.head.contents))

print('soup.head.children:',soup.head.children,type(soup.head.children))

print('soup.body.contents:',soup.body.contents)  # returns a list of the child nodes

print('soup.body.children:',soup.body.children)  # returns an iterator over the child nodes

for i in soup.body.children:

    print(i)

print('All descendants:')

for i in soup.body.descendants:

    print(i)

print('soup.body.string:',soup.body.string)

print('soup.body.strings:',soup.body.strings)

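print('soup.body.stripped_strings:',soup.body.stripped_strings)  # strings with surrounding whitespace stripped; whitespace-only strings are skipped

print('Strings in body with whitespace stripped:')

for i in soup.body.stripped_strings: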

    print(i)

print('soup.a.parent:',soup.a.parent)

print('soup.a.next_sibling:',soup.a.next_sibling)  # note: text nodes and newline (\n) strings can also be the previous or next sibling of the current node

print('soup.a.previous_sibling:',soup.a.previous_sibling)

print('soup.a.next_element:',soup.a.next_element)  # the next element in parse order, not necessarily a sibling

print('soup.a.previous_element:',soup.a.previous_element)

print('All following siblings:\n')

for i in soup.a.next_siblings:

    print(i)

print('list(soup.a.next_elements)[1]:',list(soup.a.next_elements)[1])  # the second element after soup.a in parse order
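
The difference between .string and .text, and the way missing whitespace changes sibling navigation, is easier to see on a smaller document. The following is a minimal sketch, assuming bs4 is installed; the markup in `snippet` and all variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Tag

# Hypothetical markup, reduced from the example above.
snippet = '<p>names: <a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a> and</p>'
soup = BeautifulSoup(snippet, 'html.parser')

first = soup.a  # the first <a> tag (id="link1")

# .string returns the only child node; here that is a Comment,
# which is a subclass of NavigableString.
comment_node = first.string
is_comment = isinstance(comment_node, Comment)

# .text joins ordinary text only, so the comment is left out.
visible_text = first.text

# There is no whitespace between the two <a> tags, so the next sibling
# is a Tag; the previous sibling is the text node in front of the link.
next_sib = first.next_sibling
next_is_tag = isinstance(next_sib, Tag)
prev_sib = first.previous_sibling

# next_element follows parse order, so it descends into the tag first.
next_elem = first.next_element

print(repr(comment_node), is_comment)
print(repr(visible_text))
print(next_sib, next_is_tag, repr(prev_sib))
print(repr(next_elem))
```

If a newline were placed between the two links, next_sibling would be that newline string instead of the second tag.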

print('***********find_all*****')

print(soup.find_all('a'))

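print('Using a regular expression:')

import re

print(soup.find_all(re.compile(r'^title')))  # the regular expression is matched against tag names

print('Matching with a list:')

print(soup.find_all(['a','b']))

print('Matching with a function, similar to filter:')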

def func(tag):

    # Keep tags that have a class attribute and whose name starts with 'a'

    return tag.has_attr('class') and re.search(r'^a',tag.name) is not None

print(soup.find_all(func))
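
A function passed to find_all is called once per tag and should return a true value for the tags to keep. The sketch below illustrates this under stated assumptions: the markup in `snippet` and the helper name `is_link_with_class` are hypothetical and not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
snippet = '''
<p class="story"><a class="sister" href="/elsie">Elsie</a>
<a href="/plain">no class</a><b class="bold">bold</b></p>
'''
soup = BeautifulSoup(snippet, 'html.parser')


def is_link_with_class(tag):
    """Keep <a> tags that carry a class attribute."""
    return tag.name == 'a' and tag.has_attr('class')


# find_all calls the function for every tag in the document.
matches = soup.find_all(is_link_with_class)
print(matches)

# The same selection written by hand: find_all(True) yields every tag,
# and the list comprehension applies the predicate.
manual = [t for t in soup.find_all(True) if is_link_with_class(t)]
print(manual == list(matches))
```

Returning a plain boolean keeps the predicate reusable outside find_all, for example in the list comprehension above.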

print('Search by attribute value:')

print(soup.find_all(id='link1'))

print(soup.find_all('a',id='link1'))

print(soup.find_all(id='link2',href=re.compile(r'laci')))  # every find_all call returns a list

print(soup.find_all(class_='story'))  # note the trailing underscore: class is a Python keyword

print(soup.find_all(attrs={'class':'sister'}))

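print('Search by element content with the string argument (named text before bs4 4.4.0):')

print(soup.find_all(string='Tillie'))

print(soup.find_all(string=['Tillie','Lacie']))  # returns the matching strings, not the tags

print(soup.find_all(string=re.compile(r'ormous')))

print('From a matched string up to its enclosing elements:')

print(soup.find_all(string=re.compile(r'ormous'))[1].parent.parent)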

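# Limit the number of results

print('limit:')

print(soup.find_all('a',limit=2))

print('Search direct children only:')

print(soup.body.find_all('a',limit=2,recursive=False))  # recursive=False searches direct children only; prints [] here because every <a> is nested inside a <p>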

print(soup.body.find_all(class_='story',recursive=False))
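
The attribute, string, limit and recursive arguments shown above can be compared side by side on a small document. This is a minimal sketch, assuming bs4 4.4.0 or later (where the content filter is named string); the markup in `snippet` and the variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two links inside a <p>, one link directly under <body>.
snippet = '''<body><p class="story">
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</p><p class="story">...</p><a id="top" href="#">top</a></body>'''
soup = BeautifulSoup(snippet, 'html.parser')

# Attribute filters may be plain strings or compiled regular expressions.
lacie = soup.find_all('a', id='link2', href=re.compile(r'lacie'))

# class is a Python keyword, so the keyword argument is spelled class_.
stories = soup.find_all(class_='story')

# string= matches text nodes and returns the strings, not the tags;
# .parent leads back to the enclosing tag.
names = soup.find_all(string=['Elsie', 'Lacie'])
parent_ids = [s.parent['id'] for s in names]

# recursive=False inspects direct children only: the two sister links
# live inside <p>, so only the top-level <a> is found.
direct_links = soup.body.find_all('a', recursive=False)

# limit stops the search after the given number of matches.
first_link = soup.find_all('a', limit=1)

print(lacie)
print(len(stories), parent_ids)
print(direct_links)
print(first_link)
```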
