吴裕雄 -- Python Study Notes: The BeautifulSoup Module

import re

import requests

from bs4 import BeautifulSoup

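req_obj = requests.get('https://www.baidu.com')

req_obj.encoding = 'utf-8'     # the page is UTF-8; without this, requests decoded it as ISO-8859-1 and the title printed as mojibake

soup = BeautifulSoup(req_obj.text, 'lxml')

'''Finding tags'''

print(soup.title)              # returns only the first <title> tag

print(soup.find('title'))      # same result as the line above

print(soup.find_all('div'))    # returns every <div> tag as a list

<title>百度一下，你就知道</title>

<title>百度一下，你就知道</title>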

[<div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><in
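To make the difference between the three lookups easier to see than on the live Baidu page, here is a minimal sketch on a small inline document. The HTML string and the variable names are made up for illustration, and `html.parser` is used instead of `lxml` only so the snippet needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
html = '<div id="a"><div id="b"></div></div><p>x</p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.div                    # shorthand for soup.find('div')
same = soup.find('div')             # first match only
all_divs = soup.find_all('div')     # every match, nested ones included
missing = soup.find('span')         # find() returns None when nothing matches
no_spans = soup.find_all('span')    # find_all() returns an empty list instead

print(first['id'], same['id'])
print([d['id'] for d in all_divs])
print(missing, no_spans)
```

Because `find()` can return `None`, it is worth checking the result before indexing into it.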

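'''Reading a tag's attributes'''

tag = soup.div

print(tag)

# print(tag['class'])   # a multi-valued attribute such as class is returned as a list

print(tag['id'])      # look up the tag's id attribute

print(tag.attrs)      # all attributes of the tag, as a dict (attribute name: attribute value)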

<div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/>
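The attribute access above can be sketched on a small inline tag. The HTML and the names below are hypothetical; the point is only to show that `class` comes back as a list while `id` is a plain string, and that `.get()` avoids a `KeyError` for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
html = '<div id="wrapper" class="box main"><p>hello</p></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.div
tag_id = tag['id']            # single-valued attribute: a plain string
tag_classes = tag['class']    # multi-valued attribute: a list of strings
missing = tag.get('title')    # .get() returns None instead of raising KeyError

print(tag_id)
print(tag_classes)
print(tag.attrs)
print(missing)
```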

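'''The string inside a tag'''

tag = soup.title

print(tag.string)                 # the string contained in the tag

print(tag.string.replace_with("哈哈"))    # a string cannot be edited in place, only replaced; replace_with returns the old string, so this prints the original title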
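A short sketch of what `replace_with` returns and what it leaves behind in the tree, using a made-up `<title>` rather than the Baidu one:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
soup = BeautifulSoup('<head><title>old title</title></head>', 'html.parser')
tag = soup.title

old = tag.string.replace_with('new title')   # returns the string that was removed

print(old)           # the previous text
print(tag.string)    # the text now in the tree
print(tag)
```

Assigning `tag.string = 'new title'` has the same effect on the tree when the old value is not needed.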

'''Working with child nodes'''

tag = soup.head

print(tag.title)     # get the head tag first, then the child tag it contains

<title>哈哈</title>

'''.contents and .children'''

tag = soup.body

print(tag.contents)        # the tag's direct children, as a list

print([child for child in tag.children])      # same output as above; .children is an iterator over the same nodes

[' ', <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/>
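As a small illustration of the two attributes, assuming a made-up two-paragraph body: `.contents` is an actual list, while `.children` has to be iterated (or wrapped in `list()`) to see the same nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
html = '<body><p>one</p><p>two</p></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

as_list = body.contents          # a real list, so indexing and len() work
as_iter = list(body.children)    # an iterator, materialized here for comparison

print(len(as_list), as_list[0])
print(as_list == as_iter)
```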

'''.descendants'''

tag = soup.body

for child_tag in tag.descendants:    # walk every child, grandchild, and so on
    print(child_tag)

<div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/>
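To contrast direct children with descendants, here is a sketch on a hypothetical nested body. Note that `.descendants` yields text nodes as well as tags, in document order.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical snippet, not taken from the Baidu page
html = '<body><div><p>one</p></div><p>two</p></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# Direct children only
direct = [child.name for child in body.contents]

# Every node below body: tag names for tags, the text itself for strings
everything = [node.name if isinstance(node, Tag) else str(node)
              for node in body.descendants]

print(direct)
print(everything)
```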

'''.strings and .stripped_strings'''

tag = soup.body

for text in tag.strings:             # print every text node
    print(text)

for text in tag.stripped_strings:    # same, with surrounding whitespace stripped and whitespace-only strings skipped
    print(text)
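The difference between the two generators shows up as soon as the markup contains indentation. A minimal sketch, with a made-up body and hypothetical variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
html = '<body>\n  <p> first </p>\n  <p>second</p>\n</body>'
soup = BeautifulSoup(html, 'html.parser')

raw = list(soup.body.strings)               # includes the whitespace between tags
clean = list(soup.body.stripped_strings)    # stripped, whitespace-only strings dropped

print(raw)
print(clean)
```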

'''.parent and .parents'''

tag = soup.title

print(tag.parent)   # print the tag's parent tag

<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>哈哈</title></head>

for parent in tag.parents:    # print every ancestor, from the direct parent up to the document root
    print(parent)

<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>哈哈</title></head>
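Printing whole ancestors gets verbose quickly, so a sketch that prints only their names may be easier to follow. The document below is hypothetical; the last entry, `[document]`, is the name Beautiful Soup gives the `BeautifulSoup` object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
html = '<html><head><title>t</title></head></html>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title

parent_name = tag.parent.name                      # the direct parent
ancestor_names = [p.name for p in tag.parents]     # parent, grandparent, ..., root

print(parent_name)
print(ancestor_names)
```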

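'''.next_siblings and .previous_siblings

    Iterate over all siblings after / before the current node

'''

'''.next_element and .previous_element

    The node parsed immediately after / before the current one; this is not necessarily a sibling (it is often the tag's first child or the text inside it)

'''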
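Siblings and elements are easy to confuse, so here is a minimal sketch on a made-up paragraph showing that the next sibling of `<b>` is `<i>`, while its next element is the text inside `<b>` itself.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Baidu page
html = '<p><b>bold</b><i>italic</i></p>'
soup = BeautifulSoup(html, 'html.parser')
b = soup.b

print(b.next_sibling)                       # the <i> tag: same parent, next in line
print(b.next_element)                       # the string inside <b>: next node parsed
print([s.name for s in b.next_siblings])    # all following siblings
print(soup.i.previous_sibling)              # the <b> tag
print(soup.i.previous_element)              # the string inside <b>, parsed just before <i>
```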

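'''Keyword arguments of find_all'''

soup.find_all(id='link2')                   # find all tags whose id attribute equals 'link2'

soup.find_all(href=re.compile("elsie"))     # with the href argument, Beautiful Soup matches each tag's href attribute against the pattern

soup.find_all(id=True)                       # find all tags that have an id attribute, whatever its value

soup.find_all(href=re.compile("elsie"), id='link1')         # filters can also be combined

soup.find_all(attrs={"attribute-name": "attribute-value"})  # the same lookup can be written as a dict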
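The `link1`, `link2` and `elsie` values above do not occur on the Baidu page, so the calls are easier to try on a small document that contains them. The HTML below is a made-up stand-in with three links; the `example.com` URLs and variable names are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document containing the ids and hrefs used above
html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='link2')                                 # exact id match
by_href = soup.find_all(href=re.compile('elsie'))                 # regex on href
with_id = soup.find_all(id=True)                                  # any tag with an id
combined = soup.find_all(href=re.compile('elsie'), id='link1')    # both must match
by_attrs = soup.find_all(attrs={'class': 'sister'})               # dict form

print([t['id'] for t in by_id])
print([t['id'] for t in by_href])
print(len(with_id), len(combined), len(by_attrs))
```

The `attrs` dict is mainly useful for attribute names that cannot be written as Python keyword arguments, such as `data-*` attributes.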
