Parsing Web Page Information with Python

This article shows how to fetch the page behind a URL in Python and parse information out of it.

1: Fetching the page with requests

#!/usr/bin/python3

# -*- coding: UTF-8 -*-

"""

 @Author: zh

 @Time 2023/11/23 13:13

 @Describe:

"""

# Import the requests library

import requests

url = "https://www.baidu.com/"

# Fetch the page with the get method of the requests library

response = requests.get(url)

# Set the encoding used to decode the response body

response.encoding = 'utf-8'

# Get the page content as text (str)

text = response.text

print(text)

# Get the page content as raw bytes

content = response.content

print(content)

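When the response's Content-Type header declares a text type but no charset, requests falls back to ISO-8859-1 when decoding response.text, so we set the encoding to utf-8 explicitly to keep the output from being garbled.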

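Alternatively, we can let requests detect the encoding from the response body; for this page both approaches produce the same output: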

response.encoding = response.apparent_encoding

The output is as follows:

<!DOCTYPE html>

<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');

                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ 
(window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');\r\n                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a>&nbsp;\xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
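
To see why the encoding setting matters, here is a small sketch that needs no network access. It decodes a few bytes copied from the binary output above (the value of the submit button) in two ways. The variable names are only illustrative and are not part of the original script.

```python
# Bytes copied from the binary output above (the submit button's value).
raw = b"\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b"

# Decoding with the page's real charset recovers the four Chinese characters.
as_utf8 = raw.decode("utf-8")

# ISO-8859-1 maps every single byte to one character, so decoding never
# fails, but each 3-byte UTF-8 sequence becomes three unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))  # prints: 4 12

# No bytes are lost: re-encoding the garbled text gives back the original
# bytes, which can then be decoded correctly.
print(as_latin1.encode("iso-8859-1").decode("utf-8") == as_utf8)  # prints: True
```

This is the same effect that appears when response.text is read with the wrong response.encoding: the bytes in response.content are intact, only the decoding step is wrong.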

2: Parsing the HTML page with BeautifulSoup

First, install it:

pip install beautifulsoup4

from bs4 import BeautifulSoup

...

# Parse the HTML page

soup = BeautifulSoup(text, 'html.parser')

# Get the page title

title = soup.title

print(title)

# Get all links in the page

links = soup.find_all('a')

for link in links:

    print(link.get('href'))

The console shows the page title and the link addresses:

<title>百度一下，你就知道</title>

http://news.baidu.com

https://www.hao123.com

http://map.baidu.com

http://v.baidu.com

http://tieba.baidu.com

http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1

//www.baidu.com/more/

http://home.baidu.com

http://ir.baidu.com

http://www.baidu.com/duty/

http://jianyi.baidu.com/
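
Note that one of the links above, //www.baidu.com/more/, is protocol-relative rather than a full URL. The original script does not handle this; as an optional extra step, the sketch below resolves such links against the page URL with urljoin from the standard library. The helper name absolute_links is hypothetical, and the check for None is there because link.get('href') returns None for an <a> tag without an href attribute.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping missing (None) values.

    Hypothetical helper for illustration; not part of the original script.
    """
    return [urljoin(base_url, href) for href in hrefs if href is not None]


base_url = "https://www.baidu.com/"

# A few values as they could come back from link.get('href').
hrefs = ["http://news.baidu.com", "//www.baidu.com/more/", None]

for link in absolute_links(base_url, hrefs):
    print(link)
# prints:
# http://news.baidu.com
# https://www.baidu.com/more/
```

Links that are already absolute pass through unchanged, while the protocol-relative one takes the scheme of the page it was found on.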

3: Parsing the page with lxml

from lxml import etree

...

# Parse the HTML page

tree = etree.HTML(text)

# Get the page title

title = tree.xpath('//title/text()')[0]

print(title)

# Get all links in the page

links = tree.xpath('//a/@href')

for link in links:

    print(link)

The output is as follows:

百度一下，你就知道

http://news.baidu.com

https://www.hao123.com

http://map.baidu.com

http://v.baidu.com

http://tieba.baidu.com

http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1

//www.baidu.com/more/

http://home.baidu.com

http://ir.baidu.com

http://www.baidu.com/duty/

http://jianyi.baidu.com/
