Parsing Web Page Information with Python

This article shows how to fetch the page behind a URL in Python and extract information from it.

1: Fetching page content with requests

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
"""
 @Author: zh
 @Time 2023/11/23 13:13
 @Describe:
"""
# Import the requests library
import requests

url = "https://www.baidu.com/"
# Fetch the page with requests.get
response = requests.get(url)
# Set the encoding used to decode the response body
response.encoding = 'utf-8'
# The body decoded to text
text = response.text
print(text)
# The body as raw bytes
content = response.content
print(content)

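When a response's Content-Type header declares a text type but no charset, requests falls back to ISO-8859-1. This page is actually UTF-8 (see its meta tag), so we set the encoding to utf-8 explicitly; otherwise the Chinese text in the output is garbled.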
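The garbling can be reproduced without any network request. The following is a minimal sketch using only built-in string methods; the sample text is taken from the page title, and the variable names are illustrative rather than part of the original script:

```python
# UTF-8 bytes, as a server would send them (sample text from the page title)
raw = "百度一下".encode("utf-8")

# Decoding with the wrong codec yields one garbage character per byte
garbled = raw.decode("iso-8859-1")
# Decoding with the right codec restores the text
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)

# ISO-8859-1 maps every byte to a character, so the mistake is reversible
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == correct)
```

Because ISO-8859-1 never fails to decode, the wrong choice does not raise an error; it only shows up as unreadable text.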

Alternatively, let requests detect the encoding from the response body. For this page the output is the same:

response.encoding = response.apparent_encoding

Output:

<!DOCTYPE html>

<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');

                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ 
(window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');\r\n                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a>&nbsp;\xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
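The script above assumes the request succeeds and returns promptly. A slightly more defensive variant might look like the sketch below. The function name fetch_html and the 10-second default timeout are assumptions made for illustration, not part of the original script; timeout, raise_for_status and apparent_encoding are standard requests features.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Fetch a page and return its text, failing loudly on HTTP errors."""
    # Without a timeout, requests.get can wait indefinitely
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    # If the server declared no charset, fall back to detection from the body
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text


if __name__ == "__main__":
    print(fetch_html("https://www.baidu.com/")[:200])
```

Only overriding the encoding when the server gave no charset keeps an explicitly declared encoding intact.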

2: Parsing the HTML page with BeautifulSoup

First install it:

pip install beautifulsoup4

from bs4 import BeautifulSoup

...

# Parse the HTML page
soup = BeautifulSoup(text, 'html.parser')
# Get the page title
title = soup.title
print(title)
# Get all links on the page
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

The console shows the page title and the link addresses:

<title>百度一下，你就知道</title>
http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1
//www.baidu.com/more/
http://home.baidu.com
http://ir.baidu.com
http://www.baidu.com/duty/
http://jianyi.baidu.com/
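The extracted links mix absolute URLs with a protocol-relative one (//www.baidu.com/more/), and link.get('href') returns None for an anchor without an href. A minimal sketch of normalizing such values with the standard library's urllib.parse.urljoin follows; the helper name absolutize and the sample list are hypothetical, and the base URL is assumed to be the one the page was fetched from:

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, skipping missing or empty values."""
    return [urljoin(base_url, href) for href in hrefs if href]


base = "https://www.baidu.com/"
# Sample values: an absolute URL, a protocol-relative URL, a missing href
hrefs = ["http://news.baidu.com", "//www.baidu.com/more/", None]

for link in absolutize(base, hrefs):
    print(link)
```

A protocol-relative href inherits the scheme of the base URL, while an already absolute URL is returned unchanged.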

3: Parsing the page with lxml

from lxml import etree

...

# Parse the HTML page
tree = etree.HTML(text)
# Get the page title (xpath returns a list; [0] assumes a <title> exists)
title = tree.xpath('//title/text()')[0]
print(title)
# Get all links on the page
links = tree.xpath('//a/@href')
for link in links:
    print(link)

Output:

百度一下，你就知道
http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1
//www.baidu.com/more/
http://home.baidu.com
http://ir.baidu.com
http://www.baidu.com/duty/
http://jianyi.baidu.com/
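Once a link has been extracted, the URL itself can be taken apart. The sketch below is an illustration rather than part of the original script: it uses the standard library's urllib.parse on the login link from the output above, and the variable names are assumptions.

```python
from urllib.parse import urlparse, parse_qs

link = ("http://www.baidu.com/bdorz/login.gif"
        "?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1")

# Split the URL into scheme, host, path, query and so on
parts = urlparse(link)
# Decode the query string; keep_blank_values keeps the bare "login" key
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.scheme)
print(parts.netloc)
print(parts.path)
print(params)
```

parse_qs also percent-decodes the values, so the nested URL in the u parameter comes back in readable form.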
