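02_Python Simple Crawler (Panda TV LOL streamers: who is the strongest!)
Disclaimer:
This article is only a Python exercise and involves no malicious attack of any kind!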
# Import the request module
from urllib import request
# Import the re module
import re


class Spider():
    # The URL must start with http or https
    # Page to crawl: the LOL category on Panda TV (streamer name and viewer count)
    url_to_run = r'https://www.panda.tv/cate/lol'
    htmls = None  # Holds the fetched HTML content
    # Non-greedy match up to the nearest </div>; this is the parent tag
    # of the streamer-name and viewer-count tags
    root_pattern = '<div class="video-info">(.*?)</div>'
    # Non-greedy match from </i> to the nearest </span>: the streamer name
    name_pattern = '</i>(.*?)</span>'
    # Non-greedy match up to the nearest </span>: the viewer count
    number_pattern = '<span class="video-number">(.*?)</span>'
    # Final results; each element is {'name': streamer name, 'number': viewer count}
    result_list = []

    @classmethod
    def fetch_content(cls):
        """
        Simulate a browser and request the target page from the server.
        Store the returned HTML page in Spider.htmls as a string.
        :return: None
        """
        # urlopen wraps the server's response in a file-like object (a response instance)
        result = request.urlopen(cls.url_to_run)
        # print(result.getcode())  # HTTP status code; 200 means the page was fetched normally
        # print(result.geturl())   # URL actually fetched; shows whether the page was redirected
        cls.htmls = result.read()  # Raw HTML page content, bytes
        cls.htmls = str(cls.htmls, encoding='utf-8')  # Decode the bytes into a str

    @classmethod
    def analysis(cls):
        """
        Analyse the HTML page stored in Spider.htmls and extract
        1) the streamer name
        2) the viewer count
        Each name/count pair becomes one dict appended to cls.result_list.
        :return: None
        """
        # root_pattern uses a group, so the results no longer contain the outer video-info tag
        video_info_lst = re.findall(cls.root_pattern, cls.htmls, flags=re.S)
        for video in video_info_lst:
            up_host = re.findall(cls.name_pattern, video, flags=re.S)
            video_number = re.findall(cls.number_pattern, video, flags=re.S)
            # Skip blocks in which either pattern found nothing
            if not up_host or not video_number:
                continue
            # Take the first match and strip the newlines and spaces around the name
            up_host = up_host[0].strip()
            # Take video_number out of its list
            video_number = video_number[0]
            # Combine the name and the count into a dict and add it to the results
            dic = {'name': up_host, 'number': video_number}
            cls.result_list.append(dic)

    @classmethod
    def sort_seed(cls, item):
        """
        The elements of result_list are dicts, which cannot be compared directly.
        sorted() passes each dict to sort_seed, which returns a number derived
        from the dict's 'number' field to serve as the sort key.
        :return: the viewer count in item['number'] as a float
        """
        # Match the decimal part too, so that '1.5万' is read as 1.5 rather than 1
        r = re.findall(r'\d+(?:\.\d+)?', item['number'])
        number = float(r[0]) if r else 0.0
        # Convert counts expressed in units of 万 (10,000)
        if '万' in item['number']:
            number *= 10000
        return number

    @classmethod
    def sort_result(cls):
        """
        Sort the elements of cls.result_list by viewer count, highest first.
        :return: None
        """
        # sorted(iterable, key=None, reverse=False)
        cls.result_list = sorted(cls.result_list, key=cls.sort_seed, reverse=True)

    @classmethod
    def show(cls):
        print("Total Uphost: " + str(len(cls.result_list)))
        print('='*45)
        # enumerate supplies the rank directly, even when two entries are equal
        for rank, item in enumerate(cls.result_list, start=1):
            print('Uphost:' + item['name'] + " ," + "Rank: " + str(rank) + ' Video Watched: ' + item['number'])

    @classmethod
    def go(cls):
        cls.fetch_content()
        cls.analysis()
        cls.sort_result()
        cls.show()


# Test code for the class
Spider.go()
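The three patterns in the listing can be tried offline, without requesting the live page. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical fragment shaped the way the patterns expect (the real page markup may differ), and `extract` is a helper name introduced here, not part of the `Spider` class.

```python
import re

# Hypothetical HTML fragment, written to fit the patterns from the listing.
# The markup of the real page is an assumption and may look different.
SAMPLE_HTML = """
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        streamer_a
    </span>
    <span class="video-number">1.5万</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        streamer_b
    </span>
    <span class="video-number">830</span>
</div>
"""

ROOT_PATTERN = r'<div class="video-info">(.*?)</div>'
NAME_PATTERN = r'</i>(.*?)</span>'
NUMBER_PATTERN = r'<span class="video-number">(.*?)</span>'


def extract(html):
    """Return a list of {'name': ..., 'number': ...} dicts found in html."""
    results = []
    # re.S lets '.' match newlines, so a block may span several lines;
    # the non-greedy (.*?) stops at the first closing tag after the opener.
    for block in re.findall(ROOT_PATTERN, html, flags=re.S):
        names = re.findall(NAME_PATTERN, block, flags=re.S)
        numbers = re.findall(NUMBER_PATTERN, block, flags=re.S)
        if not names or not numbers:
            continue  # skip blocks that do not have both fields
        results.append({'name': names[0].strip(),
                        'number': numbers[0].strip()})
    return results


for row in extract(SAMPLE_HTML):
    print(row)
```

Without the `?` in `(.*?)`, the root pattern would run to the last `</div>` in the document and return one huge block instead of one block per streamer.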
Partial output from an actual run:
Total Uphost:
=============================================
Uphost:即将拥有人鱼线的PDD ,Rank: Video Watched: .7万
Uphost:RNG丶MLXG ,Rank: Video Watched: .5万
Uphost:熊猫伏念 ,Rank: Video Watched: .7万
Uphost:药水哥s ,Rank: Video Watched: .3万
Uphost:WE丶Mystic丶 ,Rank: Video Watched: .0万
Uphost:叫我官人 ,Rank: Video Watched: .5万
Uphost:冠军锐雯 ,Rank: Video Watched: .5万
Uphost:熊猫丶蛮神 ,Rank: Video Watched: .3万
Uphost:起飛的辛德浪 ,Rank: Video Watched: .6万
Uphost:善言_ ,Rank: Video Watched: .9万
Uphost:左手QAQ ,Rank: Video Watched: .3万
Uphost:S7全球总决赛 ,Rank: Video Watched: .2万
Uphost:Pino一米八 ,Rank: Video Watched: .2万
Uphost:金三炮o金三岁 ,Rank: Video Watched:
Uphost:挽神z ,Rank: Video Watched:
Uphost:易小埋l ,Rank: Video Watched:
Uphost:主播毕老实 ,Rank: Video Watched:
Uphost:一剑西来QAQ ,Rank: Video Watched:
Uphost:英雄联盟活动直播间 ,Rank: Video Watched:
Uphost:超级提莫丶牛腩君 ,Rank: Video Watched:
Uphost:mid六安王 ,Rank: Video Watched:
Uphost:熊猫丶乐鱼阿卡丽 ,Rank: Video Watched:
Uphost:熊猫TV一休哥 ,Rank: Video Watched:
Uphost:小黑胖砸 ,Rank: Video Watched:
Uphost:或许这就是离岛吧 ,Rank: Video Watched:
Uphost:第一最寂寞1u ,Rank: Video Watched:
Uphost:李阿特 ,Rank: Video Watched:
Uphost:LOL日常活动直播间 ,Rank: Video Watched:
Uphost:LPL熊猫官方直播 ,Rank: Video Watched:
Uphost:熊猫TV丶小青龙 ,Rank: Video Watched:
Uphost:熊猫TV灬小豆豆 ,Rank: Video Watched:
Uphost:小啊雅大大大 ,Rank: Video Watched:
Uphost:小凯南zz ,Rank: Video Watched:
Uphost:拿铁不加糖 ,Rank: Video Watched:
Uphost:金克喵的猫珥朵丶 ,Rank: Video Watched:
Uphost:炽天使z1 ,Rank: Video Watched:
Uphost:小小小女人丶 ,Rank: Video Watched:
Uphost:東東東 ,Rank: Video Watched:
Uphost:纯纯小流_氓 ,Rank: Video Watched:
Uphost:熊猫tv芭比公主 ,Rank: Video Watched:
Uphost:big火鸡 ,Rank: Video Watched:
Uphost:机器猫mmm ,Rank: Video Watched:
Uphost:大家都叫我冷爷丶 ,Rank: Video Watched:
Uphost:栗子菌i ,Rank: Video Watched:
Uphost:星矢魔术 ,Rank: Video Watched:
Uphost:唐人leo ,Rank: Video Watched:
Uphost:十级浪 ,Rank: Video Watched:
Uphost:筱兮QAQ ,Rank: Video Watched:
Uphost:酥软迷妹小慢慢Zz ,Rank: Video Watched:
Uphost:小凡Aaaaaa ,Rank: Video Watched:
Uphost:小丸子爱吃樱桃丶 ,Rank: Video Watched:
Uphost:爱流血的兔斯基 ,Rank: Video Watched:
Uphost:凶残的喵绵绵 ,Rank: Video Watched:
Uphost:别叫凯隐叫隐神 ,Rank: Video Watched:
Uphost:Panda初心2018 ,Rank: Video Watched:
Uphost:熊猫丶大风6 ,Rank: Video Watched:
Uphost:顽皮ssssssssssss ,Rank: Video Watched:
Uphost:大表哥响尾蛇 ,Rank: Video Watched:
Uphost:告白White ,Rank: Video Watched:
Uphost:牌面之王丶火影劫 ,Rank: Video Watched:
Uphost:西湖仙境 ,Rank: Video Watched:
Uphost:飞不起来1 ,Rank: Video Watched:
Uphost:逗了个蛋 ,Rank: Video Watched:
Uphost:瓜皮球球 ,Rank: Video Watched:
Uphost:竹蜻蜓呀 ,Rank: Video Watched:
Uphost:少年阿超和阿斌 ,Rank: Video Watched:
Uphost:刚出土的i帕帕 ,Rank: Video Watched:
Uphost:小主播安旭 ,Rank: Video Watched:
Uphost:西决哟 ,Rank: Video Watched:
Uphost:Panda丶夏木 ,Rank: Video Watched:
Uphost:冰雪丶狐狸 ,Rank: Video Watched:
Uphost:夜魅丝 ,Rank: Video Watched:
Uphost:熊猫丶皮皮瓜 ,Rank: Video Watched:
Uphost:Panda灬刀刀 ,Rank: Video Watched:
Uphost:莫莫莫夏夏夏 ,Rank: Video Watched:
Uphost:皮皮翔i ,Rank: Video Watched:
Uphost:南表妹QAQ ,Rank: Video Watched:
Uphost:青蛙OB ,Rank: Video Watched:
Uphost:_Infi_ ,Rank: Video Watched:
Uphost:暴躁茹阿姨 ,Rank: Video Watched:
Uphost:整天打碟的DJ胖丶 ,Rank: Video Watched:
Uphost:熊猫丶一百 ,Rank: Video Watched:
Uphost:全蛋狮子喵 ,Rank: Video Watched:
Uphost:熊猫TV丶小66 ,Rank: Video Watched:
Uphost:电竞张全蛋长长 ,Rank: Video Watched:
Uphost:熊猫第一不亏哥 ,Rank: Video Watched:
Uphost:叫我东邪 ,Rank: Video Watched:
Uphost:熊猫TV丶一手绝 ,Rank: Video Watched:
Uphost:熊猫TV丶别勉强 ,Rank: Video Watched:
Uphost:提莫的小女朋友 ,Rank: Video Watched:
Uphost:王者蕾 ,Rank: Video Watched:
Uphost:日暮哟 ,Rank: Video Watched:
Uphost:颖妹er超甜的 ,Rank: Video Watched:
Uphost:熊猫TV丶成小七 ,Rank: Video Watched:
Uphost:熊猫tv丶马小越 ,Rank: Video Watched:
Uphost:柒柒天 ,Rank: Video Watched:
Uphost:Panda电竞白子画 ,Rank: Video Watched:
Uphost:熊猫TV_苏璞 ,Rank: Video Watched:
Uphost:你的小老虎哥哥 ,Rank: Video Watched:
Uphost:门徒zzzz ,Rank: Video Watched:
Uphost:李易钧 ,Rank: Video Watched:
Uphost:熊猫TV丶农药术士 ,Rank: Video Watched:
Uphost:熊猫贝乐 ,Rank: Video Watched:
Uphost:李小青盲僧 ,Rank: Video Watched:
Uphost:刘慕宸 ,Rank: Video Watched:
Uphost:寒风强袭 ,Rank: Video Watched:
Uphost:会蛙泳的饼干0 ,Rank: Video Watched:
Uphost:阿四德莱文丶 ,Rank: Video Watched:
Uphost:知道神龙摆尾吗 ,Rank: Video Watched:
Uphost:瓦罗兰的未来丶尨 ,Rank: Video Watched:
Uphost:JO丶欣欣 ,Rank: Video Watched:
Uphost:123ivan456 ,Rank: Video Watched:
Uphost:only丶提莫 ,Rank: Video Watched:
Uphost:情话好听但不暖心 ,Rank: Video Watched:
Uphost:小丸子真好吃 ,Rank: Video Watched:
Uphost:一只提莫送你回家 ,Rank: Video Watched:
Uphost:请叫我大腿岩丶 ,Rank: Video Watched:
Uphost:伊人芳泽瑞尔心i ,Rank: Video Watched:
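The ranking depends entirely on how the viewer-count string is turned into a number. As a minimal sketch (the helper name `to_count` and the sample rows are made up for illustration), the conversion can keep the decimal part before applying the 万 (10,000) multiplier. A pattern of `\d+` alone would read both `1.5万` and `1.2万` as 1 and rank them as equal.

```python
import re


def to_count(text):
    """Convert strings such as '9800' or '1.5万' into a float viewer count."""
    # Integer part with an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0  # assumption: treat unparsable text as zero viewers
    value = float(match.group())
    if '万' in text:
        value *= 10000  # 万 means ten thousand
    return value


# Made-up sample rows, not real scraped data
rows = [
    {'name': 'a', 'number': '9800'},
    {'name': 'b', 'number': '1.5万'},
    {'name': 'c', 'number': '1.2万'},
    {'name': 'd', 'number': '12万'},
]

ranked = sorted(rows, key=lambda row: to_count(row['number']), reverse=True)
for rank, row in enumerate(ranked, start=1):
    print(rank, row['name'], row['number'])
```

Using `enumerate(..., start=1)` for the rank avoids calling `list.index` on every row, which is slower and returns the first position when two rows compare equal.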