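02_Simple Python Crawler (Panda TV LOL streamers: who is the strongest?)
Disclaimer:
This article is only a Python practice exercise and involves no malicious attack of any kind!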
# Import the request module
from urllib import request
# Import the re module
import re


class Spider():
    # The URL must start with http or https
    # Page to scrape: Panda TV, LOL category (streamer names and viewer counts)
    url_to_run = r'https://www.panda.tv/cate/lol'
    htmls = None  # Holds the fetched HTML content
    # Non-greedy match up to the nearest </div>; this tag is the parent of
    # both the streamer-name tag and the viewer-count tag
    root_pattern = '<div class="video-info">(.*?)</div>'
    # Non-greedy match from </i> to the nearest </span>: the streamer name
    name_pattern = '</i>(.*?)</span>'
    # Non-greedy match up to the nearest </span>: the viewer count
    number_pattern = '<span class="video-number">(.*?)</span>'
    # Final results; each element is {'name': streamer name, 'number': viewer count}
    result_list = []

    @classmethod
    def fetch_content(cls):
        """
        Send a request for the target page, as a browser would,
        and store the returned HTML as a string in Spider.htmls
        :return: None
        """
        # urlopen wraps the server's reply in a file-like object (a response instance)
        result = request.urlopen(cls.url_to_run)
        # print(result.getcode())  # HTTP status code; 200 means the page was fetched
        # print(result.geturl())   # URL actually fetched; shows whether a redirect happened
        cls.htmls = result.read()  # Raw HTML content, bytes
        cls.htmls = str(cls.htmls, encoding='utf-8')  # Decode the bytes into a str

    @classmethod
    def analysis(cls):
        """
        Parse the HTML stored in Spider.htmls and extract
        1) the streamer name
        2) the viewer count
        Each pair becomes one dict appended to cls.result_list
        :return: None
        """
        # root_pattern uses a group, so the outer video-info tag is not in the result
        video_info_lst = re.findall(cls.root_pattern, cls.htmls, flags=re.S)
        for video in video_info_lst:
            up_host = re.findall(cls.name_pattern, video, flags=re.S)
            video_number = re.findall(cls.number_pattern, video, flags=re.S)
            # Skip blocks where either field is missing
            if not up_host or not video_number:
                continue
            # Take the first match and strip the surrounding newlines and spaces
            up_host = up_host[0].strip()
            # Take the viewer count out of the list
            video_number = video_number[0]
            # Combine name and count into a dict and add it to the results
            dic = {'name': up_host, 'number': video_number}
            cls.result_list.append(dic)

    @classmethod
    def sort_seed(cls, item):
        """
        The elements of result_list are dicts, which cannot be compared directly.
        sorted passes each dict to sort_seed, which returns a numeric value
        derived from its 'number' field to be used as the sort key
        :return: the viewer count as a float
        """
        # Match the decimal part too, so '1.5万' yields 1.5 rather than 1
        r = re.findall(r'\d+(?:\.\d+)?', item['number'])
        if not r:
            return 0.0
        number = float(r[0])
        # Convert counts given in units of 万 (ten thousand)
        if '万' in item['number']:
            number *= 10000
        return number

    @classmethod
    def sort_result(cls):
        """
        Sort the elements of cls.result_list by viewer count, highest first
        :return: None
        """
        # sorted(iterable, key=None, reverse=False)
        cls.result_list = sorted(cls.result_list, key=cls.sort_seed, reverse=True)

    @classmethod
    def show(cls):
        print("Total Uphost: " + str(len(cls.result_list)))
        print('=' * 45)
        # enumerate gives the rank directly and stays correct for duplicate entries
        for rank, item in enumerate(cls.result_list, start=1):
            print('Uphost:' + item['name'] + " ," + "Rank: " + str(rank) + ' Video Watched: ' + item['number'])

    @classmethod
    def go(cls):
        cls.fetch_content()
        cls.analysis()
        cls.sort_result()
        cls.show()


# Test run of the class
Spider.go()
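The extraction step can be tried without any network access. The following is a minimal sketch that applies the same three patterns to a small HTML fragment. The fragment, the `video-nickname` class, the streamer names, and the `extract` helper are all made up for illustration, and the real page markup may differ.

```python
import re

# Hypothetical HTML fragment, invented for this example only.
SAMPLE_HTML = """
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        streamer_a
    </span>
    <span class="video-number">1.5万</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        streamer_b
    </span>
    <span class="video-number">8042</span>
</div>
"""

# Same patterns as the Spider class attributes.
ROOT_PATTERN = '<div class="video-info">(.*?)</div>'
NAME_PATTERN = '</i>(.*?)</span>'
NUMBER_PATTERN = '<span class="video-number">(.*?)</span>'


def extract(html):
    """Return a list of {'name', 'number'} dicts found in html."""
    results = []
    # re.S lets '.' match newlines, so a block may span several lines.
    for block in re.findall(ROOT_PATTERN, html, flags=re.S):
        names = re.findall(NAME_PATTERN, block, flags=re.S)
        numbers = re.findall(NUMBER_PATTERN, block, flags=re.S)
        if names and numbers:
            results.append({'name': names[0].strip(),
                            'number': numbers[0].strip()})
    return results


if __name__ == '__main__':
    for row in extract(SAMPLE_HTML):
        print(row)
```

Without `flags=re.S` the root pattern finds nothing in this fragment, because every block contains line breaks that `.` would not match.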
Partial results from an actual test run:
Total Uphost:
=============================================
Uphost:即将拥有人鱼线的PDD ,Rank: Video Watched: .7万
Uphost:RNG丶MLXG ,Rank: Video Watched: .5万
Uphost:熊猫伏念 ,Rank: Video Watched: .7万
Uphost:药水哥s ,Rank: Video Watched: .3万
Uphost:WE丶Mystic丶 ,Rank: Video Watched: .0万
Uphost:叫我官人 ,Rank: Video Watched: .5万
Uphost:冠军锐雯 ,Rank: Video Watched: .5万
Uphost:熊猫丶蛮神 ,Rank: Video Watched: .3万
Uphost:起飛的辛德浪 ,Rank: Video Watched: .6万
Uphost:善言_ ,Rank: Video Watched: .9万
Uphost:左手QAQ ,Rank: Video Watched: .3万
Uphost:S7全球总决赛 ,Rank: Video Watched: .2万
Uphost:Pino一米八 ,Rank: Video Watched: .2万
Uphost:金三炮o金三岁 ,Rank: Video Watched:
Uphost:挽神z ,Rank: Video Watched:
Uphost:易小埋l ,Rank: Video Watched:
Uphost:主播毕老实 ,Rank: Video Watched:
Uphost:一剑西来QAQ ,Rank: Video Watched:
Uphost:英雄联盟活动直播间 ,Rank: Video Watched:
Uphost:超级提莫丶牛腩君 ,Rank: Video Watched:
Uphost:mid六安王 ,Rank: Video Watched:
Uphost:熊猫丶乐鱼阿卡丽 ,Rank: Video Watched:
Uphost:熊猫TV一休哥 ,Rank: Video Watched:
Uphost:小黑胖砸 ,Rank: Video Watched:
Uphost:或许这就是离岛吧 ,Rank: Video Watched:
Uphost:第一最寂寞1u ,Rank: Video Watched:
Uphost:李阿特 ,Rank: Video Watched:
Uphost:LOL日常活动直播间 ,Rank: Video Watched:
Uphost:LPL熊猫官方直播 ,Rank: Video Watched:
Uphost:熊猫TV丶小青龙 ,Rank: Video Watched:
Uphost:熊猫TV灬小豆豆 ,Rank: Video Watched:
Uphost:小啊雅大大大 ,Rank: Video Watched:
Uphost:小凯南zz ,Rank: Video Watched:
Uphost:拿铁不加糖 ,Rank: Video Watched:
Uphost:金克喵的猫珥朵丶 ,Rank: Video Watched:
Uphost:炽天使z1 ,Rank: Video Watched:
Uphost:小小小女人丶 ,Rank: Video Watched:
Uphost:東東東 ,Rank: Video Watched:
Uphost:纯纯小流_氓 ,Rank: Video Watched:
Uphost:熊猫tv芭比公主 ,Rank: Video Watched:
Uphost:big火鸡 ,Rank: Video Watched:
Uphost:机器猫mmm ,Rank: Video Watched:
Uphost:大家都叫我冷爷丶 ,Rank: Video Watched:
Uphost:栗子菌i ,Rank: Video Watched:
Uphost:星矢魔术 ,Rank: Video Watched:
Uphost:唐人leo ,Rank: Video Watched:
Uphost:十级浪 ,Rank: Video Watched:
Uphost:筱兮QAQ ,Rank: Video Watched:
Uphost:酥软迷妹小慢慢Zz ,Rank: Video Watched:
Uphost:小凡Aaaaaa ,Rank: Video Watched:
Uphost:小丸子爱吃樱桃丶 ,Rank: Video Watched:
Uphost:爱流血的兔斯基 ,Rank: Video Watched:
Uphost:凶残的喵绵绵 ,Rank: Video Watched:
Uphost:别叫凯隐叫隐神 ,Rank: Video Watched:
Uphost:Panda初心2018 ,Rank: Video Watched:
Uphost:熊猫丶大风6 ,Rank: Video Watched:
Uphost:顽皮ssssssssssss ,Rank: Video Watched:
Uphost:大表哥响尾蛇 ,Rank: Video Watched:
Uphost:告白White ,Rank: Video Watched:
Uphost:牌面之王丶火影劫 ,Rank: Video Watched:
Uphost:西湖仙境 ,Rank: Video Watched:
Uphost:飞不起来1 ,Rank: Video Watched:
Uphost:逗了个蛋 ,Rank: Video Watched:
Uphost:瓜皮球球 ,Rank: Video Watched:
Uphost:竹蜻蜓呀 ,Rank: Video Watched:
Uphost:少年阿超和阿斌 ,Rank: Video Watched:
Uphost:刚出土的i帕帕 ,Rank: Video Watched:
Uphost:小主播安旭 ,Rank: Video Watched:
Uphost:西决哟 ,Rank: Video Watched:
Uphost:Panda丶夏木 ,Rank: Video Watched:
Uphost:冰雪丶狐狸 ,Rank: Video Watched:
Uphost:夜魅丝 ,Rank: Video Watched:
Uphost:熊猫丶皮皮瓜 ,Rank: Video Watched:
Uphost:Panda灬刀刀 ,Rank: Video Watched:
Uphost:莫莫莫夏夏夏 ,Rank: Video Watched:
Uphost:皮皮翔i ,Rank: Video Watched:
Uphost:南表妹QAQ ,Rank: Video Watched:
Uphost:青蛙OB ,Rank: Video Watched:
Uphost:_Infi_ ,Rank: Video Watched:
Uphost:暴躁茹阿姨 ,Rank: Video Watched:
Uphost:整天打碟的DJ胖丶 ,Rank: Video Watched:
Uphost:熊猫丶一百 ,Rank: Video Watched:
Uphost:全蛋狮子喵 ,Rank: Video Watched:
Uphost:熊猫TV丶小66 ,Rank: Video Watched:
Uphost:电竞张全蛋长长 ,Rank: Video Watched:
Uphost:熊猫第一不亏哥 ,Rank: Video Watched:
Uphost:叫我东邪 ,Rank: Video Watched:
Uphost:熊猫TV丶一手绝 ,Rank: Video Watched:
Uphost:熊猫TV丶别勉强 ,Rank: Video Watched:
Uphost:提莫的小女朋友 ,Rank: Video Watched:
Uphost:王者蕾 ,Rank: Video Watched:
Uphost:日暮哟 ,Rank: Video Watched:
Uphost:颖妹er超甜的 ,Rank: Video Watched:
Uphost:熊猫TV丶成小七 ,Rank: Video Watched:
Uphost:熊猫tv丶马小越 ,Rank: Video Watched:
Uphost:柒柒天 ,Rank: Video Watched:
Uphost:Panda电竞白子画 ,Rank: Video Watched:
Uphost:熊猫TV_苏璞 ,Rank: Video Watched:
Uphost:你的小老虎哥哥 ,Rank: Video Watched:
Uphost:门徒zzzz ,Rank: Video Watched:
Uphost:李易钧 ,Rank: Video Watched:
Uphost:熊猫TV丶农药术士 ,Rank: Video Watched:
Uphost:熊猫贝乐 ,Rank: Video Watched:
Uphost:李小青盲僧 ,Rank: Video Watched:
Uphost:刘慕宸 ,Rank: Video Watched:
Uphost:寒风强袭 ,Rank: Video Watched:
Uphost:会蛙泳的饼干0 ,Rank: Video Watched:
Uphost:阿四德莱文丶 ,Rank: Video Watched:
Uphost:知道神龙摆尾吗 ,Rank: Video Watched:
Uphost:瓦罗兰的未来丶尨 ,Rank: Video Watched:
Uphost:JO丶欣欣 ,Rank: Video Watched:
Uphost:123ivan456 ,Rank: Video Watched:
Uphost:only丶提莫 ,Rank: Video Watched:
Uphost:情话好听但不暖心 ,Rank: Video Watched:
Uphost:小丸子真好吃 ,Rank: Video Watched:
Uphost:一只提莫送你回家 ,Rank: Video Watched:
Uphost:请叫我大腿岩丶 ,Rank: Video Watched:
Uphost:伊人芳泽瑞尔心i ,Rank: Video Watched:
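The listing shows counts written in units of 万 (ten thousand) next to plain counts, so a sort key has to turn both forms into comparable numbers, decimal part included. The sketch below shows one way to do that. `parse_view_count` and `rank` are hypothetical helpers, not part of the Spider class, and the sample rows are invented.

```python
import re


def parse_view_count(text):
    """Convert a display string such as '1.5万' or '8042' to an integer.

    Hypothetical helper; returns 0 when the string holds no number.
    """
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0
    value = float(match.group())
    if '万' in text:
        value *= 10000  # 万 means ten thousand
    # round() absorbs small floating-point error from the multiplication
    return round(value)


def rank(rows):
    """Return (rank, row) pairs, highest viewer count first."""
    ordered = sorted(rows,
                     key=lambda row: parse_view_count(row['number']),
                     reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == '__main__':
    # Invented sample data, for illustration only.
    sample = [
        {'name': 'streamer_a', 'number': '8042'},
        {'name': 'streamer_b', 'number': '1.5万'},
        {'name': 'streamer_c', 'number': '1.2万'},
    ]
    for position, row in rank(sample):
        print(position, row['name'], parse_view_count(row['number']))
```

Matching only `\d+` would read both `1.5万` and `1.2万` as 1, which leaves their relative order to chance; capturing the optional decimal part avoids that.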