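Scraping Panda TV (熊猫直播) streamer viewer counts with Python's standard library and regular expressions

The main aim is to show cleanly structured, conventional code.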

from urllib import request
import re


class Spider:
    """Scrape streamer names and viewer counts from one Panda TV category page."""

    url = 'https://www.panda.tv/cate/lol'
    # Raw strings keep backslash sequences such as \s intact for the regex engine.
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
    name_pattern = r'</i>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        # Download the page and decode the raw bytes as UTF-8.
        with request.urlopen(Spider.url) as r:
            htmls = r.read()
        return str(htmls, encoding='utf-8')

    def __analysis(self, htmls):
        # Cut out each streamer block, then pull the name and viewer count from it.
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            # Skip blocks where either field is missing.
            if name and number:
                anchors.append({'name': name, 'number': number})
        return anchors

    def __refine(self, anchors):
        # Each field is a list of matches; keep the first one, stripped of whitespace.
        return map(lambda anchor: {'name': anchor['name'][0].strip(),
                                   'number': anchor['number'][0].strip()},
                   anchors)

    def __sort(self, anchors):
        # Highest viewer count first.
        return sorted(anchors, key=self.__sort_seed, reverse=True)

    def __sort_seed(self, anchor):
        # Turn text such as '1.5万' or '8200' into a number; 万 means ten thousand.
        r = re.search(r'\d+(?:\.\d+)?', anchor['number'])
        number = float(r.group()) if r else 0.0
        if '万' in anchor['number']:
            number *= 10000
        return number

    def __show(self, anchors):
        for rank, anchor in enumerate(anchors, start=1):
            print('Rank: ' + str(rank) + '  Streamer: ' + anchor['name'] +
                  '--------' + 'Viewers: ' + anchor['number'])

    def go(self):
        # Entry point: fetch, parse, clean, sort, print.
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)
        print(len(anchors))


spider = Spider()
spider.go()
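
The parsing and sorting steps can also be tried without any network access, which helps if the page is unreachable or its markup has changed. The following is a minimal sketch, not part of the original post: it applies the same three patterns from the Spider class to a made-up HTML fragment and ranks the results. SAMPLE_HTML, the streamer names, and the helper names parse_anchors and to_count are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical markup shaped only to match the patterns used in Spider;
# it is NOT a capture of the real page.
SAMPLE_HTML = '''
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        Alice
    </span>
    <span class="video-number">1.5万</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        Bob
    </span>
    <span class="video-number">8200</span>
</div>
'''

ROOT_PATTERN = r'<div class="video-info">([\s\S]*?)</div>'
NAME_PATTERN = r'</i>([\s\S]*?)</span>'
NUMBER_PATTERN = r'<span class="video-number">([\s\S]*?)</span>'


def parse_anchors(html):
    """Return a list of {'name', 'number'} dicts, one per streamer block."""
    anchors = []
    for block in re.findall(ROOT_PATTERN, html):
        names = re.findall(NAME_PATTERN, block)
        numbers = re.findall(NUMBER_PATTERN, block)
        # Skip blocks where either field is missing.
        if names and numbers:
            anchors.append({'name': names[0].strip(),
                            'number': numbers[0].strip()})
    return anchors


def to_count(text):
    """Convert '1.5万' or '8200' to a float; 万 means ten thousand."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0
    count = float(match.group())
    if '万' in text:
        count *= 10000
    return count


# Rank the parsed streamers, highest viewer count first.
ranked = sorted(parse_anchors(SAMPLE_HTML),
                key=lambda anchor: to_count(anchor['number']),
                reverse=True)

for rank, anchor in enumerate(ranked, start=1):
    print(rank, anchor['name'], anchor['number'])
```

Matching the decimal part explicitly matters here: a pattern that stops at the first run of digits would read 1.5万 as 1 and rank that streamer below 8200.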
