【Python】理想论坛每小时发帖量统计图表

写以下代码的目的是分析一天中各时段理想论坛中用户发帖回帖的活跃程度，获得结尾那张图表是核心。

以下代码两种爬虫协助，论坛爬虫先爬主贴，爬到主贴后启动帖子爬虫爬子贴，然后把每个子贴的发表时间等存入数据库。

再用一个程序对各个时段中发帖次数进行统计，然后用Excel生产图表。

获取数据的爬虫代码如下：

# 论坛爬虫，用于爬取主贴再爬子贴
from bs4 import BeautifulSoup
import requests
import threading
import re
import pymysql

user_agent='Mozilla/4.0 (compatible;MEIE 5.5;windows NT)'
headers={'User-Agent':user_agent}

# 论坛爬虫类（多线程）
class forumCrawler(threading.Thread):
    def __init__(self,name,url):
        threading.Thread.__init__(self,name=name)
        self.name=name
        self.url=url
        self.infos=[]

    def run(self):
        print("线程"+self.name+"开始爬取页面"+self.url);

        try:
            rsp=requests.get(self.url,headers=headers)
            soup= BeautifulSoup(rsp.text,'html.parser',from_encoding='utf-8')
            #print(rsp.text); # rsp.text是全文

            # 找出span
            for spans in soup.find_all('span',class_="forumdisplay"):
                #找出link
                for link in spans.find_all('a'):
                    if link and link.get("href"):
                        #print(link.get("href"))
                        #print(link.text+'\n')
                        topicLink="http://www.55188.com/"+link.get("href")

                        tc=topicCrawler(name=self.name+'_tc#'+link.get("href"),url=topicLink)
                        tc.start()

        except Exception as e:
            print("线程"+self.name+"发生异常。")# 不管怎么出现的异常，就让它一直爬到底
            print(e);

# 帖子爬虫类（多线程）
class topicCrawler(threading.Thread):
    def __init__(self,name,url):
        threading.Thread.__init__(self,name=name)
        self.name=name
        self.url=url
        self.infos=[]

    def run(self):
        while(self.url!="none"):
            print("线程"+self.name+"开始爬取页面"+self.url);

            try:
                rsp=requests.get(self.url,headers=headers)
                self.url="none"#用完之后置空，看下一页能否取到值
                soup= BeautifulSoup(rsp.text,'html.parser',from_encoding='utf-8')
                #print(rsp.text); # rsp.text是全文

                # 找出一页里每条发言
                for divs in soup.find_all('div',class_="postinfo"):
                    #print(divs.text) # divs.text包含作者和发帖时间的文字

                    # 用正则表达式将多个空白字符替换成一个空格
                    RE = re.compile(r'(\s+)')
                    line=RE.sub(" ",divs.text)

                    arr=line.split(' ')

                    #print(len(arr))
                    arrLength=len(arr)

                    if arrLength==7:
                        info={'楼层':arr[1],
                              '作者':arr[2].replace('只看：',''),
                              '日期':arr[4],
                              '时间':arr[5]}
                        self.infos.append(info);
                    elif arrLength==8:
                        info={'楼层':arr[1],
                              '作者':arr[2].replace('只看：',''),
                              '日期':arr[5],
                              '时间':arr[6]}
                        self.infos.append(info);

                #找下一页所在地址
                for pagesDiv in soup.find_all('div',class_="pages"):
                    for strong in pagesDiv.find_all('strong'):
                        print('当前为第'+strong.text+'页')

                        # 找右边的兄弟节点
                        nextNode=strong.next_sibling
                        if nextNode and nextNode.get("href"): # 右边的兄弟节点存在，且其有href属性
                            #print(nextNode.get("href"))
                            self.url='http://www.55188.com/'+nextNode.get("href")

                if self.url!="none":
                    print("有下一页，线程"+self.name+"前往下一页")
                    continue
                else:
                    print("无下一页，线程"+self.name+'爬取结束，开始打印...')

                    for info in self.infos:
                        print('\n')
                        for key in info:
                            print(key+":"+info[key])

                    print("线程"+self.name+'打印结束.')

                    insertDB(self.name,self.infos)

            except Exception as e:
                print("线程"+self.name+"发生异常。重新爬行")# 不管怎么出现的异常，就让它一直爬到底
                print(e);
                continue

# 数据库插值
def insertDB(crawlName,infos):
    conn=pymysql.connect(host=',db='test',charset='utf8')

    for info in infos:
        sql="insert into test.topic(floor,author,tdate,ttime,crawlername,addtime) values ('"+info['楼层']+"','"+info['作者']+"','"+info['日期']+"','"+info['时间']+"','"+crawlName+"',now() )"
        print(sql)
        conn.query(sql)

    conn.commit()# 写操作之后commit不可少
    conn.close()

# 入口函数
def main():
    for i in range(1,10):
        url='http://www.55188.com/forum-8-'+str(i)+'.html'
        tc=forumCrawler(name='fc#'+str(i),url=url)
        tc.start()

# 开始
main()

控制台输出太多就不贴了，把插入数据后的数据库展示一下，ttime字段就是想要获得的关键数据：

再做一个小程序对发帖时间进行统计，代码如下：

# 对发帖时间进行统计
import pymysql

# 入口函数
def main():
    dic={':0}

    conn=pymysql.connect(host=',db='test',charset='utf8')

    cs=conn.cursor()
    cs.execute("select * from topic")
    results = cs.fetchall()

    for row in results:
        ttime=row[4]
        hour=ttime.split(':')[0]
        dic[hour]=dic[hour]+1

    conn.close()

    print(dic)
# 开始
main()

输出字典如下：

C:\Users\horn1\Desktop\python\17>python sum.py
{': 388}

用Excel来个图形化看看：

从上图可以得出以下结论：

1.早上0点-6点是交易者最闲的时候，他们大部分都在睡觉，3-5点睡得最熟。

2.发帖峰值一个是在9-10点，一个是15-16点。股市在9：30开盘，低开高开也出来了，行情也走了一段，大家开始热情高涨了发帖，之后就逐步回落，午休落入低谷，下午15点收盘后，大家又争相发表对这天行情的看法，但能一天走势能谈多少，于是一个小时就消停了。另外15-16点也是股评家发表股评的黄金时段。

3.入夜了，虽然早已收盘，大家依旧在浏览论坛，希望从帖子里发现什么或者讨论什么，直到23-24点还有不少人从事这项活动。

呵呵，我也玩票了一把数据分析。

2018年4月4日16点14分

【Python】理想论坛每小时发帖量统计图表的更多相关文章

【ichartjs】爬取理想论坛前30页帖子获得每个子贴的发帖时间，总计83767条数据进行统计，生成统计图表
统计数据如下: {': 2451} 图形化后效果如下: 源码: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//E ...
【pyhon】理想论坛爬虫1.08
#------------------------------------------------------------------------------------ # 理想论坛爬虫1.08,用 ...
【Python】爬取理想论坛单帖爬虫
代码: # 单帖爬虫,用于爬取理想论坛帖子得到发帖人,发帖时间和回帖时间,url例子见main函数 from bs4 import BeautifulSoup import requests impo ...
【python】理想论坛爬虫长贴版1.00
理想论坛有些长贴,针对这些长贴做统计可以知道某ID什么时段更活跃. 爬虫代码为: #---------------------------------------------------------- ...
【python】理想论坛爬虫1.08
#------------------------------------------------------------------------------------ # 理想论坛爬虫1.08, ...
【python】理想论坛帖子爬虫1.06
昨天认识到在本期同时起一百个回调/线程后程序会崩溃,造成结果不可信. 于是决定用Python单线程操作,因为它理论上就用主线程跑不会有问题,只是时间长点. 写好程序后,测试了一中午,210个主贴,11 ...
【Python】理想论坛帖子读取爬虫1.04版
1.01-1.03版本都有多线程争抢DB的问题,线程数一多问题就严重了. 这个版本把各线程要添加数据的SQL放到数组里,等最后一次性完成,这样就好些了.但乱码问题和未全部完成即退出现象还在,而且速度上 ...
【Python】分析自己的博客 https://www.cnblogs.com/xiandedanteng/p/?page=XX,看每个月发帖量是多少
要执行下面程序,需要安装Beautiful Soup和requests,具体安装方法请见:https://www.cnblogs.com/xiandedanteng/p/8668492.html # ...
【Nodejs】理想论坛帖子爬虫1.01
用Nodejs把Python实现过的理想论坛爬虫又实现了一遍,但是怎么判断所有回调函数都结束没有好办法,目前的spiderCount==spiderFinished判断法在多页情况下还是会提前中止. ...

随机推荐

jquery实用的一些方法
做个购物车功能,需要修改下前端页面有些实用的方法总结一下当你想实现最基本的加减法的时候,对于转换number实用Number(str)即可首先明确下页面的每一行是动态的,这个时候绑定事件的时候不 ...
JavaSE1
<The Pragmatic Programmer><The Mythical Man-month><Clean Code><The Clean Coder& ...
shell top解析
top命令是Linux下常用的性能分析工具,能够实时显示系统中各个进程的资源占用状况,类似于Windows的任务管理器. top显示系统当前的进程和其他状况,是一个动态显示过程,即可以通过用户按键来不 ...
hashMap原理剖析
在日常开发中,hashMap应该算是比较常用的一个类了,今天就来学习一下hashMap的实现原理. 概念 1.什么时hash? 书面定义:就是把一个不固定长度的二进制值映射成一个固定长度的二进制值. ...
Process Explorer常用操作介绍
(未获得作者本人同意,严禁转载) Process Explorer出现的背景 Process Explorer可以看成是一个加强版的任务管理器.在较早的Windows版本中,任务管理器提供的功能是非常 ...
BZOJ 3231: [Sdoi2008]递归数列 (JZYZOJ 1353) 矩阵快速幂
http://www.lydsy.com/JudgeOnline/problem.php?id=3231 和斐波那契一个道理在最后加一个求和即可 #include<cstdio> #i ...
Codeforces Round #348 (VK Cup 2016 Round 2, Div. 2 Edition) C. Little Artem and Matrix 模拟
C. Little Artem and Matrix 题目连接: http://www.codeforces.com/contest/669/problem/C Description Little ...
CROC 2016 - Qualification B. Processing Queries 模拟
B. Processing Queries 题目连接: http://www.codeforces.com/contest/644/problem/B Description In this prob ...
git fetch, git pull 以及 FETCH_HEAD
git push. 这个很简单, 其实和后面的差不多, 这里就不讲了. 唯一需要注意的地方是: git push origin :branch2, 表示将一个内容为空的同名分支推送到远程的分支.(说白 ...
Java性能优化的9大工具
在这篇文章中,我会带着大家一起看一下9个可以帮助我们优化Java性能的工具.有一些我们已经在IDR Solutions中使用了,而另外一些有可能在个人项目中使用. NetBeans Profiler ...

【Python】理想论坛每小时发帖量统计图表

【Python】理想论坛每小时发帖量统计图表的更多相关文章

随机推荐

热门专题