Scraping the Dota 2 Tieba with Scrapy and Analyzing the Data

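I have long been curious which words Tieba users have used most over the years. So let's scrape the titles of the posts in the forum and analyze them to see what people are talking about.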

First, we use Scrapy to scrape the titles of all threads in the forum:

scrapy startproject btspider

cd btspider

scrapy genspider -t basic btspiderx tieba.baidu.com

Edit the generated spider, btspiderx (btspider/spiders/btspiderx.py):

# -*- coding: utf-8 -*-
import scrapy

from btspider.items import BtspiderItem


class BTSpider(scrapy.Spider):
    name = "btspider"
    allowed_domains = ["baidu.com"]

    # Build one URL per list page; the pn offset advances by 50 per page.
    # range() replaces xrange() so the spider runs on both Python 2 and 3.
    start_urls = []
    for x in range(91320):
        if x == 0:
            url = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8"
        else:
            url = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8&pn=" + str(x * 50)
        start_urls.append(url)

    def parse(self, response):
        # One div per thread row; the class value is matched exactly,
        # including its trailing space.
        for sel in response.xpath('//div[@class="col2_right j_threadlist_li_right "]'):
            item = BtspiderItem()
            # extract() returns a list of matches, possibly empty.
            item['title'] = sel.xpath('div/div/a/text()').extract()
            item['link'] = sel.xpath('div/div/a/@href').extract()
            item['time'] = sel.xpath(
                'div/div/span[@class="threadlist_reply_date pull_right j_reply_data"]/text()').extract()
            yield item
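The pagination scheme used to build start_urls can be looked at in isolation. The sketch below is not part of the project: page_url is a hypothetical helper that reproduces the same URLs, assuming (as the spider's loop does) that the pn offset grows by 50 per list page.

```python
BASE_URL = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8"


def page_url(page_index, step=50):
    """Return the list-page URL for a zero-based page index (hypothetical helper)."""
    if page_index == 0:
        # The first page carries no pn parameter.
        return BASE_URL
    return BASE_URL + "&pn=" + str(page_index * step)


if __name__ == "__main__":
    # Print the first three list-page URLs.
    for i in range(3):
        print(page_url(i))
```

Generating URLs on demand like this (for example from a start_requests method) would avoid holding all 91,320 strings in a class attribute, though the list approach works too.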

Edit items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BtspiderItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    time = scrapy.Field()

In practice, only the title is saved here.

Edit pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs


class BtspiderPipeline(object):
    def __init__(self):
        # Titles are written one per line to a UTF-8 file named "info".
        self.file = codecs.open('info', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # extract() yields a list; skip rows with no title rather than
        # failing on an unassigned variable.
        titlex = item["title"]
        if len(titlex) != 0:
            # Only the title is saved; link and time are scraped but not written.
            self.file.write(titlex[0] + '\n')
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not spider_closed) on pipelines.
        self.file.close()

Edit settings.py:

BOT_NAME = 'btspider'

SPIDER_MODULES = ['btspider.spiders']
NEWSPIDER_MODULE = 'btspider.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
   'btspider.pipelines.BtspiderPipeline': 300,
}
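With 91,320 list pages queued, it may be worth slowing the crawl down. The fragment below is an optional addition to settings.py and is not something the original project sets. The setting names are standard Scrapy settings, but the values are illustrative assumptions.

```python
# Optional additions to settings.py (values are illustrative assumptions)
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
```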

Start the crawler:

scrapy crawl btspider

All titles are saved to a file named info.

Once the crawl finishes, we analyze the contents of the info file.
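For a quick textual answer to "which words appear most", a plain frequency count is enough. The sketch below is an assumption-laden illustration rather than part of the original workflow: top_words is a hypothetical helper, and the default tokenizer simply splits on whitespace.

```python
from collections import Counter


def top_words(lines, tokenizer=None, stopwords=(), n=10):
    """Count tokens across lines and return the n most common (word, count) pairs.

    tokenizer is any callable mapping a string to an iterable of tokens;
    it defaults to whitespace splitting (an assumption for this sketch).
    """
    if tokenizer is None:
        tokenizer = str.split
    stop = set(stopwords)
    counts = Counter()
    for line in lines:
        for token in tokenizer(line):
            token = token.strip()
            # Ignore empty tokens and anything on the stopword list.
            if token and token not in stop:
                counts[token] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["ti8 tickets wanted", "ti8 predictions", "new patch notes"]
    print(top_words(sample, n=1))
```

For Chinese titles, a segmenter such as jieba.lcut can be passed as tokenizer, with lines read from the info file.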

There is an example project on GitHub that works with a few changes:

git clone https://github.com/FantasRu/WordCloud.git

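Copy the info file into the project's demo directory (the script reads ./demo/info), then modify main.py as follows: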

# coding: utf-8
import io

from wordcloud import WordCloud
import jieba


class WordCloud_CN:
    '''
    use package wordcloud and jieba
    generating wordcloud for chinese character
    '''

    def __init__(self, stopwords_file, text_file):
        self.stopwords_file = stopwords_file
        # text_file is passed in explicitly instead of read from a global.
        self.text_file = text_file

    @property
    def get_stopwords(self):
        # One stopword per line; blank lines are skipped.
        # io.open decodes UTF-8 on both Python 2 and 3.
        self.stopwords = {}
        with io.open(self.stopwords_file, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.rstrip()
                if line:
                    self.stopwords[line] = 1
        return self.stopwords

    @property
    def seg_text(self):
        with io.open(self.text_file, 'r', encoding='utf-8') as f:
            text = u' '.join(f.readlines())
        # Load the stopwords once; reading the property inside the
        # comprehension would re-read the file for every token.
        stopwords = self.get_stopwords
        self.seg_list = [
            i for i in jieba.cut(text) if i not in stopwords and i != u' ']
        # WordCloud expects a single space-separated string.
        self.seg_list = u' '.join(self.seg_list)
        return self.seg_list

    def show(self):
        # A font with Chinese glyphs is required to render the words.
        wordcloud = WordCloud(font_path=u'./static/simheittf/simhei.ttf',
                              background_color="black", margin=5, width=1800, height=800)
        wordcloud = wordcloud.generate(self.seg_text)
        # Saves the image as ./demo/<text file name>.jpg
        wordcloud.to_file("./demo/" + self.text_file.split('/')[-1] + '.jpg')


if __name__ == '__main__':
    stopwords_file = u'./static/stopwords.txt'
    text_file = u'./demo/info'
    generater = WordCloud_CN(stopwords_file, text_file)
    generater.show()

Then start the analysis:

python main.py

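Because the data set is large, the analysis takes a long time. You can run it in the background on a cheap single-core cloud server and simply wait for the result.

Below are the word cloud images I generated for two popular game forums on Tieba.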
