scrapy-redis: a distributed crawler for Bilibili user data

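In Scrapy, every request has a fingerprint, which is used to decide whether that URL has already been requested. Fingerprint-based deduplication is enabled by default, so each URL is requested only once. When we crawl from several machines at once, we still need fingerprints to keep the crawlers from collecting duplicate data. Scrapy's default fingerprints, however, are kept locally on each machine. So we store the fingerprints in Redis instead and use a Redis set to check whether a request is a duplicate.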
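A minimal sketch of that idea, not the scrapy-redis implementation: hash each request into a fingerprint and let set-add semantics tell new requests from seen ones. `FakeRedisSet` is a hypothetical in-memory stand-in whose `sadd` mimics the Redis `SADD` command (1 if the member was added, 0 if it was already there); against a real server you would call `sadd` on a Redis client instead. The fingerprint here hashes only the method and URL, which is simpler than what Scrapy actually computes, and the key name is illustrative.

```python
import hashlib


class FakeRedisSet:
    """Hypothetical in-memory stand-in for a Redis set (not a real client)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Mimics Redis SADD: 1 if the member is new, 0 if already present.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def fingerprint(method, url):
    # Simplified fingerprint: SHA-1 over the HTTP method and URL only.
    raw = "{} {}".format(method.upper(), url)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def seen_before(store, key, method, url):
    # The request is a duplicate when SADD reports that nothing was added.
    return store.sadd(key, fingerprint(method, url)) == 0


store = FakeRedisSet()
key = "bilibiliapp:dupefilter"  # illustrative key name
url = "https://api.bilibili.com/x/relation/stat?vmid=1"
first = seen_before(store, key, "GET", url)   # first time this request is seen
second = seen_before(store, key, "GET", url)  # same fingerprint again
```

Because every worker talks to the same set, a request recorded by one machine is recognized as a duplicate by all the others.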

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for bilibili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bilibili'

SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
    # Random User-Agent pool (defined in middlewares.py)
    'bilibili.middlewares.randomUserAgentMiddleware': 400,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Distinct priorities make the order explicit: the project pipeline runs
# first, then the item is pushed to Redis.
ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# scrapy-redis: schedule requests through Redis so all workers share one queue.
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# scrapy-redis: keep request fingerprints in Redis so deduplication is shared.
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Redis instance used by every worker (no credentials).
REDIS_URL = 'redis://127.0.0.1:6379'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
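With these settings, every worker pointed at the same Redis instance shares one request queue and one fingerprint set. To inspect them, it helps to know the key names. The sketch below formats the key patterns that, to my knowledge, scrapy-redis uses by default; treat the patterns as an assumption and check them against the version you have installed. `redis_keys` is a hypothetical helper, not part of the library.

```python
# Assumed scrapy-redis default key patterns; verify against your installed version.
DEFAULT_KEY_PATTERNS = {
    "requests": "%(spider)s:requests",      # shared request queue
    "dupefilter": "%(spider)s:dupefilter",  # set of request fingerprints
    "items": "%(spider)s:items",            # items pushed by RedisPipeline
}


def redis_keys(spider_name):
    """Return the Redis key each role would use for the given spider name."""
    return {
        role: pattern % {"spider": spider_name}
        for role, pattern in DEFAULT_KEY_PATTERNS.items()
    }


keys = redis_keys("bilibiliapp")
for role, key in sorted(keys.items()):
    print(role, "->", key)
```

While a crawl is running, you can look at these keys with `redis-cli` to see how many requests are queued and how many fingerprints have been recorded.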

spider.py

# -*- coding: utf-8 -*-
import json
import re

import scrapy

from bilibili.items import BilibiliItem


class BilibiliappSpider(scrapy.Spider):
    name = 'bilibiliapp'
    # allowed_domains = ['www.bilibili.com']
    # start_urls = ['http://www.bilibili.com/']

    def start_requests(self):
        # Crawl user IDs 1 to 299.
        for i in range(1, 300):
            url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)
            url_ajax = 'https://space.bilibili.com/{}/'.format(i)
            # For a GET request, use scrapy.Request(url=..., callback=...).
            req = scrapy.Request(url=url, callback=self.parse, meta={'id': i})
            # Send the user's space page as the Referer header.
            req.headers['referer'] = url_ajax
            yield req

    def parse(self, response):
        # The response is JSONP, e.g. __jp3({...}); pull out the JSON object inside.
        text = re.findall(r'(\{.*\})', response.text)[0]
        data = json.loads(text)
        follower = data['data']['follower']
        following = data['data']['following']
        user_id = response.meta.get('id')
        url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(user_id)
        # Pass the values collected so far on to the next callback.
        yield scrapy.Request(url=url, callback=self.getsubmit, meta={
            'id': user_id,
            'follower': follower,
            'following': following,
        })

    def getsubmit(self, response):
        data = json.loads(response.text)
        # 'tlist' holds the categories of the user's uploads; the code treats
        # an empty value as "no categories".
        tlist = data['data']['tlist']
        if tlist:
            tlist_list = [category['name'] for category in tlist.values()]
        else:
            # Placeholder meaning "no interests"; it matches no category later on.
            tlist_list = ['无爱好']
        follower = response.meta.get('follower')
        following = response.meta.get('following')
        user_id = response.meta.get('id')
        url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(user_id)
        yield scrapy.Request(url=url, callback=self.space, meta={
            'id': user_id,
            'follower': follower,
            'following': following,
            'tlist_list': tlist_list,
        })

    def space(self, response):
        data = json.loads(response.text)
        info = data['data']
        # Maps each category name returned by the API to its item field.
        category_fields = {
            '动画': 'animation',
            '生活': 'Life',
            '音乐': 'Music',
            '游戏': 'Game',
            '舞蹈': 'Dance',
            '纪录片': 'Documentary',
            '鬼畜': 'Ghost',
            '科技': 'science',
            '番剧': 'Opera',
            '娱乐': 'entertainment',
            '影视': 'Movies',
            '国创': 'National',
            '数码': 'Digital',
            '时尚': 'fashion',
        }
        tlist_list = response.meta.get('tlist_list')

        item = BilibiliItem()
        item['name'] = info['name']
        item['sex'] = info['sex']
        item['level'] = info['level']
        item['birthday'] = info['birthday']
        item['follower'] = response.meta.get('follower')
        item['following'] = response.meta.get('following')
        # One 0/1 flag per category: 1 if the user has uploads in it.
        for category, field in category_fields.items():
            item[field] = 1 if category in tlist_list else 0
        yield item
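The first callback of the spider has to unwrap a JSONP response, because the request URL carries `callback=__jp3`. Here is that step in isolation, as a small sketch: `strip_jsonp` is a hypothetical helper, and the sample payload is made up for illustration rather than captured from the API.

```python
import json
import re

# Matches the outermost {...} object inside a JSONP wrapper such as __jp3({...}).
JSONP_BODY = re.compile(r"\{.*\}", re.S)


def strip_jsonp(text):
    """Return the JSON object embedded in a JSONP (or plain JSON) response."""
    match = JSONP_BODY.search(text)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


# Made-up sample in the same shape the spider expects.
sample = '__jp3({"code": 0, "data": {"follower": 12, "following": 3}})'
data = strip_jsonp(sample)
follower = data["data"]["follower"]
following = data["data"]["following"]
```

Raising a clear error when nothing matches is easier to debug than the `IndexError` you get from indexing an empty `re.findall` result.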

Setting up a User-Agent pool (middlewares.py)

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class randomUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent for every request that does not have one yet.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # Pool of User-Agent strings to choose from (one per entry).
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
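One thing to watch in a long list of string literals like this pool: Python joins adjacent literals that are not separated by a comma, so a single missing comma silently merges two User-Agent strings into one invalid entry. The sketch below shows the effect with short placeholder strings; `pick_user_agent` is a hypothetical helper that mirrors the `random.choice` call in the middleware.

```python
import random

# A missing comma after the first literal merges it with the second one.
broken_pool = [
    "agent-a"  # no comma here
    "agent-b",
    "agent-c",
]

# With every entry terminated by a comma, the pool has three separate strings.
fixed_pool = [
    "agent-a",
    "agent-b",
    "agent-c",
]


def pick_user_agent(pool, rng=random):
    """Return one randomly chosen User-Agent string from the pool."""
    return rng.choice(pool)


print(len(broken_pool), len(fixed_pool))
chosen = pick_user_agent(fixed_pool, random.Random(0))
```

Ending every entry with a comma, including the last, avoids this class of mistake when entries are added or reordered.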

Git repository: https://github.com/18370652038/scrapy-bilibili
