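In Scrapy, every request has a fingerprint, which is used to decide whether that URL has already been requested. Duplicate filtering is enabled by default, so each URL is requested only once. When we crawl from several machines in a distributed setup, we also need fingerprints so that the crawlers do not collect duplicate data. However, Scrapy keeps its fingerprints locally by default, so they cannot be shared between machines. We can therefore store the fingerprints in Redis and use a Redis set to check for duplicates.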
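
To make the idea concrete, here is a minimal sketch of fingerprint-based deduplication. It assumes a simplified fingerprint (SHA-1 over the method and the URL with sorted query parameters), which is not necessarily the exact algorithm Scrapy uses. The Redis connection is replaced by a hypothetical in-memory `FakeRedisSet` class whose `sadd` mimics the return value of the Redis SADD command (1 for a new member, 0 for an existing one). The key name `bilibili:dupefilter` is also only an illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class FakeRedisSet:
    """In-memory stand-in for a Redis connection (assumption, for illustration only)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Like Redis SADD: return 1 if the member was added, 0 if it already existed.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def request_fingerprint(method, url):
    """Simplified fingerprint: hash the method and the URL with sorted query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))
    return hashlib.sha1((method.upper() + ' ' + canonical).encode('utf-8')).hexdigest()


def request_seen(server, key, method, url):
    """Return True if an equivalent request was already recorded in the shared set."""
    added = server.sadd(key, request_fingerprint(method, url))
    return added == 0


server = FakeRedisSet()
key = 'bilibili:dupefilter'  # hypothetical key name
first = request_seen(server, key, 'GET', 'https://api.bilibili.com/x/relation/stat?vmid=1&jsonp=jsonp')
# Same request with the query parameters in a different order.
second = request_seen(server, key, 'GET', 'https://api.bilibili.com/x/relation/stat?jsonp=jsonp&vmid=1')
print(first, second)
```

Because every worker would talk to the same set, a URL recorded by one machine is reported as seen on all the others.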

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for bilibili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bilibili'

SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
    'bilibili.middlewares.randomUserAgentMiddleware': 400
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Use the scrapy_redis scheduler and duplicate filter so that state is shared through Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://@127.0.0.1:6379'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
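
With these settings the request queue and the duplicate filter live in Redis instead of inside each process, so every worker pulls its requests from the same place. The sketch below imitates that idea with a single in-memory priority queue shared by two hypothetical workers. `SharedQueue`, `push`, `pop` and the worker names are illustrative assumptions and not the scrapy_redis API.

```python
import heapq
import itertools


class SharedQueue:
    """In-memory stand-in for a queue that all workers share (assumption)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, priority=0):
        # heapq is a min-heap, so negate the priority to pop high priorities first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = SharedQueue()
queue.push('https://space.bilibili.com/1/', priority=0)
queue.push('https://space.bilibili.com/2/', priority=10)
queue.push('https://space.bilibili.com/3/', priority=0)

# Two workers take turns; each URL is handed out exactly once.
handled = {'worker-a': [], 'worker-b': []}
for worker in itertools.cycle(handled):
    url = queue.pop()
    if url is None:
        break
    handled[worker].append(url)
print(handled)
```

The point of the design is that no worker keeps its own list of pending URLs, so adding another machine only means starting one more consumer of the same queue.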

  spider.py

# -*- coding: utf-8 -*-
import json
import re

import scrapy

from bilibili.items import BilibiliItem

# Maps each upload category name returned by the API to the item field it sets.
CATEGORY_FIELDS = {
    '动画': 'animation',
    '生活': 'Life',
    '音乐': 'Music',
    '游戏': 'Game',
    '舞蹈': 'Dance',
    '纪录片': 'Documentary',
    '鬼畜': 'Ghost',
    '科技': 'science',
    '番剧': 'Opera',
    '娱乐': 'entertainment',
    '影视': 'Movies',
    '国创': 'National',
    '数码': 'Digital',
    '时尚': 'fashion',
}


class BilibiliappSpider(scrapy.Spider):
    name = 'bilibiliapp'
    # allowed_domains = ['www.bilibili.com']
    # start_urls = ['http://www.bilibili.com/']

    def start_requests(self):
        for i in range(1, 300):
            url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)
            url_ajax = 'https://space.bilibili.com/{}/'.format(i)
            # A GET request takes this form: scrapy.Request(url=, callback=)
            req = scrapy.Request(url=url, callback=self.parse, meta={'id': i})
            req.headers['referer'] = url_ajax
            yield req

    def parse(self, response):
        # The response is JSONP, so extract the JSON object between the braces.
        comm = re.compile(r'({.*})')
        text = re.findall(comm, response.text)[0]
        data = json.loads(text)
        follower = data['data']['follower']
        following = data['data']['following']
        user_id = response.meta.get('id')
        url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(user_id)
        yield scrapy.Request(url=url, callback=self.getsubmit, meta={
            'id': user_id,
            'follower': follower,
            'following': following
        })

    def getsubmit(self, response):
        data = json.loads(response.text)
        tlist = data['data']['tlist']
        tlist_list = []
        if tlist != []:
            # Collect the names of the categories the user has uploaded to.
            for entry in tlist.values():
                tlist_list.append(entry['name'])
        else:
            tlist_list = ['无爱好']  # placeholder value meaning "no interests"
        follower = response.meta.get('follower')
        following = response.meta.get('following')
        user_id = response.meta.get('id')
        url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(user_id)
        yield scrapy.Request(url=url, callback=self.space, meta={
            'id': user_id,
            'follower': follower,
            'following': following,
            'tlist_list': tlist_list
        })

    def space(self, response):
        data = json.loads(response.text)
        item = BilibiliItem()
        item['name'] = data['data']['name']
        item['sex'] = data['data']['sex']
        item['level'] = data['data']['level']
        item['birthday'] = data['data']['birthday']
        item['follower'] = response.meta.get('follower')
        item['following'] = response.meta.get('following')
        # Every category flag starts at 0 and becomes 1 if the user has uploads in that category.
        for field in CATEGORY_FIELDS.values():
            item[field] = 0
        for tlist in response.meta.get('tlist_list'):
            if tlist in CATEGORY_FIELDS:
                item[CATEGORY_FIELDS[tlist]] = 1
        yield item
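
The `parse` callback receives a JSONP response (the URL asks for `callback=__jp3`), so the JSON has to be cut out of its wrapper before it can be loaded. Below is a minimal sketch of that step on its own. The sample payload and the helper name `unwrap_jsonp` are made up for illustration and do not claim to be real API output.

```python
import json
import re

# Greedy match from the first "{" to the last "}" in the text.
JSON_OBJECT = re.compile(r'(\{.*\})', re.DOTALL)


def unwrap_jsonp(text):
    """Return the JSON object embedded in a JSONP wrapper such as __jp3({...})."""
    match = JSON_OBJECT.search(text)
    if match is None:
        raise ValueError('no JSON object found in response')
    return json.loads(match.group(1))


sample = '__jp3({"code": 0, "data": {"follower": 12, "following": 3}})'  # invented sample
data = unwrap_jsonp(sample)
print(data['data']['follower'], data['data']['following'])
```

Raising an explicit error when nothing matches makes a changed response format easier to spot than an `IndexError` from indexing an empty `findall` result.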

Setting up a User-Agent pool

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class randomUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent from the pool for each request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # Pool of User-Agent strings; every entry must end with a comma,
    # otherwise Python concatenates adjacent string literals into one.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
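
The middleware relies on `setdefault`, which only fills in the header when none is present yet. The sketch below shows that behaviour with a plain dict standing in for the request headers. `choose_user_agent`, the shortened pool and the dict are illustrative assumptions, not Scrapy objects.

```python
import random

# Shortened pool, for illustration only.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def choose_user_agent(headers, pool, rng=random):
    """Set a random User-Agent unless the headers already carry one."""
    headers.setdefault('User-Agent', rng.choice(pool))
    return headers


fresh = choose_user_agent({}, USER_AGENT_POOL)
preset = choose_user_agent({'User-Agent': 'custom-agent'}, USER_AGENT_POOL)
print(fresh['User-Agent'] in USER_AGENT_POOL, preset['User-Agent'])
```

A request that already carries its own User-Agent keeps it, while every other request gets a random entry from the pool.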

Git repository: https://github.com/18370652038/scrapy-bilibili
