In Scrapy, every requested URL gets a fingerprint, and that fingerprint is what decides whether the URL has already been requested. Fingerprinting is on by default, so each URL is fetched only once. When we crawl from several machines in a distributed setup, we still need a single fingerprint store so the crawlers don't collect duplicate data, but Scrapy keeps its fingerprints locally on each machine. The fix is to keep the fingerprints in Redis and use a Redis set to decide whether a request is a duplicate.
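The idea in miniature (this is a simplified sketch, not scrapy-redis's actual code; it assumes a local Redis on 127.0.0.1:6379 and uses a plain SHA1 of the method and URL, where scrapy-redis uses Scrapy's request fingerprint function):

import hashlib

import redis

r = redis.Redis(host='127.0.0.1', port=6379)

def seen_before(method, url, key='bilibiliapp:dupefilter'):
    # Simplified fingerprint: hash of method + URL. scrapy-redis's RFPDupeFilter
    # uses Scrapy's request_fingerprint(), which also normalizes the URL.
    fp = hashlib.sha1('{} {}'.format(method, url).encode('utf-8')).hexdigest()
    # SADD returns 0 if the member already existed, i.e. the request is a duplicate.
    return r.sadd(key, fp) == 0

print(seen_before('GET', 'https://api.bilibili.com/x/relation/stat?vmid=1'))  # False: first time
print(seen_before('GET', 'https://api.bilibili.com/x/relation/stat?vmid=1'))  # True: duplicate

Because SADD is atomic, any number of crawler processes can share the same set, and only the first one to add a given fingerprint actually downloads that request.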

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for bilibili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bilibili'

SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
    'bilibili.middlewares.randomUserAgentMiddleware': 400,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# scrapy-redis: share the scheduler, dupefilter and request queue through Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://@127.0.0.1:6379'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
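With these settings, every machine that points REDIS_URL at the same Redis server and runs scrapy crawl bilibiliapp shares one request queue and one fingerprint set. While a crawl is running, the shared state can be inspected with redis-py; the key names below assume scrapy-redis's usual defaults ('<spider>:dupefilter', '<spider>:requests', '<spider>:items'), so adjust them if your version differs:

import redis

r = redis.Redis(host='127.0.0.1', port=6379)

# Fingerprints of requests that have already been seen (a set)
print(r.scard('bilibiliapp:dupefilter'))
# Pending requests in the shared priority queue (a sorted set for PriorityQueue)
print(r.zcard('bilibiliapp:requests'))
# Items pushed by scrapy_redis.pipelines.RedisPipeline (a list)
print(r.llen('bilibiliapp:items'))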

  spider.py

# -*- coding: utf-8 -*-
import json
import re

import scrapy

from bilibili.items import BilibiliItem


class BilibiliappSpider(scrapy.Spider):
    name = 'bilibiliapp'
    # allowed_domains = ['www.bilibili.com']
    # start_urls = ['http://www.bilibili.com/']

    def start_requests(self):
        for i in range(1, 300):
            url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)
            url_ajax = 'https://space.bilibili.com/{}/'.format(i)
            # For a plain GET, scrapy.Request(url=..., callback=...) is enough;
            # the Referer header is set to the user's space page.
            req = scrapy.Request(url=url, callback=self.parse, meta={'id': i})
            req.headers['referer'] = url_ajax
            yield req

    def parse(self, response):
        # The response wraps the JSON in __jp3(...), so pull out the JSON body first.
        comm = re.compile(r'({.*})')
        text = re.findall(comm, response.text)[0]
        data = json.loads(text)
        follower = data['data']['follower']
        following = data['data']['following']
        id = response.meta.get('id')
        url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(id)
        yield scrapy.Request(url=url, callback=self.getsubmit, meta={
            'id': id,
            'follower': follower,
            'following': following
        })

    def getsubmit(self, response):
        data = json.loads(response.text)
        tlist = data['data']['tlist']
        tlist_list = []
        if tlist:
            # tlist is a dict keyed by category id; keep only the category names.
            for t in tlist.values():
                tlist_list.append(t['name'])
        else:
            tlist_list = ['无爱好']  # user has uploaded nothing
        follower = response.meta.get('follower')
        following = response.meta.get('following')
        id = response.meta.get('id')
        url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(id)
        yield scrapy.Request(url=url, callback=self.space, meta={
            'id': id,
            'follower': follower,
            'following': following,
            'tlist_list': tlist_list
        })

    def space(self, response):
        data = json.loads(response.text)
        name = data['data']['name']
        sex = data['data']['sex']
        level = data['data']['level']
        birthday = data['data']['birthday']
        tlist_list = response.meta.get('tlist_list')
        # One 0/1 flag per video category the user has uploaded in.
        animation = 0
        Life = 0
        Music = 0
        Game = 0
        Dance = 0
        Documentary = 0
        Ghost = 0
        science = 0
        Opera = 0
        entertainment = 0
        Movies = 0
        National = 0
        Digital = 0
        fashion = 0
        for tlist in tlist_list:
            if tlist == '动画':
                animation = 1
            elif tlist == '生活':
                Life = 1
            elif tlist == '音乐':
                Music = 1
            elif tlist == '游戏':
                Game = 1
            elif tlist == '舞蹈':
                Dance = 1
            elif tlist == '纪录片':
                Documentary = 1
            elif tlist == '鬼畜':
                Ghost = 1
            elif tlist == '科技':
                science = 1
            elif tlist == '番剧':
                Opera = 1
            elif tlist == '娱乐':
                entertainment = 1
            elif tlist == '影视':
                Movies = 1
            elif tlist == '国创':
                National = 1
            elif tlist == '数码':
                Digital = 1
            elif tlist == '时尚':
                fashion = 1
        item = BilibiliItem()
        item['name'] = name
        item['sex'] = sex
        item['level'] = level
        item['birthday'] = birthday
        item['follower'] = response.meta.get('follower')
        item['following'] = response.meta.get('following')
        item['animation'] = animation
        item['Life'] = Life
        item['Music'] = Music
        item['Game'] = Game
        item['Dance'] = Dance
        item['Documentary'] = Documentary
        item['Ghost'] = Ghost
        item['science'] = science
        item['Opera'] = Opera
        item['entertainment'] = entertainment
        item['Movies'] = Movies
        item['National'] = National
        item['Digital'] = Digital
        item['fashion'] = fashion
        yield item
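items.py is not shown in the post; a minimal definition that matches the fields assigned above (the actual repository may differ) would be:

import scrapy


class BilibiliItem(scrapy.Item):
    name = scrapy.Field()
    sex = scrapy.Field()
    level = scrapy.Field()
    birthday = scrapy.Field()
    follower = scrapy.Field()
    following = scrapy.Field()
    animation = scrapy.Field()
    Life = scrapy.Field()
    Music = scrapy.Field()
    Game = scrapy.Field()
    Dance = scrapy.Field()
    Documentary = scrapy.Field()
    Ghost = scrapy.Field()
    science = scrapy.Field()
    Opera = scrapy.Field()
    entertainment = scrapy.Field()
    Movies = scrapy.Field()
    National = scrapy.Field()
    Digital = scrapy.Field()
    fashion = scrapy.Field()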

Setting up a User-Agent pool (in middlewares.py)

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class randomUserAgentMiddleware(UserAgentMiddleware):

    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent from the pool for every outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
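The pool is enabled through DOWNLOADER_MIDDLEWARES in settings.py above (priority 400). If you want to be certain Scrapy's stock UserAgentMiddleware never competes with the random pool, it can also be switched off explicitly; one possible adjustment:

DOWNLOADER_MIDDLEWARES = {
    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
    'bilibili.middlewares.randomUserAgentMiddleware': 400,
    # Disable the built-in middleware so only the random pool sets User-Agent.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}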

GitHub repo: https://github.com/18370652038/scrapy-bilibili
