[scrapy-redis] Converting a Scrapy spider into a distributed crawler (2)

1. Modify the Redis settings

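Redis runs in protected mode by default. Edit /etc/redis.conf and either set protected-mode no or give Redis a password (requirepass).

Then comment out the line bind 127.0.0.1 with # so that Redis accepts connections from other hosts.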
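
As a quick sanity check of the edits described above, the following sketch scans the text of a redis.conf file and reports whether protected mode is off, whether a password is set, and whether a bind line is still active. The function name check_remote_access and the keys of the returned dictionary are made up for illustration; only the directive names protected-mode, requirepass, and bind come from Redis itself.

```python
def check_remote_access(conf_text):
    """Summarize the redis.conf directives that affect remote connections."""
    # Redis enables protected mode unless the config says otherwise.
    result = {"protected_mode": True, "has_password": False, "bind": None}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and lines commented out with '#'.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        directive, args = parts[0].lower(), parts[1:]
        if directive == "protected-mode" and args:
            result["protected_mode"] = args[0].lower() != "no"
        elif directive == "requirepass" and args:
            result["has_password"] = True
        elif directive == "bind" and args:
            result["bind"] = args
    return result


if __name__ == "__main__":
    sample = """
    # bind 127.0.0.1
    protected-mode no
    """
    print(check_remote_access(sample))
```

If bind comes back as None and either protected_mode is False or has_password is True, the worker machines should be able to reach the master, assuming no firewall sits in between.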

2. Modify the spider settings

Add the following settings to settings.py.

REDIS_URL is the master's IP address plus the Redis port.

# For scrapy_redis

# Enables scheduling storing requests queue in redis.

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Don't cleanup redis queues, allows to pause/resume crawls.

SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue' 

# Store scraped item in redis for post-processing.

ITEM_PIPELINES = {

    'scrapy_redis.pipelines.RedisPipeline': 300

}

# Specify the host and port to use when connecting to Redis (optional).

#REDIS_HOST = 'localhost'

#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).

# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.

#REDIS_URL = 'redis://user:pass@hostname:9001'

REDIS_URL = 'redis://192.168.1.20:6379' # change to your own IP and port
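
If Redis is protected by a password instead of having protected mode switched off, the password has to go into REDIS_URL as well. The helper below is a minimal sketch of how such a URL could be assembled. The name build_redis_url and its parameters are hypothetical, and the trailing /0 simply selects database 0, which is also what a URL without a database path would use.

```python
from urllib.parse import quote


def build_redis_url(host, port=6379, password=None, db=0):
    """Return a redis:// URL for the given master host and port."""
    # Percent-encode the password so characters like '@' or ':' stay valid.
    auth = ":" + quote(password, safe="") + "@" if password else ""
    return "redis://{}{}:{}/{}".format(auth, host, port, db)


if __name__ == "__main__":
    print(build_redis_url("192.168.1.20"))
    print(build_redis_url("192.168.1.20", password="p@ss"))
```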

3. Modify the spider code

Make the spider inherit from RedisSpider:

from scrapy_redis.spiders import RedisSpider

class DoubanSpider(RedisSpider):

Add a redis_key attribute; this is the Redis key under which the start URLs are stored.
Comment out start_urls.

#!/usr/bin/python3

# -*- coding: utf-8 -*-

import scrapy

from scrapy import Request

from project_douban.items import Movie

from scrapy_redis.spiders import RedisSpider

class DoubanSpider(RedisSpider):

    name = 'douban'

    allowed_domains = ['douban.com']

    redis_key = "doubanSpider:start_urls"

    #start_urls = ['https://movie.douban.com/top250']

    headers = {

        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

        'Accept-Language': 'en',

        'User-Agent' : 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',

    }

    custom_settings = {

        'DEFAULT_REQUEST_HEADERS' : headers,

        'REDIRECT_ENABLED' : False,

        #'LOG_LEVEL' : 'WARNING',

    }

    def parse(self, response):

        items = response.xpath('//div[@class="item"]')

        for item in items:

            movie = Movie()

            movie['index'] = item.xpath('div//em/text()').extract_first(default = '')

            self.logger.info(movie['index'])

            movie['src'] = item.xpath('div//img/@src').extract_first(default = '')

            self.logger.info(movie['src'])

            movie['title'] = item.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first(default = '') #.xpath('string(.)').extract()).replace(' ','').replace('\xa0',' ').replace('\n',' ')

            self.logger.info(movie['title'])

            movie['star'] = item.xpath('.//span[@class="rating_num"]/text()').extract_first(default = '')

            self.logger.info(movie['star'])

            movie['info'] = item.xpath('.//div[@class="bd"]/p').xpath('string(.)').extract_first(default = '').strip().replace(' ','').replace('\xa0',' ').replace('\n',' ')

            self.logger.info(movie['info'])

            yield movie

        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first(default = '')

        self.logger.info('next_url: ' + next_url)

        if next_url:

            next_url = 'https://movie.douban.com/top250' + next_url

            yield Request(next_url, headers = self.headers)
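
With start_urls commented out, the spider waits until a URL shows up under redis_key. The sketch below pushes the first URL onto that key. It assumes the start URLs are kept in a Redis list (REDIS_START_URLS_AS_SET left at False); with a set, SADD would be needed instead of LPUSH. The helper names seed_start_urls and seed_douban are hypothetical, and the Redis address is the example one from settings.py.

```python
def seed_start_urls(client, key, urls):
    """Push start URLs onto the Redis list that the spider reads from.

    `client` is any object with an lpush(key, *values) method,
    such as a redis.Redis instance.
    """
    urls = list(urls)
    if not urls:
        return 0
    # LPUSH returns the length of the list after the push.
    return client.lpush(key, *urls)


def seed_douban(redis_url="redis://192.168.1.20:6379"):
    """Seed the douban spider; requires the redis-py package."""
    import redis  # imported here so the helper above has no dependency

    client = redis.Redis.from_url(redis_url)
    return seed_start_urls(client, "doubanSpider:start_urls",
                           ["https://movie.douban.com/top250"])
```

Calling seed_douban() once from any machine that can reach the master should be enough; every worker running the spider then pulls requests from the shared queue.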

Writing logs to a file (optional)

import logging

import os

import time

def get_logger(name, start_time = time.strftime('%Y_%m_%d_%H', time.localtime())):

    path = '/var/log/scrapy-redis/'

    # path = 'baidu_tieba.log'

    if not os.path.exists(path):

        os.makedirs(path)

    log_path = path + start_time

    # Create a logger

    my_logger = logging.getLogger(name)

    my_logger.setLevel(logging.INFO)

    formatter = logging.Formatter('[%(asctime)s] [%(levelname)s] %(filename)s[line:%(lineno)d] %(message)s', datefmt = '%Y-%m-%d %H:%M:%S')

    # Create handlers that write to the log files

    handler_info = logging.FileHandler('%s_info.log' % log_path, 'a', encoding='UTF-8')

    handler_info.setLevel(logging.INFO)

    handler_info.setFormatter(formatter)

    my_logger.addHandler(handler_info)

    handler_warning = logging.FileHandler('%s_warning.log' % log_path, 'a', encoding='UTF-8')

    handler_warning.setLevel(logging.WARNING)

    handler_warning.setFormatter(formatter)

    my_logger.addHandler(handler_warning)

    handler_error = logging.FileHandler('%s_error.log' % log_path, 'a', encoding='UTF-8')

    handler_error.setLevel(logging.ERROR)

    handler_error.setFormatter(formatter)

    my_logger.addHandler(handler_error)

    return my_logger
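
Because every handler has its own level, a record is written to each file whose threshold it meets: an ERROR record ends up in all three files, while an INFO record only reaches the info file. The following self-contained sketch illustrates that behaviour. It is not the function above: it writes to a temporary directory instead of /var/log/scrapy-redis/, uses hypothetical names (make_level_logger, read_log, demo), and adds a guard so that calling it twice for the same logger name does not attach the handlers a second time.

```python
import logging
import os
import tempfile


def make_level_logger(name, directory):
    """Attach one file handler per level (INFO, WARNING, ERROR) to a logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Without this guard, a second call for the same name would add the
    # handlers again and every record would be written twice.
    if logger.handlers:
        return logger
    formatter = logging.Formatter('[%(levelname)s] %(message)s')
    for level in (logging.INFO, logging.WARNING, logging.ERROR):
        filename = '%s.log' % logging.getLevelName(level).lower()
        handler = logging.FileHandler(
            os.path.join(directory, filename), 'a', encoding='UTF-8')
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def read_log(directory, filename):
    """Return the lines of one log file."""
    with open(os.path.join(directory, filename), encoding='UTF-8') as f:
        return f.read().splitlines()


def demo():
    """Log one record per level and return the lines found in each file."""
    directory = tempfile.mkdtemp()
    # Use a unique logger name per run so each run gets fresh handlers.
    name = 'level_demo_' + os.path.basename(directory)
    logger = make_level_logger(name, directory)
    logger.info('page parsed')
    logger.warning('slow response')
    logger.error('request failed')
    for handler in logger.handlers:
        handler.flush()
    return {fname: read_log(directory, fname)
            for fname in ('info.log', 'warning.log', 'error.log')}


if __name__ == "__main__":
    for fname, lines in demo().items():
        print(fname, lines)
```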

Miscellaneous

RedisSpider vs RedisCrawlSpider

Going straight to the source code, here is a side-by-side comparison:

Item	RedisSpider	RedisCrawlSpider
REDIS_START_URLS_AS_SET	default: False	default: True
Base class	inherits from Spider	inherits from CrawlSpider

scrapy.Spider -> scrapy.CrawlSpider

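scrapy.Spider is the base class of all spiders. scrapy.CrawlSpider builds on scrapy.Spider and adds rules, which let you define patterns so that only pages matching them are crawled. RedisCrawlSpider inherits this feature as well.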
