Scrapy: making a given spider run only specified pipelines

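When a Scrapy project contains several pipelines, this is how to make each spider run only the pipelines specified for it.
1: Create a decorator (saved as checkpipe.py, which step 3 imports)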
import functools

# When several pipelines are enabled, this decorator decides whether a
# given pipeline should run for the current spider.


def check_spider_pipeline(process_item_method):
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)
        # Run this pipeline only if the spider lists its class in `pipeline`.
        if self.__class__ in getattr(spider, 'pipeline', ()):
            spider.logger.debug(msg % 'executing')
            return process_item_method(self, item, spider)
        # Otherwise return the untouched item (skip this step in the
        # pipeline). Raising DropItem here would stop the item from
        # reaching any later pipeline, including ones the spider wants.
        spider.logger.debug(msg % 'skipping')
        return item
    return wrapper
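To see the filtering in isolation, here is a minimal sketch that runs without Scrapy. It assumes a pass-through variant of the decorator (a skipped pipeline returns the item unchanged), and `JsonPipeline`, `DbPipeline`, `FakeSpider` and `run_pipelines` are hypothetical stand-ins invented for the illustration, not part of the original project or of Scrapy.

```python
import functools
import logging


def check_spider_pipeline(process_item_method):
    """Run the wrapped process_item only if the spider lists this pipeline class."""
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in getattr(spider, 'pipeline', ()):
            spider.logger.debug('%s: executing', self.__class__.__name__)
            return process_item_method(self, item, spider)
        spider.logger.debug('%s: skipping', self.__class__.__name__)
        return item  # pass the item on unchanged
    return wrapper


class JsonPipeline:  # hypothetical pipeline
    @check_spider_pipeline
    def process_item(self, item, spider):
        item['seen_by'].append('json')
        return item


class DbPipeline:  # hypothetical pipeline
    @check_spider_pipeline
    def process_item(self, item, spider):
        item['seen_by'].append('db')
        return item


class FakeSpider:
    """Stand-in for a spider: only the attributes the decorator reads."""
    name = 'demo'
    pipeline = {DbPipeline}
    logger = logging.getLogger('demo')


def run_pipelines(item, spider, pipelines):
    """Mimic how items are handed from one pipeline to the next, in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


result = run_pipelines({'seen_by': []}, FakeSpider(), [JsonPipeline(), DbPipeline()])
print(result['seen_by'])  # ['db']
```

Only `DbPipeline` touches the item, because it is the only class in `FakeSpider.pipeline`. If the skip branch raised `DropItem` instead, the item would be discarded at `JsonPipeline` and never reach `DbPipeline`.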
2: In each spider class, add a `pipeline` set containing the pipeline classes that spider should run
# -*- coding: utf-8 -*-
import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# from scrapy.selector import Selector

from ..items import BotcnblogsItem, BotItem
from ..BotcnblogsPipeline import BotcnblogsPipeline


class CnblogsSpider(CrawlSpider):
    # Pipelines that should process this spider's items
    pipeline = {BotcnblogsPipeline}
    # Spider name
    name = "cnblogs"
    # Allowed domains
    allowed_domains = ["cnblogs.com"]
    # Pages to start crawling from
    start_urls = (
        'http://www.cnblogs.com/fengzheng/',
    )

    # The parse_item callback is not shown here
    rules = (
        Rule(LinkExtractor(allow=(r'fengzheng/default.html\?page\=([\d]+)')), callback='parse_item', follow=True),
        # Rule(LinkExtractor(allow=(r'fengzheng/p/([\d]+).html')), callback='parse_info', follow=True),
    )

3: Add the decorator to the process_item method of each pipeline, so the pipeline runs only for spiders that list it
import json

from .checkpipe import check_spider_pipeline


class BotcnblogsPipeline(object):

    def __init__(self):
        # Open as UTF-8 so the non-ASCII text written below is stored correctly
        self.file = open('jd.json', 'w+', encoding='utf-8')

    @check_spider_pipeline
    def process_item(self, item, spider):
        # For Chinese text, pass ensure_ascii=False; otherwise non-ASCII
        # characters are written as \uXXXX escapes
        record = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(record)
        return item

    def open_spider(self, spider):
        print("Spider opened")

    def close_spider(self, spider):
        print("Spider closed")
        self.file.close()
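The decorator only filters among pipelines that Scrapy actually runs, so every pipeline that any spider might need still has to be enabled in the project's `ITEM_PIPELINES` setting. A minimal sketch of that setting follows; the package name `myproject` and the second pipeline `OtherPipeline` are assumptions made for illustration, since the original post does not show its settings.py.

```python
# settings.py (sketch; the module paths below are hypothetical)
# Enable every pipeline here; the decorator then decides per spider
# which of them actually process an item. Lower numbers run first.
ITEM_PIPELINES = {
    'myproject.BotcnblogsPipeline.BotcnblogsPipeline': 300,
    'myproject.OtherPipeline.OtherPipeline': 400,
}
```

With this in place, a spider whose `pipeline` set contains only `BotcnblogsPipeline` has its items written by that pipeline, while the other decorated pipeline skips them.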

For a concrete example, see the cnblogs spider in the downloadable sample project.
