Scrapy: Pipeline

Official documentation: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

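　　To activate a pipeline you configure it in settings, but a pipeline configured there applies to every spider in the project. If the project runs many spiders, item pipeline handling becomes awkward. You can use the spider argument of process_item(self, item, spider) to tell which spider an item came from, but that approach gets redundant quickly. A better approach is to set the custom_settings attribute on the spider class, so that each spider is given its own pipelines. The example below does this, and also shows how custom_settings is used and what it is good for: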

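import scrapy


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'

    # Overrides the project-wide settings for this spider only.
    # The key must be exactly 'ITEM_PIPELINES' (no trailing space),
    # otherwise Scrapy does not recognise it as the pipeline setting.
    custom_settings = {
        'ITEM_PIPELINES': {
            'TB.pipelines.TBMongoPipeline': 300,
        }
    }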
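
For contrast, the spider-argument approach mentioned above might look like the sketch below. It is only an illustration and not taken from the original post: the class name SharedPipeline, the second spider name 'other', and the 'source' field are hypothetical, and a SimpleNamespace stands in for a real spider so the snippet runs without a Scrapy project.

```python
from types import SimpleNamespace


class SharedPipeline:
    """One pipeline enabled project-wide that branches on the spider name."""

    def process_item(self, item, spider):
        # Every new spider needs another branch here, which is why
        # per-spider custom_settings tends to scale better.
        if spider.name == 'xiaohua':
            item['source'] = 'xiaohua'
        elif spider.name == 'other':  # hypothetical second spider
            item['source'] = 'other'
        return item


# Stand-in for a spider object: only the `name` attribute is used.
fake_spider = SimpleNamespace(name='xiaohua')
result = SharedPipeline().process_item({'title': 'demo'}, fake_spider)
print(result)
```

With custom_settings, each spider simply names its own pipeline class and no such branching is needed.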

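I. Methods

　　1 process_item(self, item, spider)

　　This method is called for every item pipeline component. It must either return an item object or raise a DropItem exception; dropped items are not processed by later pipeline components.

　　2 open_spider(self, spider)

　　This method is called when the spider is opened.

　　3 close_spider(self, spider)

　　This method is called when the spider is closed.

　　4 from_crawler(cls, crawler)

　　If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline.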
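
The call order of these four hooks can be sketched without running a crawl. The snippet below is a simulation under stated assumptions: the setting name PIPELINE_LABEL is made up, and the crawler and spider are hand-rolled stand-ins (a plain dict plays the role of the settings object), not Scrapy's own classes.

```python
from types import SimpleNamespace


class LifecyclePipeline:
    """Records the order in which the four hooks are invoked."""

    def __init__(self, label):
        self.label = label
        self.calls = []

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings and
        # return a new pipeline instance.
        return cls(label=crawler.settings.get('PIPELINE_LABEL', 'default'))

    def open_spider(self, spider):
        self.calls.append('open_spider')

    def process_item(self, item, spider):
        self.calls.append('process_item')
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.calls.append('close_spider')


# Hypothetical stand-ins for Scrapy's crawler and spider objects.
fake_crawler = SimpleNamespace(settings={'PIPELINE_LABEL': 'demo'})
fake_spider = SimpleNamespace(name='xiaohua')

pipeline = LifecyclePipeline.from_crawler(fake_crawler)
pipeline.open_spider(fake_spider)
for item in ({'id': 1}, {'id': 2}):
    pipeline.process_item(item, fake_spider)
pipeline.close_spider(fake_spider)
print(pipeline.calls)
```

In a real project Scrapy makes these calls itself; the manual driver at the bottom exists only to make the sequence visible.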

II. Item Pipeline examples

　　1 Write items to MongoDB

import pymongo


class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from the project settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Open the connection once, when the spider starts.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Release the connection when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        # Store the item as a document, then pass it on unchanged.
        self.db[self.collection_name].insert_one(dict(item))
        return item
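
For MongoPipeline to do anything, the settings it reads have to be defined and the pipeline has to be enabled. A possible settings.py fragment is sketched below. The values are placeholders rather than anything from the original post: it assumes the project package is called TB (as in the spider example earlier), a MongoDB server on localhost at the default port, and a database named scrapy_demo.

```python
# settings.py (fragment)

# Read by MongoPipeline.from_crawler via crawler.settings.get(...).
MONGO_URI = 'mongodb://localhost:27017'  # placeholder URI
MONGO_DATABASE = 'scrapy_demo'           # placeholder; the pipeline falls back to 'items'

# The integer sets the run order: pipelines with lower values run first
# (values are customarily kept in the 0-1000 range).
ITEM_PIPELINES = {
    'TB.pipelines.MongoPipeline': 300,   # assumed module path
}
```

The same ITEM_PIPELINES dictionary can be placed in a spider's custom_settings instead, to enable the pipeline for that spider only.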

　　2 Duplicates filter

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
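
The filter's behaviour can be checked by hand, roughly as below. This is a sketch rather than how Scrapy drives the pipeline: the sample items are invented, None is passed for the unused spider argument, and a local DropItem stand-in is defined only so the snippet still runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the demo also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item


pipeline = DuplicatesPipeline()
kept, dropped = [], []
for item in ({'id': 1}, {'id': 2}, {'id': 1}):  # invented sample items
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        dropped.append(item)

print(len(kept), len(dropped))
```

Because ids_seen is an in-memory set on the pipeline instance, duplicates are only detected within a single run; nothing is remembered between crawls.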
