Exporting various formats with Scrapy pipelines

When using pipelines in Scrapy, we often need to export items as CSV, JSON, JSON Lines and similar formats. Writing a separate class for every export is tedious.

Here is a single pipelines file I put together that supports several formats.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

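import logging

from scrapy import signals
from scrapy.exporters import (
    CsvItemExporter,
    JsonItemExporter,
    JsonLinesItemExporter,
    MarshalItemExporter,
    PickleItemExporter,
    PprintItemExporter,
    XmlItemExporter,
)

logger = logging.getLogger(__name__)


class BaseExportPipeLine(object):
    """Write every scraped item to a file through a Scrapy item exporter."""

    def __init__(self, **kwargs):
        self.files = {}
        # Exporter class; it is instantiated when the spider opens.
        self.exporter_cls = kwargs.pop("exporter", None)
        # Path of the output file.
        self.dst = kwargs.pop("dst", None)
        # All remaining options are passed to the exporter's constructor.
        self.option = kwargs
        self.exporter = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # Exporters write bytes, so the file is opened in binary mode.
        file = open(self.dst, 'wb')
        self.files[spider] = file
        self.exporter = self.exporter_cls(file, **self.option)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        # Return the item so that later pipelines still receive it.
        return item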

# Available options:
# 'fields_to_export': ["url", "edit_url", "title"]  export only these fields; supported by all pipelines below
# 'export_empty_fields': False  whether to export empty fields; supported by all pipelines below
# 'encoding': 'utf-8'  output encoding; supported by all pipelines below
# 'indent': 1  indentation; mainly used by JsonExportPipeline
# "item_element": "item"  name of each item's XML element, XmlExportPipeline only; gives <item></item>
# "root_element": "items"  name of the XML root element, XmlExportPipeline only; gives <items>many item elements</items>
# "include_headers_line": True  whether to write the header row with the field names, CsvExportPipeline only
# "join_multivalued": ","  string used to join the values of a multi-valued field (e.g. a list) into one cell,
#     CsvExportPipeline only. This is not the column separator: CsvItemExporter forwards its leftover
#     keyword arguments to csv.writer(), so the column separator is csv.writer's "delimiter" argument.
# 'protocol': 2  pickle protocol, PickleExportPipeline only
# "dst": "items.json"  destination file


class JsonExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": JsonItemExporter, "dst": "items.json", "encoding": "utf-8", "indent": 4}
        super(JsonExportPipeline, self).__init__(**option)


class JsonLinesExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": JsonLinesItemExporter, "dst": "items.jl", "encoding": "utf-8"}
        super(JsonLinesExportPipeline, self).__init__(**option)


class XmlExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": XmlItemExporter, "dst": "items.xml", "item_element": "item",
                  "root_element": "items", "encoding": "utf-8"}
        super(XmlExportPipeline, self).__init__(**option)


class CsvExportPipeline(BaseExportPipeLine):
    def __init__(self):
        # In my tests, join_multivalued did not change the column separator.
        # It only controls how multi-valued fields are joined inside one cell.
        option = {"exporter": CsvItemExporter, "dst": "items.csv", "encoding": "utf-8",
                  "include_headers_line": True, "join_multivalued": ","}
        super(CsvExportPipeline, self).__init__(**option)


class PickleExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": PickleItemExporter, "dst": "items.pickle", "protocol": 2}
        super(PickleExportPipeline, self).__init__(**option)


class MarshalExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": MarshalItemExporter, "dst": "items.marshal"}
        super(MarshalExportPipeline, self).__init__(**option)


class PprintExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": PprintItemExporter, "dst": "items.pprint.jl"}
        super(PprintExportPipeline, self).__init__(**option)
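A note on the separator comment in CsvExportPipeline. As far as I understand CsvItemExporter, join_multivalued only joins the values of a list-valued field into a single cell, while the column separator belongs to csv.writer(), which receives the exporter's leftover keyword arguments (for example delimiter). The sketch below uses only the standard library csv module to show the difference between the two; the row_to_cells and export_csv helpers and the sample item are hypothetical stand-ins, not Scrapy code.

```python
import csv
import io


def row_to_cells(item, fields, join_multivalued=","):
    """Hypothetical helper: join list/tuple values into one cell per field."""
    cells = []
    for name in fields:
        value = item.get(name, "")
        if isinstance(value, (list, tuple)):
            value = join_multivalued.join(str(v) for v in value)
        cells.append(value)
    return cells


def export_csv(items, fields, join_multivalued=",", **writer_kwargs):
    """Hypothetical helper: writer_kwargs go straight to csv.writer()."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, **writer_kwargs)
    writer.writerow(fields)  # header row
    for item in items:
        writer.writerow(row_to_cells(item, fields, join_multivalued))
    return buffer.getvalue()


items = [{"title": "scrapy", "tags": ["python", "crawler"]}]
fields = ["title", "tags"]

# join_multivalued only changes how the list is joined inside its cell.
joined = export_csv(items, fields, join_multivalued="|")
# delimiter changes the separator between columns.
delimited = export_csv(items, fields, join_multivalued="|", delimiter=";")

print(joined)
print(delimited)
```

If this reading is right, adding a "delimiter" key to the option dict of CsvExportPipeline would be the way to change the column separator; I have not checked this against every Scrapy version.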

Once the classes above are defined, we can choose which export classes to enable in settings.py.

ITEM_PIPELINES = {
    'ScrapyCnblogs.pipelines.PprintExportPipeline': 300,
    #'ScrapyCnblogs.pipelines.JsonLinesExportPipeline': 302,
    #'ScrapyCnblogs.pipelines.JsonExportPipeline': 303,
    #'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,
}
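The integer after each class path is the pipeline's order: Scrapy runs the enabled pipelines from the lowest value to the highest, and the values are conventionally kept in the 0-1000 range. A minimal sketch of that ordering in plain Python, with all four entries enabled for illustration (it only shows the sort and is not how Scrapy loads pipelines internally):

```python
ITEM_PIPELINES = {
    'ScrapyCnblogs.pipelines.PprintExportPipeline': 300,
    'ScrapyCnblogs.pipelines.JsonLinesExportPipeline': 302,
    'ScrapyCnblogs.pipelines.JsonExportPipeline': 303,
    'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,
}

# Lower numbers run first: sort the class paths by their order value.
run_order = [
    path
    for path, order in sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]

for path in run_order:
    # Print only the class name, i.e. the part after the last dot.
    print(path.rsplit('.', 1)[-1])
```

Because process_item in the base class returns the item, enabling several of these pipelines at once writes every item to each enabled output file.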

Pretty powerful, isn't it? If you are interested, you can find the source code for this part on GitHub: https://github.com/scrapy/scrapy/blob/master/scrapy/exporters.py

The tests for the exporters are at https://github.com/scrapy/scrapy/blob/master/tests/test_exporters.py and the source is worth reading if you are interested.
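Before opening the source, it may help to see how small the contract between the pipeline and an exporter is: start_exporting once, export_item for every item, finish_exporting once. The sketch below mimics those three hooks with a toy class writing to an in-memory buffer; ToyJsonLinesExporter is a hypothetical stand-in of my own and not one of Scrapy's exporters.

```python
import io
import json


class ToyJsonLinesExporter:
    """Hypothetical stand-in that mimics the three exporter hooks."""

    def __init__(self, file, encoding="utf-8"):
        self.file = file
        self.encoding = encoding

    def start_exporting(self):
        # This format needs no header before the first item.
        pass

    def export_item(self, item):
        # One JSON object per line, written as bytes.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line.encode(self.encoding))

    def finish_exporting(self):
        # This format needs no footer after the last item.
        pass


buffer = io.BytesIO()
exporter = ToyJsonLinesExporter(buffer)

exporter.start_exporting()                 # what spider_opened does
for item in [{"title": "a"}, {"title": "b"}]:
    exporter.export_item(item)             # what process_item does
exporter.finish_exporting()                # what spider_closed does

output = buffer.getvalue().decode("utf-8")
print(output)
```

Formats that need an opening and closing wrapper around the items would use the start and finish hooks for that, which is why the pipeline calls them even when they do nothing here.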

For a complete usage example, see my GitHub project: https://github.com/zhaojiedi1992/ScrapyCnblogs
