Scrapy Persistent Storage: Escaping Scraped Data

Scrapy persistent storage

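Escaping scraped data

Pass the values to the database driver as query parameters, in the form below, and they are escaped for us automatically

cursor.execute('insert into wen values(%s,%s)', (item['title'], item['content']))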
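To see why parameter binding matters, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for MySQL (an assumption made so the example runs without a server), and sqlite3 uses `?` placeholders where pymysql uses `%s`. The item values are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table wen (title text, content text)")

# Hypothetical values containing quotes that would break a hand-built SQL string.
item = {"title": "It's a \"test\"", "content": "x'); drop table wen; --"}

# The driver binds the values itself; no manual escaping or quoting is needed.
conn.execute("insert into wen values (?, ?)", (item["title"], item["content"]))
conn.commit()

rows = conn.execute("select title, content from wen").fetchall()
print(rows)
```

The stored row matches the original strings exactly, quotes included, because the values never become part of the SQL text.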

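Storage via terminal command:

This approach can only store the return value of the parse method in a local file. Supported formats: json, jsonlines, jl, csv, xml, marshal, pickle.

Save command

scrapy crawl name -o xxx.csv

Advantages: concise, efficient, and convenient

Disadvantage: fairly limited (the data can only be saved to a local file, not to a database)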

# -*- coding: utf-8 -*-

import scrapy

class DuanziSpider(scrapy.Spider):

    name = 'duanzi'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['http://duanziwang.com/']

    def parse(self, response):

        div_list=response.xpath('//main/article')

        data=[]

        for i in div_list:

            title=i.xpath('.//h1/a/text()').extract_first()

            # xpath() returns a list of Selector objects; call extract() to get the content, or extract_first() when only the first match is needed

            content=i.xpath('./div[@class="post-content"]/p/text()').extract_first()

            da={

                'title':title,

                'content':content

            }

            data.append(da)

        return data
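The list of dicts returned by parse is what the -o option serializes. As a rough illustration of how such records map to CSV rows, the sketch below uses the standard-library csv module. It is not Scrapy's exporter, and the sample records are hypothetical.

```python
import csv
import io

# Hypothetical records, shaped like the list that parse() returns above.
data = [
    {"title": "First", "content": "Hello, world"},
    {"title": "Second", "content": 'He said "hi"'},
]

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "content"])
writer.writeheader()
writer.writerows(data)

csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain commas or quotes are quoted by the csv module, which is what keeps each record on a single, parseable row.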

Pipeline-based persistent storage

Coding workflow

1. Parse the data

# -*- coding: utf-8 -*-

import scrapy

from zx_spider.items import ZxSpiderItem

class Duanzi2Spider(scrapy.Spider):

    name = 'duanzi2'

    start_urls = ['https://ishuo.cn']

    def parse(self, response):

        data_list=response.xpath('//div[@id="list"]/ul/li')

        for i in data_list:

            title=i.xpath('./div[2]/a/text()').extract_first()

            content=i.xpath('./div[1]/text()').extract_first()

            print(title)

            print(content)

            # Create an item object and fill in the parsed values

            item=ZxSpiderItem()

            item['title']=title

            item['content']=content

            # Submit the item to the pipeline

            yield item

2. Wrap the parsed data in an item object (define the corresponding fields in the item class)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class ZxSpiderItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title = scrapy.Field()

    content = scrapy.Field()

    # pass

3. Submit the item object to the pipeline for persistent storage; the pipeline class's process_item method persists the data from each item object it receives

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

class ZxSpiderPipeline(object):

    fw=None

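    # Called only once, when the spider starts

    def open_spider(self,spider):

        print("Starting to write scraped data")

        # The ./zx directory must already exist; open() does not create it

        self.fw=open('./zx/duanzi2.csv',"w",encoding='utf8')

    # Receives each item object submitted by the spider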

    def process_item(self, item, spider):

        title=item['title']

        content=item['content']

        # extract_first() returns None when nothing matches, so fall back to ''

        self.fw.write((title or '')+"\n"+(content or '')+'\n')

        return item

    def close_spider(self,spider):

        print("Finished writing scraped data")

        self.fw.close()
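The pipeline above writes the title and content on separate lines, so the output is not really CSV despite the file extension. Below is a minimal sketch of the same open_spider / process_item / close_spider lifecycle using the standard-library csv module. The class name CsvWriterPipeline and its path argument are hypothetical; inside a real Scrapy project the path would need to be fixed in the class or supplied through a from_crawler classmethod, since Scrapy constructs pipelines itself.

```python
import csv
import os
import tempfile


class CsvWriterPipeline:
    """Hypothetical pipeline that writes one CSV row per item."""

    def __init__(self, path):
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings itself.
        self.file = open(self.path, "w", encoding="utf8", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["title", "content"])

    def process_item(self, item, spider):
        # extract_first() may return None, so fall back to an empty string.
        self.writer.writerow([item.get("title") or "", item.get("content") or ""])
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand with a plain dict instead of a running crawl.
path = os.path.join(tempfile.mkdtemp(), "duanzi2.csv")
pipeline = CsvWriterPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "A, B", "content": None}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Because a pipeline is a plain class, it can be exercised like this without starting a crawl, which is a convenient way to check the file output.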

4. Enable the pipeline in the settings file

ITEM_PIPELINES = {

   'zx_spider.pipelines.ZxSpiderPipeline': 300,

    # 300 is the priority; the smaller the number, the higher the priority

}

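Storing scraped data in multiple backends (file, MySQL)

The return in ZxSpiderPipeline is not pointless: it passes the item on to the pipeline with the next priority for processing (provided that pipeline is enabled in the settings file)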

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql

class ZxSpiderPipeline(object):

    fw=None

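    # Called only once, when the spider starts

    def open_spider(self,spider):

        print("Starting to write scraped data")

        # The ./zx directory must already exist; open() does not create it

        self.fw=open('./zx/duanzi2.csv',"w",encoding='utf8')

    # Receives each item object submitted by the spider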

    def process_item(self, item, spider):

        title=item['title']

        content=item['content']

        # extract_first() returns None when nothing matches, so fall back to ''

        self.fw.write((title or '')+"\n"+(content or '')+'\n')

        return item

    def close_spider(self,spider):

        print("Finished writing scraped data")

        self.fw.close()

class MysqlSpiderPipeline(object):

    conn=None

    cursor=None

    def open_spider(self,spider):

        print("Starting to write scraped data to the database")

        self.conn=pymysql.Connect(host='127.0.0.1',port=3306,user="root",password='zx125',db="zx",charset='utf8')

        # Create the cursor once here so that close_spider can always close it

        self.cursor=self.conn.cursor()

    def process_item(self, item, spider):

        try:

            self.cursor.execute('insert into wen values(%s,%s)',(item['title'],item['content']))

            self.conn.commit()

        except Exception as e:

            print(e)

            self.conn.rollback()

        return item

    def close_spider(self,spider):

        print("Finished writing scraped data to the database")

        self.cursor.close()

        self.conn.close()

Configuration

ITEM_PIPELINES = {

   'zx_spider.pipelines.ZxSpiderPipeline': 300,

   'zx_spider.pipelines.MysqlSpiderPipeline': 301,

    # The number is the priority; the smaller the number, the higher the priority

}
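How the priority numbers and the returned item interact can be sketched with a simplified model. This is not Scrapy's actual implementation; the two stand-in classes and the seen_by field are hypothetical and exist only to record the order of calls.

```python
# Simplified model of pipeline ordering; not Scrapy's actual implementation.


class FilePipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("file")
        return item  # hand the item to the next pipeline


class MysqlPipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("mysql")
        return item


# Same shape as ITEM_PIPELINES: pipeline -> priority number.
item_pipelines = {MysqlPipeline: 301, FilePipeline: 300}

item = {"title": "t", "content": "c", "seen_by": []}

# Lower numbers run first; each return value feeds the next pipeline.
for pipeline_cls, _priority in sorted(item_pipelines.items(), key=lambda kv: kv[1]):
    item = pipeline_cls().process_item(item, spider=None)

print(item["seen_by"])
```

If a process_item method in this model returned None instead of the item, the next pipeline would receive None, which is why the return statement in each pipeline matters.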
