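Scrapy Persistent Storage: Escaping Scraped Data

Scrapy persistent storage

Escaping scraped data

Pass the values to execute() as a separate parameter tuple, as below, and the database driver escapes them for us:

cursor.execute('insert into wen values(%s,%s)', (item['title'], item['content']))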
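To see why the parameter form matters, here is a minimal sketch that uses sqlite3 from the standard library as a stand-in for MySQL, so it runs without a server. This is a swapped-in driver, not the one from this post: sqlite3 writes its placeholders as `?` where pymysql uses `%s`, and the table's column names and the sample item are assumptions made for illustration.

```python
import sqlite3

# Stand-in for the MySQL table `wen`; the two column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("create table wen (title text, content text)")

# Hypothetical scraped item whose text contains both kinds of quotes.
item = {"title": "It's a test", "content": 'He said "hi"'}

# Unsafe: pasting the values into the SQL text breaks on the quote in the title.
try:
    conn.execute("insert into wen values('%s','%s')" % (item["title"], item["content"]))
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False

# Safe: the values travel separately from the SQL text, so quotes are stored as data.
conn.execute("insert into wen values(?,?)", (item["title"], item["content"]))
conn.commit()

rows = conn.execute("select title, content from wen").fetchall()
```

The string-formatted insert fails with a syntax error, while the parameterized one stores the text unchanged.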

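Storage via a terminal command:

This approach can only store the return value of the parse method in a local file. Supported formats: json, jsonlines, jl, csv, xml, marshal, pickle.

Save command

scrapy crawl name -o xxx.csv

Pros: concise, efficient, and convenient

Cons: fairly limited (the data can only be saved to a local file, not to a database)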
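If you would rather not type the -o option on every run, the same export can be declared in the project's settings. The fragment below is a sketch, assuming a Scrapy version that supports the FEEDS setting (2.1 or later); the output file name is an assumption.

```python
# settings.py (sketch): roughly the same effect as `scrapy crawl name -o duanzi.csv`.
# Assumes a Scrapy version that supports the FEEDS setting (2.1 or later).
FEEDS = {
    "duanzi.csv": {          # output path; the file name is an assumption
        "format": "csv",     # any of the formats listed above
        "encoding": "utf8",
    },
}
```

The limitation stays the same: this still writes only what parse returns, and only to a file.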

# -*- coding: utf-8 -*-
import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://duanziwang.com/']

    def parse(self, response):
        div_list = response.xpath('//main/article')
        data = []
        for i in div_list:
            # xpath() returns a list of Selector objects. Call extract() to get
            # the matched strings, or extract_first() when only the first
            # match is needed.
            title = i.xpath('.//h1/a/text()').extract_first()
            content = i.xpath('./div[@class="post-content"]/p/text()').extract_first()
            da = {
                'title': title,
                'content': content
            }
            data.append(da)
        return data

Pipeline-based persistent storage

Coding workflow

1. Parse the data

# -*- coding: utf-8 -*-
import scrapy
from zx_spider.items import ZxSpiderItem


class Duanzi2Spider(scrapy.Spider):
    name = 'duanzi2'
    start_urls = ['https://ishuo.cn']

    def parse(self, response):
        data_list = response.xpath('//div[@id="list"]/ul/li')
        for i in data_list:
            title = i.xpath('./div[2]/a/text()').extract_first()
            content = i.xpath('./div[1]/text()').extract_first()
            print(title)
            print(content)
            # Create an item object and fill in the parsed values
            item = ZxSpiderItem()
            item['title'] = title
            item['content'] = content
            # Submit the item to the pipeline
            yield item

2. Wrap the parsed data in an item object (declare the corresponding fields in the Item class)

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ZxSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    # pass

3. Submit the item object to the pipeline for persistent storage. The pipeline class's process_item method persists the data of each item it receives.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ZxSpiderPipeline(object):
    fw = None

    # Called only once, when the spider starts
    def open_spider(self, spider):
        print("Starting to write scraped data")
        self.fw = open('./zx/duanzi2.csv', "w", encoding='utf8')

    # Receives each item object submitted by the spider file
    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.fw.write(title + "\n" + content + '\n')
        return item

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print("Finished writing scraped data")
        self.fw.close()
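The pipeline above writes the title and content on alternating lines, which is not really CSV, and the string concatenation raises a TypeError when extract_first() found nothing and returned None. A hedged variant, using the standard csv module, might look like the sketch below. The class name CsvWriterPipeline, the default path, and the header row are assumptions for illustration; the three method names are the ones used above.

```python
import csv
import os
import tempfile


class CsvWriterPipeline:
    """Hypothetical variant of ZxSpiderPipeline that writes real CSV rows."""

    def __init__(self, path="duanzi2.csv"):
        self.path = path
        self.fw = None
        self.writer = None

    def open_spider(self, spider):
        # newline='' lets the csv module control line endings itself.
        self.fw = open(self.path, "w", encoding="utf8", newline="")
        self.writer = csv.writer(self.fw)
        self.writer.writerow(["title", "content"])

    def process_item(self, item, spider):
        # extract_first() returns None when nothing matches, so fall back to ''.
        self.writer.writerow([item.get("title") or "", item.get("content") or ""])
        return item

    def close_spider(self, spider):
        self.fw.close()


# Drive the pipeline by hand, without Scrapy, to see what ends up in the file.
path = os.path.join(tempfile.mkdtemp(), "duanzi2.csv")
pipeline = CsvWriterPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "a, b", "content": "line1\nline2"}, spider=None)
pipeline.process_item({"title": None, "content": "only content"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf8", newline="") as f:
    rows = list(csv.reader(f))
```

Commas and line breaks inside a field are quoted by the csv module, so each item comes back as exactly one row when the file is read again.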

4. Enable the pipeline in the settings file

ITEM_PIPELINES = {
    'zx_spider.pipelines.ZxSpiderPipeline': 300,
    # 300 is the priority; the lower the number, the higher the priority
}

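Storing the scraped data on multiple platforms (file, MySQL)

The return item in ZxSpiderPipeline is not useless: it passes the item on to the pipeline with the next priority for processing (provided that pipeline is registered in settings).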

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class ZxSpiderPipeline(object):
    fw = None

    # Called only once, when the spider starts
    def open_spider(self, spider):
        print("Starting to write scraped data")
        self.fw = open('./zx/duanzi2.csv', "w", encoding='utf8')

    # Receives each item object submitted by the spider file
    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.fw.write(title + "\n" + content + '\n')
        # Hand the item on to the next pipeline
        return item

    def close_spider(self, spider):
        print("Finished writing scraped data")
        self.fw.close()

class MysqlSpiderPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print("Starting to write scraped data to the database")
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user="root", password='zx125', db="zx", charset='utf8')
        # Create the cursor once, so close_spider can close it even if no item arrived
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Passing the values as parameters lets pymysql escape them
            self.cursor.execute('insert into wen values(%s,%s)', (item['title'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print("Finished writing scraped data to the database")
        self.cursor.close()
        self.conn.close()

Configuration

ITEM_PIPELINES = {
    'zx_spider.pipelines.ZxSpiderPipeline': 300,
    'zx_spider.pipelines.MysqlSpiderPipeline': 301,
    # 300 is the priority; the lower the number, the higher the priority
}
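How the priority numbers and the return item statement work together can be shown with a toy model. This is not Scrapy's actual implementation, only a sketch of the ordering rule described above; the class names, the run_pipelines helper, and the seen_by field are made up for illustration.

```python
class FilePipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("file")
        return item  # hand the item on to the next pipeline


class MysqlPipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("mysql")
        return item


# Same shape as ITEM_PIPELINES, but keyed by class instead of import path.
PIPELINES = {MysqlPipeline: 301, FilePipeline: 300}


def run_pipelines(item, pipelines):
    # Lower numbers run first; each stage receives what the previous one returned.
    for cls in sorted(pipelines, key=pipelines.get):
        item = cls().process_item(item, spider=None)
    return item


result = run_pipelines({"title": "t", "content": "c", "seen_by": []}, PIPELINES)
```

The file pipeline (300) sees the item before the MySQL pipeline (301), regardless of the order in which the two entries are written. If the first stage forgot its return statement, the second stage in this model would receive None instead of the item.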
