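Persistence in Scrapy

Persistent storage via terminal commands

Make sure the parse method of the spider file returns an iterable object (usually a list or a dict). That return value can then be written to a file in a chosen format by a terminal command, which persists it to disk.

Task: scrape the title and content of each joke on the Qiushibaike (Qiubai) home page.

Creating a new project

In cmd: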

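# Create the project

scrapy startproject qiubaiDemo

# Enter the project directory

cd qiubaiDemo

# Create the spider and its starting URL (any placeholder domain will do for now; edit it later)

scrapy genspider qiubai www.xxx.com

# After editing the spider, run it with

scrapy crawl <spider name>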

The settings file

Line 19: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'  # disguise the identity of the request carrier

Line 22: ROBOTSTXT_OBEY = False  # ignore (do not obey) the robots protocol

Edit the qiubai spider file so that it reads:

# -*- coding: utf-8 -*-

import scrapy

class QiubaiSpider(scrapy.Spider):

    name = 'qiubai'

    allowed_domains = ['www.xxx.com']

    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):

        div_list = response.xpath('//div[@id="content-left"]/div')

        all_data = []

        for div in div_list:

            title = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()

            content = div.xpath('./a[1]/div/span/text()').extract_first()

            dic = {

                'title': title,

                'content': content

            }

            all_data.append(dic)

            print(all_data)

        # Terminal-command persistence: the data held in the return value of parse can be written to local disk by a terminal command

        return all_data

Run the spider

scrapy crawl qiubai -o 糗百.csv

* Export in a chosen format: the scraped data can be written to files of different formats

    scrapy crawl <spider name> -o xxx.json

    scrapy crawl <spider name> -o xxx.xml

    scrapy crawl <spider name> -o xxx.csv
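
As a rough illustration of what the -o option amounts to, the sketch below serializes a list of dicts shaped like the one parse returns into JSON and CSV with the standard library. This is not Scrapy's exporter code; the sample rows and the helper names to_json and to_csv are made up for the example.

```python
import csv
import io
import json

# Hypothetical sample shaped like the list that parse() returns
all_data = [
    {'title': 'user_a', 'content': 'first joke'},
    {'title': 'user_b', 'content': 'second joke'},
]


def to_json(rows):
    # ensure_ascii=False keeps non-ASCII text readable in the output
    return json.dumps(rows, ensure_ascii=False)


def to_csv(rows):
    # Write a header row followed by one line per dict
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['title', 'content'], lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


json_text = to_json(all_data)
csv_text = to_csv(all_data)
print(csv_text)
```

The point is only that each dict becomes one record and the dict keys become the column or field names, which is why parse has to return (or yield) dict-like objects.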

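Pipeline-based persistent storage

Scrapy already ships with an efficient, convenient persistence mechanism that we can use directly. To use it, first get to know these two files:

    items.py: the data structure template file. It defines the data fields.

    pipelines.py: the pipeline file. It receives the data (items) and persists it.

Persistence flow:

    1. After the spider file has scraped the data, wrap the data in an item object.

    2. Use the yield keyword to submit the item object to the pipelines for persistence.

    3. In the pipeline file, the process_item method receives each item submitted by the spider file; write the persistence code there to store the item's data.

    4. Enable the pipeline in the settings.py configuration file.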
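
The flow above can be sketched without Scrapy installed. In the sketch, the run function plays the role of the Scrapy engine and ListPipeline stores items in memory instead of on disk; both names, and the sample titles, are hypothetical and only meant to show the order in which the hooks are called.

```python
class ListPipeline:
    """Stand-in pipeline that keeps items in memory."""

    def open_spider(self, spider):
        self.rows = []  # runs once, before any item arrives

    def process_item(self, item, spider):
        self.rows.append(dict(item))  # runs once per yielded item
        return item

    def close_spider(self, spider):
        self.closed = True  # runs once, after the last item


def parse():
    # Steps 1 and 2: wrap each record and yield it
    for title in ['job_a', 'job_b']:
        yield {'title': title}


def run(pipeline):
    # Hypothetical driver standing in for the Scrapy engine
    pipeline.open_spider(spider=None)
    for item in parse():
        pipeline.process_item(item, spider=None)  # step 3
    pipeline.close_spider(spider=None)
    return pipeline


pipeline = run(ListPipeline())
print(pipeline.rows)
```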

Next, scrape job listings from BOSS Zhipin and persist them.

MySQL

# -*- coding: utf-8 -*-

import scrapy

from boosPro.items import BoosproItem

class BoosSpider(scrapy.Spider):

    name = 'boos'

    allowed_domains = ['www.xxx.com']

    start_urls = [

        'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']

    def parse(self, response):

        li_list = response.xpath('//div[@class="job-list"]/ul/li')

        for li in li_list:

            title = li.xpath('.//div[@class="info-primary"]/h3[@class="name"]/a/div/text()').extract_first()

            salary = li.xpath('.//div[@class="info-primary"]/h3[@class="name"]/a/span/text()').extract_first()

            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()

            # Instantiate an item object

            item = BoosproItem()

            # Store the parsed values in the item object: why?

            item['title'] = title

            item['salary'] = salary

            item['company'] = company

            # Submit the item object to the pipeline for persistence

            yield item

boos.py

import scrapy

class BoosproItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title = scrapy.Field()

    salary = scrapy.Field()

    company = scrapy.Field()

items.py
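
One practical answer to the "why?" in the spider comment: a scrapy.Item only accepts the fields declared on its class, so a mistyped key fails immediately instead of silently creating a new field. The sketch below imitates that behaviour with a plain dict subclass so that it runs without Scrapy; StrictItem and its fields attribute are hypothetical names, not Scrapy's implementation.

```python
class StrictItem(dict):
    # Declared fields, mirroring title / salary / company in BoosproItem
    fields = ('title', 'salary', 'company')

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front
        if key not in self.fields:
            raise KeyError(f'{type(self).__name__} does not support field: {key}')
        super().__setitem__(key, value)


item = StrictItem()
item['title'] = 'python crawler engineer'

try:
    item['titel'] = 'typo'  # misspelled field name
    rejected = False
except KeyError:
    rejected = True

print(dict(item), rejected)
```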

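# Note: pipelines are not enabled by default; they must be enabled manually in the settings file

# Steps for pipeline-based persistence:

# 1. Obtain the parsed values

# 2. Store the parsed values in an item object (the fields are declared in the item class)

# 3. Submit the item to the pipeline with the yield keyword

# 4. Write the persistence code in the pipeline file (process_item)

# 5. Enable the pipeline in the settings file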

import pymysql

class mysqlPipeLine(object):

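    conn = None  # database connection

    cursor = None  # cursor

    def open_spider(self, spider):

        # port=3306 is an assumption (the MySQL default); the original value was lost
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='', db='spider')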

        print(self.conn)

    def process_item(self, item, spider):

        self.cursor = self.conn.cursor()

        # Pass the values as parameters so the driver escapes them
        sql = 'insert into boss values(%s, %s, %s)'

        try:  # transaction: commit on success, roll back on failure

            self.cursor.execute(sql, (item['title'], item['salary'], item['company']))

            self.conn.commit()

        except Exception as e:

            print(e)

            self.conn.rollback()

        return item

    def close_spider(self, spider):

        self.cursor.close()

        self.conn.close()

pipelines.py
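
Formatting values directly into an SQL string breaks as soon as a value contains a quote character; passing the values as query parameters avoids that. The sketch below shows the commit-or-rollback pattern with sqlite3 standing in for MySQL so that it runs without a server. Note that sqlite3 uses ? placeholders where pymysql uses %s; the save helper and the sample item are hypothetical.

```python
import sqlite3

# In-memory database standing in for the MySQL "spider" database
conn = sqlite3.connect(':memory:')
conn.execute('create table boss (title text, salary text, company text)')


def save(conn, item):
    try:
        # Values are passed separately, so quotes in them need no manual escaping
        conn.execute('insert into boss values (?, ?, ?)',
                     (item['title'], item['salary'], item['company']))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


save(conn, {'title': 'He said "hi"', 'salary': '15k-25k', 'company': "O'Neil Ltd"})
rows = conn.execute('select title, salary, company from boss').fetchall()
print(rows)
```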

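# Disguise the client identity

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Do not obey the robots protocol

ROBOTSTXT_OBEY = False

# Enable the pipelines
# The priority values were lost in the original; 300 and 301 are assumed (lower values run first)

ITEM_PIPELINES = {

    'boosPro.pipelines.BoosproPipeline': 300,

    'boosPro.pipelines.mysqlPipeLine': 301,

}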

settings.py
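
Scrapy runs the enabled pipelines in ascending order of the integer assigned to each class (conventionally in the 0 to 1000 range). A minimal sketch of that ordering rule follows; the values 300 and 301 are assumptions, and execution_order is a hypothetical helper rather than a Scrapy function.

```python
# Dict order does not matter; only the integer priorities do
ITEM_PIPELINES = {
    'boosPro.pipelines.mysqlPipeLine': 301,
    'boosPro.pipelines.BoosproPipeline': 300,
}


def execution_order(pipelines):
    # Sort the class paths by their priority value, lowest first
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
print(order)
```

With these values the text-file pipeline sees each item before the MySQL pipeline does.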

Redis

Building on the code above, only the settings file and pipelines.py are modified.

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

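# Note: any operation related to persistent storage must be written in the pipeline file

# Pipeline file: receives the data submitted by the spider file and persists it (IO operations)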

class BoosproPipeline(object):

    fp = None

    # Runs only once (when the spider starts)

    def open_spider(self, spider):

        print('Spider started!')

        self.fp = open('./boos.txt', 'w', encoding='utf-8')

    # Called once each time the spider file submits an item

    def process_item(self, item, spider):

        self.fp.write(item['title'] + "\t" + item['salary'] + '\t' + item['company'] + '\n')

        return item

    def close_spider(self, spider):

        print('Spider finished!')

        self.fp.close()

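# Note: pipelines are not enabled by default; they must be enabled manually in the settings file

# Steps for pipeline-based persistence:

# 1. Obtain the parsed values

# 2. Store the parsed values in an item object (the fields are declared in the item class)

# 3. Submit the item to the pipeline with the yield keyword

# 4. Write the persistence code in the pipeline file (process_item)

# 5. Enable the pipeline in the settings file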

import pymysql

class mysqlPipeLine(object):

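    conn = None  # database connection

    cursor = None  # cursor

    def open_spider(self, spider):

        # port=3306 is an assumption (the MySQL default); the original value was lost
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='', db='spider')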

        print(self.conn)

    def process_item(self, item, spider):

        self.cursor = self.conn.cursor()

        # Pass the values as parameters so the driver escapes them
        sql = 'insert into boss values(%s, %s, %s)'

        try:  # transaction: commit on success, roll back on failure

            self.cursor.execute(sql, (item['title'], item['salary'], item['company']))

            self.conn.commit()

        except Exception as e:

            print(e)

            self.conn.rollback()

        return item

    def close_spider(self, spider):

        self.cursor.close()

        self.conn.close()

from redis import Redis

import json

class RedisPipeLine(object):

    conn = None

    def process_item(self, item, spider):

        dic = {

            'title': item['title'],

            'salary': item['salary'],

            'company': item['company']

        }

        self.conn.lpush('jobInfo', json.dumps(dic))

        return item

    def open_spider(self, spider):

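        # port=6379 is an assumption (the Redis default); the original value was lost
        self.conn = Redis(host='127.0.0.1', port=6379)

        print(self.conn)

# [Note] Make sure the process_item method of every pipeline class returns a value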

pipelines.py

# The priority values were lost in the original; 300 to 302 are assumed
ITEM_PIPELINES = {

    # 'boosPro.pipelines.BoosproPipeline': 300,

    # 'boosPro.pipelines.mysqlPipeLine': 301,

    'boosPro.pipelines.RedisPipeLine': 302,

}

settings.py

[Note] Make sure the process_item method of every pipeline class returns a value
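
The reason for this rule is that the value returned by one pipeline's process_item is what the next pipeline receives. The sketch below imitates that hand-off in plain Python; run_chain and the three classes are hypothetical stand-ins, not Scrapy code.

```python
class Uppercase:
    def process_item(self, item, spider):
        item['title'] = item['title'].upper()
        return item  # hands the item on to the next pipeline


class ForgetsReturn:
    def process_item(self, item, spider):
        item['seen'] = True  # no return statement, so the result is None


class Collect:
    def __init__(self):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_chain(item, pipelines):
    # Each pipeline receives whatever the previous one returned
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider=None)
    return item


good = Collect()
run_chain({'title': 'dev'}, [Uppercase(), good])

bad = Collect()
run_chain({'title': 'dev'}, [ForgetsReturn(), bad])

print(good.received, bad.received)
```

A later pipeline that then indexes into the value it received (for example item['title']) would fail on None, which is why every process_item should end with return item.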

Run the spider

Log in to redis-cli to view the data

lrange jobInfo 0 -1

# the type is list
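
Because the pipeline stores each record with lpush, every new entry goes to the head of the list, so reading the whole list returns the most recently pushed job first, and each entry is a JSON string that has to be decoded. The sketch below imitates this with an in-memory deque so that it runs without a Redis server; FakeRedisList and lrange_all are hypothetical names, not part of the redis package.

```python
import json
from collections import deque


class FakeRedisList:
    """In-memory stand-in for a single Redis list key."""

    def __init__(self):
        self.values = deque()

    def lpush(self, value):
        # Insert at the head of the list and return the new length
        self.values.appendleft(value)
        return len(self.values)

    def lrange_all(self):
        # Plays the role of reading the whole list from head to tail
        return list(self.values)


job_info = FakeRedisList()
for title in ['first', 'second']:
    job_info.lpush(json.dumps({'title': title}))

# Decode each stored JSON string back into a dict
decoded = [json.loads(raw) for raw in job_info.lrange_all()]
print(decoded)
```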
