Scrapy file pipeline

Install Scrapy

pip install scrapy

Create a new project

(python36) E:\www>scrapy startproject fileDownload

New Scrapy project 'fileDownload', using template directory 'c:\users\brady\.conda\envs\python36\lib\site-packages\scrapy\templates\project', created in:

    E:\www\fileDownload

You can start your first spider with:

    cd fileDownload

    scrapy genspider example example.com

(python36) E:\www>

Edit the spider to extract content

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from fileDownload.items import FiledownloadItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['https://www.pexels.com/photo/white-concrete-building-2559175/']

    rules = (
        # Follow every link containing /photo/ and pass the page to parse_item
        Rule(LinkExtractor(allow=r'/photo/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        # Collect the src of every <img> whose URL contains "photos"
        urls = response.xpath("//img[contains(@src,'photos')]/@src").extract()
        item = FiledownloadItem()
        try:
            # The file pipeline reads its download list from file_urls
            item['file_urls'] = urls
            # urls is a list, so format it; "str" + list raises TypeError
            # and would skip the yield below
            print("Scraped image list: %s" % urls)
            yield item
        except Exception as e:
            print(str(e))
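To see which links the rule above would follow, the allow pattern can be tried on its own. This is a minimal sketch that assumes LinkExtractor applies the pattern as a regular-expression search against each absolute URL; apart from the start URL, the candidate URLs are hypothetical.

```python
import re

# Same pattern as in the Rule above
allow = re.compile(r'/photo/')

# The first URL is the spider's start URL; the other two are made-up examples
candidates = [
    'https://www.pexels.com/photo/white-concrete-building-2559175/',
    'https://www.pexels.com/search/building/',
    'https://www.pexels.com/photo/',
]

# Keep only the URLs the pattern matches anywhere in the string
followed = [url for url in candidates if allow.search(url)]
print(followed)
```

Because the pattern is not anchored, any URL that contains /photo/ is followed, including listing pages, so a tighter pattern may be worth considering if only detail pages are wanted.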

Configure the item

import scrapy


class FiledownloadItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # URLs that the file pipeline should download
    file_urls = scrapy.Field()

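settings.py

Enable the file pipeline

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 2,  # file pipeline
}

FILES_STORE = ''  # storage path; must be non-empty, otherwise the pipeline stays disabled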
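The storage path is left empty above, so as a rough sketch, here is one way it could be filled in. The downloads directory name is an assumption for illustration, not something the original project specifies.

```python
from pathlib import Path

# Hypothetical location: a "downloads" folder under the directory
# from which the scrapy command is run
FILES_STORE = str(Path.cwd() / 'downloads')
print(FILES_STORE)
```

An absolute path avoids surprises when the crawl is started from a different working directory.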

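In the item

file_urls = scrapy.Field()

files = scrapy.Field()

In the spider, pass the download URLs to the pipeline through the file_urls field. The pipeline fills files with the download results.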

Override the file pipeline so files keep the image's original name

In pipelines.py, create your own file pipeline that inherits from FilesPipeline

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.files import FilesPipeline


class FiledownloadPipeline(object):
    def process_item(self, item, spider):
        # Strip the query string from every URL so the saved name stays clean
        item['file_urls'] = [url.split('?')[0] for url in item['file_urls']]
        print(item)
        return item


class MyFilesPipeline(FilesPipeline):
    # Newer Scrapy releases also pass item as a keyword-only argument
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last URL segment (the original file name) instead of the default hash
        file_name = request.url.split('/')[-1]
        print("Downloading image " + file_name)
        return 'full/%s' % file_name
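The two string operations used by these pipelines can be checked without running a crawl. The sketch below copies them into plain functions; the function names and the example URL are hypothetical and used only for illustration.

```python
def strip_query(url):
    """Drop everything from the first '?' onward, as FiledownloadPipeline does."""
    return url.split('?')[0]


def original_name_path(url):
    """Build the relative storage path the way MyFilesPipeline.file_path does."""
    return 'full/%s' % url.split('/')[-1]


# Made-up URL in the style of an image link with resize parameters
url = 'https://example.com/photos/2559175/sample-photo.jpeg?auto=compress&w=500'

clean = strip_query(url)
print(clean)                      # URL without the query string
print(original_name_path(clean))  # path built from the cleaned URL
print(original_name_path(url))    # path if the query string were left in
```

The last line shows why the query string is stripped first: without that step the query parameters would end up inside the saved file name.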

In settings.py, switch to your own file pipeline

ITEM_PIPELINES = {
    'fileDownload.pipelines.FiledownloadPipeline': 1,
    'fileDownload.pipelines.MyFilesPipeline': 2,
    # 'scrapy.pipelines.files.FilesPipeline': 2,
}
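The numbers in ITEM_PIPELINES decide the order: Scrapy passes each item through the pipelines from the lowest value to the highest. The sketch below only mimics that ordering with a plain sort; it is not Scrapy's own code.

```python
# Same entries as above, deliberately listed out of order
ITEM_PIPELINES = {
    'fileDownload.pipelines.MyFilesPipeline': 2,
    'fileDownload.pipelines.FiledownloadPipeline': 1,
}

# Sort by the priority value, lowest first
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda entry: entry[1])
]
print(run_order)
```

FiledownloadPipeline therefore cleans the URLs before MyFilesPipeline downloads them, which is why it needs the smaller number.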

Fetching an image set

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AngelSpider(CrawlSpider):
    name = 'angel'
    allowed_domains = ['angelimg.spbeen.com']
    start_urls = ['http://angelimg.spbeen.com/']
    base_url = "http://angelimg.spbeen.com"

    rules = (
        # Only the first page of each set matches; later pages are requested in parse_item
        Rule(LinkExtractor(allow=r'^http://angelimg\.spbeen\.com/ang/\d+$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Reuse the item carried over from the previous page, or start a new one
        item = response.meta.get('item')
        if not item:
            item = {'files': [], 'file_urls': []}

        print(response.url)
        img_url = response.xpath('.//div[@id="content"]/a/img/@src').extract_first()
        if img_url:
            item['file_urls'].append(img_url)

        # If there is a next page, request it; otherwise hand the item to the pipelines
        next_url = response.xpath('.//div[@class="page"]//a[contains(@class,"next")]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(next_url, callback=self.parse_item, meta={'item': item})
        else:
            print(item)
            yield item

    def parse_next_response(self, response):
        # Debug helper; not referenced by any rule or request
        item = response.meta.get('item')
        print(item, response.url)
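The spider above builds one item per image set by carrying the same item from page to page and yielding it only on the last page. The following sketch simulates that accumulation with a plain loop; the page table and its URLs are hypothetical stand-ins for the site, and in Scrapy the loop is replaced by chained requests that pass the item in meta.

```python
# Hypothetical three-page set: page URL -> (image URL, next page URL or None)
PAGES = {
    '/ang/1': ('/img/1-a.jpg', '/ang/1/2'),
    '/ang/1/2': ('/img/1-b.jpg', '/ang/1/3'),
    '/ang/1/3': ('/img/1-c.jpg', None),
}


def collect_set(start_url, pages):
    """Walk the pages of one set and gather every image URL into a single item."""
    item = {'files': [], 'file_urls': []}
    url = start_url
    while url is not None:
        img_url, next_url = pages[url]
        item['file_urls'].append(img_url)
        url = next_url  # None on the last page ends the walk
    return item


result = collect_set('/ang/1', PAGES)
print(result['file_urls'])
```

Yielding only at the end means the file pipeline receives the whole set in one item instead of one item per page.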

GitHub repository

https://github.com/brady-wang/spider-fileDownload
