如果基于scrapy进行图片数据的爬取

  1. 在爬虫文件中只需要解析提取出图片地址,然后将地址提交给管道
  2. 配置文件中:IMAGES_STORE = './imgsLib'
  3. 在管道文件中进行管道类的制定:
    • from scrapy.pipelines.images import ImagesPipeline
    • 将管道类的父类修改成ImagesPipeline
    • 重写父类的三个方法

# -*- coding: utf-8 -*-
import scrapy
from imgPro.items import ImgproItem class ImgSpider(scrapy.Spider):
name = 'img'
# allowed_domains = ['www.xxx.com']
start_urls = ['http://www.521609.com/daxuemeinv/']
url = 'http://www.521609.com/daxuemeinv/list8%d.html'
pageNum = 1
def parse(self, response):
li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
for li in li_list:
img_src = 'http://www.521609.com'+li.xpath('./a[1]/img/@src').extract_first()
item = ImgproItem()
item['src'] = img_src yield item # if self.pageNum < 3:
# self.pageNum += 1
# new_url = format(self.url%self.pageNum)
# yield scrapy.Request(new_url,callback=self.parse)

img.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html import scrapy class ImgproItem(scrapy.Item):
# define the fields for your item here like:
src = scrapy.Field()
# pass

items.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html from scrapy.pipelines.images import ImagesPipeline
import scrapy
# class ImgproPipeline(object):
# def process_item(self, item, spider):
# return item
class ImgproPipeline(ImagesPipeline): #对某一个媒体资源进行请求发送
#item就是接收到的spider提交过来的item
def get_media_requests(self, item, info):
yield scrapy.Request(item['src']) #制定媒体数据存储的名称
def file_path(self, request, response=None, info=None):
name = request.url.split('/')[-1]
print('正在下载:',name)
return name #将item传递给下一个即将给执行的管道类
def item_completed(self, results, item, info):
return item

pipelines.py

# -*- coding: utf-8 -*-

# Scrapy settings for imgPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'imgPro' SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders' USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'imgPro (+http://www.yourdomain.com)' # Obey robots.txt rules
ROBOTSTXT_OBEY = False LOG_LEVEL = 'ERROR'
# LOG_FILE = './log.txt' IMAGES_STORE = './imgsLib'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
# COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#} # Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'imgPro.middlewares.ImgproSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'imgPro.middlewares.ImgproDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'imgPro.pipelines.ImgproPipeline': 300,
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py

scrapy框架爬取图片并将图片保存到本地的更多相关文章

  1. 使用scrapy框架爬取图片网全站图片(二十多万张),并打包成exe可执行文件

    目标网站:https://www.mn52.com/ 本文代码已上传至git和百度网盘,链接分享在文末 网站概览 目标,使用scrapy框架抓取全部图片并分类保存到本地. 1.创建scrapy项目 s ...

  2. python爬虫---scrapy框架爬取图片,scrapy手动发送请求,发送post请求,提升爬取效率,请求传参(meta),五大核心组件,中间件

    # settings 配置 UA USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, l ...

  3. 爬虫 Scrapy框架 爬取图虫图片并下载

    items.py,根据需求确定自己的数据要求 # -*- coding: utf-8 -*- # Define here the models for your scraped items # # S ...

  4. Python多线程爬图&Scrapy框架爬图

    一.背景 对于日常Python爬虫由于效率问题,本次测试使用多线程和Scrapy框架来实现抓取斗图啦表情.由于IO操作不使用CPU,对于IO密集(磁盘IO/网络IO/人机交互IO)型适合用多线程,对于 ...

  5. 使用scrapy框架爬取自己的博文(2)

    之前写了一篇用scrapy框架爬取自己博文的博客,后来发现对于中文的处理一直有问题- - 显示的时候 [u'python\u4e0b\u722c\u67d0\u4e2a\u7f51\u9875\u76 ...

  6. php 获取远程图片保存到本地

    php 获取远程图片保存到本地 使用两个函数 1.获取远程文件 2.把图片保存到本地 /** * 获取远程图片并把它保存到本地 * $url 是远程图片的完整URL地址,不能为空. */ functi ...

  7. iOS 将图片保存到本地

    //将图片保存到本地 + (void)SaveImageToLocal:(UIImage*)image Keys:(NSString*)key {     NSUserDefaults* prefer ...

  8. iOS-iOS调用相机调用相册【将图片保存到本地相册】

    设置头部代理 <UINavigationControllerDelegate, UIImagePickerControllerDelegate> 1.调用相机 检测前置摄像头是否可用 - ...

  9. Android View转为图片保存为本地文件,异步监听回调操作结果;

    把手机上的一个View或ViewGroup转为Bitmap,再把Bitmap保存为.png格式的图片: 由于View转Bitmap.和Bitmap转图片都是耗时操作,(生成一个1M的图片大约500ms ...

随机推荐

  1. 3.Work Queues

    标题 : 3.Work Queues 目录 : RabbitMQ 序号 : 3 var channel1 = _connection.CreateModel(); channel1.BasicQos( ...

  2. C++ inline与operator

    title: C++ inline与operator date: 2020-03-10 categories: c++ tags: [c++] inline修饰符,operator关键字 1.inli ...

  3. hdu 6242 Geometry Problem

    Geometry Problem Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 262144/262144 K (Java/Other ...

  4. c# App.xaml

    随着wpf自动创建的,是项目的起始点..Net先再App里找,找到了window然后开启window,项目真正的起始点是在App里. 这两个 (App 的xaml和cs文件)和MainWindow 的 ...

  5. 操作系统 part3

    1.操作系统四特性 并发:一个时间段,多个进程在宏观上同时运行 共享:系统中的资源可以被多个并发进程共同使用(互斥共享,同时共享) 虚拟:利用多道程序设计,利用时分复用(分时系统)和空分复用(虚拟内存 ...

  6. spring-cloud-netflix-eureka-server

    一.构建springcloud父pom工程,管理版本 pom.xml <?xml version="1.0" encoding="UTF-8"?> ...

  7. springboot( 三)redis demo

    redis介绍 Redis是目前业界使用最广泛的内存数据存储.相比memcached,Redis支持更丰富的数据结构,例如hashes, lists, sets等,同时支持数据持久化.除此之外,Red ...

  8. Gym 101170I Iron and Coal(BFS + 思维)题解

    题意:有一个有向图,有些点是煤,有些点是铁,但不会同时有铁和煤.现在我要从1出发,占领可以到达的点.问最少占领几个点能同时拥有一个煤和一个铁(1不用占领). 思路:思路很秀啊.我们从1往外bfs,得到 ...

  9. Volatile如何保证线程可见性之总线锁、缓存一致性协议

    基础知识回顾 下图给出了假想机的基本设计.中央处理单元(CPU)是进行算术和逻辑操作的部件,包含了有限数量的存储位置--寄存器(register),一个高频时钟.一个控制单元和一个算术逻辑单元. 时钟 ...

  10. when I was installing github for windows ,some errors occurred !

    1: 2: 3: 4: install.log error messages: