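Downloading images with Scrapy into your own directory, creating thumbnails, and saving the records to a database

Environment and tools: Python 2.7, Scrapy

Test site: http://www.XXXX.com/tag/333.html. The goal is to crawl all the bunny-girl images; the recommended posts listed below them have to be filtered out.

Logic: analyze the site, enable ITEM_PIPELINES for downloading the images and writing to the database, enable the thumbnail setting, then move the images into their own directory.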

-----settings.py

## Do not obey robots.txt
ROBOTSTXT_OBEY = False

## Download delay, in seconds
DOWNLOAD_DELAY = 3

## Disable cookies
COOKIES_ENABLED = False

## Enable ITEM_PIPELINES
ITEM_PIPELINES = {
    'MyPicSpider.pipelines.MyImagesPipeline': 300,
    'MyPicSpider.pipelines.MysqlPipeline': 400,
}

## Storage path
IMAGES_STORE = 'G:\\www\\scrapy_rpo\\pic\\meinv\\rabbit\\'

## Filter out images smaller than this
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

## Thumbnails
IMAGES_THUMBS = {
    'big': (270, 270),
}
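
With these settings the images pipeline writes each original under full/ and each thumbnail under thumbs/big/, both relative to IMAGES_STORE, and names the file after the SHA1 hash of the image URL. The sketch below only illustrates that naming, assuming the default ImagesPipeline behaviour is not overridden; the helper names and the example URL are made up for illustration.

```python
import hashlib


def default_image_path(url):
    # Assumption: mirrors the default naming, full/<sha1 of the URL>.jpg
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % guid


def default_thumb_path(url, thumb_id):
    # thumb_id is the key used in IMAGES_THUMBS, 'big' in this project
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'thumbs/%s/%s.jpg' % (thumb_id, guid)


url = 'http://www.example.com/image.jpg'  # hypothetical image URL
print(default_image_path(url))
print(default_thumb_path(url, 'big'))
```

This is why the pipeline further down can find the thumbnail by reusing the file name of the full-size image.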

------items.py

import scrapy

class PicspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    tag = scrapy.Field()
    image_urls = scrapy.Field()
    images_data = scrapy.Field()
    img_path = scrapy.Field()
    img_big_path = scrapy.Field()
    file_path = scrapy.Field()

----pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import datetime
import os
import shutil

import pymysql
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
# Import the project settings
from scrapy.utils.project import get_project_settings

#conn = pymysql.Connection(host="localhost", user="root", passwd="root", db='test', charset="UTF8")
#cursor = conn.cursor()

class MyImagesPipeline(ImagesPipeline):
    # Read the image download path from the project settings
    img_store = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        '''Yield one request per image URL.'''
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x["path"] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # Directory where the images of this tag are saved
        file_path = item['file_path']
        if not os.path.exists(file_path):
            os.makedirs(file_path)
        print(image_paths)
        ## each path looks like full/80dd7db02e4da4e63f05d9d49c1092fc7fdcb43e.jpg
        pic_list = []
        for v in image_paths:
            # File name created by Scrapy, and the names used for the thumbnails
            pic_name = v.replace('full/', '')
            base_name = os.path.splitext(pic_name)[0]
            pic_small_name = base_name + '_s.jpg'
            pic_big_name = base_name + '_b.jpg'
            # Move the files from the default download path to the target path
            # Move the full-size image
            shutil.move(os.path.join(self.img_store, 'full', pic_name), os.path.join(file_path, pic_name))
            # Move the thumbnail
            #shutil.move(os.path.join(self.img_store, 'thumbs', 'small', pic_name), os.path.join(file_path, pic_small_name))
            shutil.move(os.path.join(self.img_store, 'thumbs', 'big', pic_name), os.path.join(file_path, pic_big_name))
            #img_path_dict['img_path'] = file_path + "\\" + pic_name
            #img_path_dict['img_small_path'] = file_path + "\\" + pic_small_name
            #img_path_dict['img_big_path'] = file_path + "\\" + pic_big_name
            # Relative paths stored in the database: (image, thumbnail)
            img_path_dict = ('picture/meinv/rabbit/' + item['tag'] + "/" + pic_name, 'picture/meinv/rabbit/' + item['tag'] + "/" + pic_big_name)
            pic_list.append(img_path_dict)
        item["img_path"] = pic_list
        return item
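
The move step in item_completed can be tried outside Scrapy. The following is a minimal sketch, assuming the full/ and thumbs/big/ layout described above; the function name move_image_and_thumb, the temporary directory and the file abc.jpg are all made up for the demo.

```python
import os
import shutil
import tempfile


def move_image_and_thumb(store, rel_path, target_dir, thumb_id='big'):
    """Move store/full/<name> and store/thumbs/<thumb_id>/<name> into target_dir."""
    pic_name = os.path.basename(rel_path)
    base, ext = os.path.splitext(pic_name)
    thumb_name = base + '_b' + ext
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.move(os.path.join(store, 'full', pic_name),
                os.path.join(target_dir, pic_name))
    shutil.move(os.path.join(store, 'thumbs', thumb_id, pic_name),
                os.path.join(target_dir, thumb_name))
    return pic_name, thumb_name


# Demo: build a fake IMAGES_STORE in a temporary directory
store = tempfile.mkdtemp()
for parts in (('full',), ('thumbs', 'big')):
    folder = os.path.join(store, *parts)
    os.makedirs(folder)
    with open(os.path.join(folder, 'abc.jpg'), 'wb') as f:
        f.write(b'fake image bytes')

names = move_image_and_thumb(store, 'full/abc.jpg', os.path.join(store, 'tag'))
print(names)
```

Building the paths with os.path.join avoids hard-coding the "\\" separator, so the same code also works on systems other than Windows.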

## Save to the database
class MysqlPipeline(object):
    def __init__(self):
        self.conn = pymysql.Connection(host="localhost", user="root", passwd="root", db='test1', charset="UTF8")
        # Create the cursor
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        ### Assemble the rows to insert
        rows = []
        datetime_str = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        ## Add the type (tag); values are passed as parameters so the driver escapes them
        result = self.cursor.execute("select id from network_type where RESOURCETYPE='p' and TYPENAME=%s", (item['tag'],))
        if result == 0:
            self.cursor.execute("insert into network_type(PID,RESOURCETYPE,TYPENAME)values(%s,%s,%s) ", (2415, 'p', item['tag']))
            typeid = self.cursor.lastrowid
            self.conn.commit()
        else:
            #tag_id = self.cursor.fetchall()
            #typeid = tag_id[0][0]
            # The tag is already stored, so skip this item
            raise DropItem("Tag already stored: %s" % item['tag'])
        types = ',' + str(typeid) + ','
        #print(item['img_path'])
        # Next free ID: the largest existing ID plus one
        self.cursor.execute('select  id from network_picture order by cast(id as SIGNED INTEGER) desc limit 0,1')
        old_id = self.cursor.fetchone()
        if old_id:
            id_n = int(old_id[0]) + 1
        else:
            id_n = 1
        for v in item['img_path']:
            path1 = v[0]
            path2 = v[1]
            self.cursor.execute('select  id from network_picture where FILEPATH=%s and fileScalPath=%s', (path1, path2))
            data = self.cursor.fetchone()
            if data:
                print(u'This record already exists')
            else:
                # Only queue rows that are not in the table yet
                rows.append((str(id_n), '', path1, '', types, 0, datetime_str, path2))
                id_n += 1
        print(rows)
        if rows:
            self.cursor.executemany("insert into network_picture(ID,NAME,FILEPATH,FILESIZE,TYPES,STATUS,DATETIME,fileScalPath)values(%s,%s,%s,%s,%s,%s,%s,%s)", rows)
            self.conn.commit()
        return item
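
The "check first, insert only if missing" logic of this pipeline can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module as a stand-in for pymysql and uses a simplified, assumed version of the network_picture table; the helper name insert_if_missing is hypothetical. Note that sqlite3 uses ? as its placeholder, while pymysql uses %s.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL server (assumption for the demo)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
# Simplified table: only the columns needed to show the duplicate check
cur.execute('create table network_picture '
            '(ID integer primary key, FILEPATH text, fileScalPath text)')


def insert_if_missing(cur, path, thumb_path):
    """Insert one row unless the same pair of paths is already stored."""
    cur.execute('select ID from network_picture '
                'where FILEPATH=? and fileScalPath=?', (path, thumb_path))
    if cur.fetchone():
        return False
    cur.execute('insert into network_picture(FILEPATH, fileScalPath) '
                'values (?, ?)', (path, thumb_path))
    return True


first = insert_if_missing(cur, 'a/1.jpg', 'a/1_b.jpg')
second = insert_if_missing(cur, 'a/1.jpg', 'a/1_b.jpg')
conn.commit()
print(first, second)
```

Passing the values as parameters instead of formatting them into the SQL string keeps quotes in a tag or path from breaking the statement.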

----spider.py

# -*- coding: utf-8 -*-
import os
import time
import urllib2

import pymysql
import scrapy
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor   ## LinkExtractor filters the links to extract and follow; it has many more features worth looking up
from scrapy.spiders import CrawlSpider, Rule     ## CrawlSpider is the spider template; Rule defines the crawl rules
from MyPicSpider.items import PicspiderItem      ## the item defined in items.py
# Import the project settings
from scrapy.utils.project import get_project_settings

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
conn = pymysql.Connection(host="localhost", user="root", passwd="root", db='test1', charset="UTF8")
# Create the cursor
cursor = conn.cursor()

class PicSpider(CrawlSpider):    ## inherits the CrawlSpider template; a plain spider inherits Spider
    name = 'pic'     ### spider name; run with: $ scrapy crawl pic
    allowed_domains = ['www.xxxx.com']    ## restrict the crawl to this domain
    start_urls = ['http://www.xxxx.com/tag/333.html']   ### start URL

    #### With follow=True the spider keeps following links found on the matched pages.
    #### How it works: on the start page the spider looks for both the detail-page URLs of the posts and the next
    #### pagination page; it follows the pagination page and repeats the search there, while the detail pages are
    #### scraped by the callback.
    rules = (
        ### Crawl the index pages and follow their links:
        ### find every pagination page reachable from start_urls
        Rule(LinkExtractor(allow=r'/tag/[0-9]*_[0-9]*\.html'), follow=True),
        ### Crawl the item pages and pass each response to parse_item:
        #### these are the detail pages linked from every pagination page
        Rule(LinkExtractor(allow=r'http://www\.xxxx\.com/ent/[a-z]*/[0-9]*/[0-9]*\.html'), callback='parse_item', follow=False),
    )

    #### Callback for the detail pages
    def parse_item(self, response):
        start_url = response.url
        item = PicspiderItem()
        tag_name = response.xpath('//h1[@class="articleV4Tit"]/text()').extract()[0]
        # cursor.execute(u'select id from network_type  where PID=258 AND TYPENAME="{0}" limit 0,1'.format(tag_name))
        # old_id = cursor.fetchone()
        # if old_id:
        #     exit()
        name = u'兔'  # the character for "rabbit"; only titles containing it are kept
        if name not in tag_name:
            print(u'----This belongs to another category----')
            return
        li_list = response.xpath('//ul[@class="articleV4Page l"]/li').extract()
        srcs = []
        for v in range(1, (len(li_list) - 3)):
            if v == 1:
                url_s = start_url
            else:
                url_s = start_url.replace('.html', '') + '_' + str(v) + '.html'
            try:
                request = urllib2.Request(url_s, headers=headers)
                html = urllib2.urlopen(request, timeout=200).read()
            except urllib2.URLError as err:
                print('Failed to fetch %s: %s' % (url_s, err))
                continue
            obj = BeautifulSoup(html, 'html.parser')
            pic_url = None
            try:
                pic_url = obj.find('center').find('img')['src']
            except (AttributeError, TypeError, KeyError):
                print(u'----First extraction method failed----')
                try:
                    pic_url = obj.find('div', {'id': 'picBody'}).find('img')['src']
                except (AttributeError, TypeError, KeyError):
                    print(u'----Second extraction method failed----')
                    try:
                        pic_url = obj.find('p', attrs={"style": "text-align: center"}).find('img')['src']
                    except (AttributeError, TypeError, KeyError):
                        print(u'----Third extraction method failed----')
            # Skip the page if none of the three methods found an image
            if pic_url:
                srcs.append(pic_url)
        item['tag'] = tag_name
        item['file_path'] = '%s%s' % (get_project_settings().get('IMAGES_STORE'), tag_name)
        item['image_urls'] = srcs
        return item
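
The gallery pagination used in parse_item follows a simple pattern: page 1 is the detail URL itself and page n is the same URL with _n inserted before .html. A minimal sketch of that pattern, with a hypothetical helper name page_urls and a made-up detail URL:

```python
def page_urls(first_url, page_count):
    """Return the URLs of a gallery: page 1 is first_url, page n is <base>_n.html."""
    suffix = '.html'
    # Strip only the trailing suffix, not every occurrence of '.html'
    if first_url.endswith(suffix):
        base = first_url[:-len(suffix)]
    else:
        base = first_url
    urls = [first_url]
    for n in range(2, page_count + 1):
        urls.append('%s_%d%s' % (base, n, suffix))
    return urls


# Hypothetical detail page with three gallery pages
for u in page_urls('http://www.xxxx.com/ent/a/1/2.html', 3):
    print(u)
```

With such a helper the URLs could also be handed back to Scrapy as requests instead of being fetched with urllib2, which would keep the downloads inside Scrapy's scheduler; that variant is not shown here.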

------I don't fully understand how Scrapy handles de-duplication yet. If anyone knows, please tell this beginner. Thanks.
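
On de-duplication: by default Scrapy's scheduler filters out requests whose fingerprint it has already seen during the current crawl, unless a request is created with dont_filter=True. The sketch below only illustrates the idea with a set of hashes. The class SeenFilter is hypothetical and much simpler than Scrapy's own filter, and pages fetched directly with urllib2 never pass through Scrapy's filter at all.

```python
import hashlib


class SeenFilter(object):
    """Hypothetical, simplified duplicate filter; not Scrapy's actual class."""

    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, method, url):
        # Assumption: method plus URL is enough to identify a request here
        data = ('%s %s' % (method.upper(), url)).encode('utf-8')
        return hashlib.sha1(data).hexdigest()

    def seen(self, method, url):
        """Return True if this request was seen before, else remember it."""
        fp = self.fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupe_filter = SeenFilter()
print(dupe_filter.seen('GET', 'http://www.xxxx.com/tag/333.html'))
print(dupe_filter.seen('get', 'http://www.xxxx.com/tag/333.html'))
```

This only covers duplicates within one run; avoiding repeated rows across runs is what the database checks in MysqlPipeline are for.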
