Scrapy Crawler Framework, Case Study 3: Image Downloader

items.py

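import scrapy

class CoserItem(scrapy.Item):
    url = scrapy.Field()         # URL of the post page
    name = scrapy.Field()        # post title
    info = scrapy.Field()        # post metadata text
    image_urls = scrapy.Field()  # URLs of the images to download
    images = scrapy.Field()      # local paths of the downloaded images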

spiders/coser.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader import ItemLoader  # scrapy.contrib.loader is the old, pre-1.0 location

from Cosplay.items import CoserItem


class CoserSpider(scrapy.Spider):
    name = "coser"
    allowed_domains = ["bcy.net"]
    start_urls = (
        'http://bcy.net/cn125101',
        'http://bcy.net/cn126487',
        'http://bcy.net/cn126173',
    )

    def parse(self, response):
        # Collect the link to each post on the listing page.
        links = response.xpath(
            "//ul[@class='js-articles l-works']/li[@class='l-work--big']"
            "/article[@class='work work--second-created']"
            "/h2[@class='work__title']/a/@href").extract()
        for link in links:
            # The hrefs are site-relative, so resolve them against the page URL.
            yield scrapy.Request(response.urljoin(link), callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=CoserItem(), response=response)
        l.add_xpath('name', "//h1[@class='js-post-title']/text()")
        l.add_xpath('info', "//div[@class='post__info']/div[@class='post__type post__info-group']/span/text()")
        # Remove the '/w650' segment (presumably a resized-variant marker) from each image URL.
        urls = l.get_xpath('//img[@class="detail_std detail_clickable"]/@src')
        urls = [url.replace('/w650', '') for url in urls]
        l.add_value('image_urls', urls)
        l.add_value('url', response.url)
        return l.load_item()
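The two URL manipulations in the spider (turning a site-relative link into an absolute one, and dropping the `/w650` segment from an image URL) can be tried in isolation. The following is a minimal sketch using only the standard library and Python 3 syntax; the helper names `absolute_link` and `full_size_url` and the sample URLs are hypothetical, chosen for illustration, and are not taken from the site.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(page_url, href)


def full_size_url(image_url, size_marker="/w650"):
    """Return the URL with the size marker removed, as the spider does."""
    return image_url.replace(size_marker, "")


# Hypothetical inputs, for illustration only.
post_url = absolute_link("http://bcy.net/cn125101", "/coser/detail/1/2")
image_url = full_size_url("http://img.example.com/coser/1/post/abc/pic.jpg/w650")
print(post_url)
print(image_url)
```

Using `urljoin` rather than string formatting also copes with hrefs that are already absolute, which is one reason to prefer it over concatenating a fixed prefix.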

pipelines.py

import os

import requests

from Cosplay import settings


class ImageDownloadPipeline(object):

    def process_item(self, item, spider):
        if 'image_urls' in item:
            images = []
            # Store images under IMAGES_STORE/<spider name>/.
            dir_path = os.path.join(settings.IMAGES_STORE, spider.name)
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            for image_url in item['image_urls']:
                # Build a flat file name from the URL path segments.
                us = image_url.split('/')[3:]
                image_file_name = '_'.join(us)
                file_path = os.path.join(dir_path, image_file_name)
                images.append(file_path)
                # Skip images that were already downloaded.
                if os.path.exists(file_path):
                    continue
                # Send the request before opening the file, so a failed request
                # does not leave an empty file that later runs would skip.
                response = requests.get(image_url, stream=True)
                with open(file_path, 'wb') as handle:
                    for block in response.iter_content(1024):
                        if not block:
                            break
                        handle.write(block)
            item['images'] = images
        return item
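The pipeline derives each local file name by dropping the scheme and host from the image URL and joining the remaining path segments with underscores. A minimal standalone sketch of that step follows; the function name `local_image_path` and the sample URL are hypothetical, used only to show the transformation.

```python
import os


def local_image_path(image_url, store_dir):
    """Map an image URL to a flat file path inside store_dir."""
    # 'http://host/a/b/c.jpg'.split('/') gives ['http:', '', 'host', 'a', 'b', 'c.jpg'],
    # so [3:] keeps only the path segments after the host.
    parts = image_url.split('/')[3:]
    return os.path.join(store_dir, '_'.join(parts))


# Hypothetical URL, for illustration only.
path = local_image_path(
    "http://img.example.com/coser/123/post/abc/pic.jpg",
    os.path.join("Images", "coser"),
)
print(os.path.basename(path))  # coser_123_post_abc_pic.jpg
```

Flattening the path this way keeps every image in one directory while still making name collisions unlikely, since the full URL path is encoded in the file name.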

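settings.py

# Register the custom download pipeline.
ITEM_PIPELINES = {'Cosplay.pipelines.ImageDownloadPipeline': 1}
# Root directory for downloaded images; the pipeline adds one subdirectory per spider.
IMAGES_STORE = '../Images'
DOWNLOAD_DELAY = 0.25    # 250 ms between requests made through Scrapy's downloader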
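As an alternative to the hand-written pipeline, Scrapy ships a built-in `ImagesPipeline` that by default reads URLs from an item's `image_urls` field and writes its results to `images`, the same field names `CoserItem` already defines. The fragment below is a sketch of what the settings could look like in that case, not the configuration used in this case study. It assumes a Scrapy version that provides `scrapy.pipelines.images` and that Pillow is installed.

```python
# settings.py (alternative sketch): let Scrapy's built-in pipeline download the images.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# Directory where ImagesPipeline stores the downloaded files.
IMAGES_STORE = '../Images'

# Image requests then go through Scrapy's downloader, so this delay applies to them too.
DOWNLOAD_DELAY = 0.25
```

With this setup the file names are chosen by Scrapy rather than derived from the URL path, and the downloads are scheduled by Scrapy instead of blocking inside `process_item`.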

Create a main.py file in the project root directory for debugging

from scrapy import cmdline

cmdline.execute('scrapy crawl coser'.split())

Run the program

py2 main.py

