Scrapy notes: Splash in practice
1. Reference
https://github.com/scrapy-plugins/scrapy-splash#configuration
Treat this as the authoritative source.
Scrapy notes: Splash installation — a JavaScript rendering service
- Start the Docker Quickstart Terminal
- Connect with PuTTY to the IP below, port 22, username/password: docker/tcuser
- Start the service:
- sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
- Open in a browser: http://192.168.99.100:8050/

Terminal banner on startup:
docker is configured to use the default machine with IP 192.168.99.100
For help getting started, check out the docs at https://docs.docker.com
Start interactive shell
win7@win7-PC MINGW64 ~
$
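Before wiring Splash into Scrapy, it helps to confirm the service answers. A minimal sketch that builds a request URL for Splash's render.html endpoint (the IP is the default-machine address shown above; adjust it if your container runs elsewhere, e.g. http://localhost:8050):

```python
from urllib.parse import urlencode

# Assumption: the Docker Toolbox default-machine IP from the banner above.
SPLASH_URL = "http://192.168.99.100:8050"

def render_html_url(page_url, wait=0.5):
    """Build a request URL for Splash's render.html endpoint."""
    return "{}/render.html?{}".format(SPLASH_URL, urlencode({"url": page_url, "wait": wait}))

print(render_html_url("https://www.cnblogs.com/"))
# Smoke test once the container is up (fetches the rendered HTML):
#   import urllib.request
#   html = urllib.request.urlopen(render_html_url("https://www.cnblogs.com/")).read()
```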
2. Practice
2.1 After creating a new project, edit settings.py: set ROBOTSTXT_OBEY to False and add the following:
'''https://github.com/scrapy-plugins/scrapy-splash#configuration'''

# 1. Add the Splash server address to settings.py of your Scrapy project like this:
SPLASH_URL = 'http://192.168.99.100:8050'

# 2. Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py
# file and changing HttpCompressionMiddleware priority:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
# HttpCompressionMiddleware priority should be changed in order to allow advanced response
# processing; see https://github.com/scrapy/scrapy/issues/1895 for details.

# 3. Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# This middleware is needed to support the cache_args feature; it saves disk space by not
# storing duplicate Splash arguments multiple times in a disk request queue. If Splash 2.1+
# is used, the middleware also saves network traffic by not sending these duplicate
# arguments to the Splash server multiple times.

# 4. Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# 5. If you use the Scrapy HTTP cache then a custom cache storage backend is required.
# scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# If you use another cache storage then it is necessary to subclass it and replace all
# scrapy.util.request.request_fingerprint calls with scrapy_splash.splash_request_fingerprint.

# Note: steps (4) and (5) are necessary because Scrapy doesn't provide a way to override the
# request fingerprint calculation algorithm globally; this could change in the future.

# There are also some additional options available. Put them into your settings.py if you
# want to change the defaults:
# - SPLASH_COOKIES_DEBUG is False by default. Set to True to enable debugging cookies in the
#   SplashCookiesMiddleware. This option is similar to COOKIES_DEBUG for the built-in Scrapy
#   cookies middleware: it logs sent and received cookies for all requests.
# - SPLASH_LOG_400 is True by default - it instructs to log all 400 errors from Splash. They
#   are important because they show errors that occurred when executing the Splash script.
#   Set it to False to disable this logging.
# - SPLASH_SLOT_POLICY is scrapy_splash.SlotPolicy.PER_DOMAIN by default. It specifies how
#   concurrency & politeness are maintained for Splash requests, and specifies the default
#   value for the slot_policy argument of SplashRequest, which is described below.
2.2 Write a basic spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.shell import inspect_response
import base64
from PIL import Image
from io import BytesIO


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        inspect_response(response, self)  ########################
Debugging note: view(response) opens a .txt file; save it as .html instead and open it in a browser.
2.3 Write a screenshot spider
Also see https://stackoverflow.com/questions/45172260/scrapy-splash-screenshots
    def start_requests(self):
        splash_args = {
            'html': 1,
            'png': 1,
            # 'width': 1024,    # default viewport is 1024x768 (4:3)
            # 'render_all': 1,  # full-page screenshot; omit it to capture only the first
            #                   # screen. Requires a non-zero 'wait', otherwise Splash errors.
            # 'wait': 0.5,
        }
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.json', args=splash_args)
http://splash.readthedocs.io/en/latest/api.html?highlight=wait#render-png
render_all=1 requires non-zero wait parameter. This is an unfortunate restriction, but it seems that this is the only way to make rendering work reliably with render_all=1.
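The restriction quoted above can be made explicit with a tiny guard before building the request (a hypothetical helper for illustration; Splash itself enforces this server-side with a 400 error):

```python
def check_splash_args(args):
    """Reject render_all=1 without a non-zero wait, mirroring Splash's own restriction."""
    if args.get("render_all") and not args.get("wait"):
        raise ValueError("render_all=1 requires a non-zero 'wait' argument")
    return args

check_splash_args({"html": 1, "png": 1, "render_all": 1, "wait": 0.5})  # fine
# check_splash_args({"png": 1, "render_all": 1})  # raises ValueError
```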
https://github.com/scrapy-plugins/scrapy-splash#responses
Responses
scrapy-splash returns Response subclasses for Splash requests:
- SplashResponse is returned for binary Splash responses - e.g. for /render.png responses;
- SplashTextResponse is returned when the result is text - e.g. for /render.html responses;
- SplashJsonResponse is returned when the result is a JSON object - e.g. for /render.json responses or /execute responses when script returns a Lua table.
SplashJsonResponse provides extra features:
- The response.data attribute contains response data decoded from JSON; you can access it like response.data['html'].
Show the image and save it to a file:
    def parse(self, response):
        # In [6]: response.data.keys()
        # Out[6]: [u'title', u'url', u'geometry', u'html', u'png', u'requestedUrl']
        imgdata = base64.b64decode(response.data['png'])
        img = Image.open(BytesIO(imgdata))
        img.show()
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
        inspect_response(response, self)  ########################
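The decode-and-save logic in parse() can be exercised without a live Splash by stubbing response.data with the keys from the In [6] output above (the html and png values here are fabricated placeholders, not real Splash output):

```python
import base64

# Stub standing in for SplashJsonResponse.data from render.json; a real response
# also carries 'title', 'url', 'geometry' and 'requestedUrl'.
fake_png = b"\x89PNG\r\n\x1a\n" + b"not-a-real-image"
data = {
    "html": "<html><body>stub</body></html>",
    "png": base64.b64encode(fake_png).decode("ascii"),  # Splash base64-encodes the PNG
}

imgdata = base64.b64decode(data["png"])
assert imgdata.startswith(b"\x89PNG")  # the PNG magic bytes survive the round trip
with open("some_image.png", "wb") as f:
    f.write(imgdata)
```

Only the Image.open/img.show() step needs real image bytes; everything else is plain base64 and file I/O.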