30. Integrating Selenium with Scrapy
Scrapy + Selenium integration (reference code: https://github.com/Python3WebSpider/ScrapySeleniumTest). This post simply takes the code from the book and runs it as a reference; there are still quite a few parts I don't fully understand. It is also my first time writing scraped data into MongoDB — I have always used MySQL and have little experience with a document (non-relational) database like MongoDB — so today the goal is just to run the code and digest the author's overall approach.

Core idea: integrate Selenium through a Downloader Middleware. The middleware's process_request renders the page in the browser and returns an HtmlResponse; when a downloader middleware returns a Response, Scrapy skips its own downloader and delivers that rendered HTML to the spider's callback.

Drawback: this approach is blocking, so it breaks Scrapy's asynchronous processing and hurts crawl speed (the book also presents a non-blocking alternative, Scrapy + Splash).
taobao.py

# -*- coding: utf-8 -*-
from scrapy import Request, Spider
from urllib.parse import quote
# quote URL-encodes a single string (here: the search keyword)
from scrapyseleniumtest.items import ProductItem


class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']
    base_url = 'https://s.taobao.com/search?q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                # every page shares the same URL: the page number is passed via meta,
                # and dont_filter=True disables deduplication so the requests are not dropped
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)

    def parse(self, response):
        products = response.xpath(
            '//div[@id="mainsrp-itemlist"]//div[@class="items"][1]//div[contains(@class, "item")]')
        for product in products:
            item = ProductItem()
            item['price'] = ''.join(product.xpath('.//div[contains(@class, "price")]//text()').extract()).strip()
            item['title'] = ''.join(product.xpath('.//div[contains(@class, "title")]//text()').extract()).strip()
            item['shop'] = ''.join(product.xpath('.//div[contains(@class, "shop")]//text()').extract()).strip()
            item['image'] = ''.join(product.xpath('.//div[@class="pic"]//img[contains(@class, "img")]/@data-src').extract()).strip()
            item['deal'] = product.xpath('.//div[contains(@class, "deal-cnt")]//text()').extract_first()
            item['location'] = product.xpath('.//div[contains(@class, "location")]//text()').extract_first()
            yield item
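As a quick aside on the quote import above, here is a standalone check (not part of the project) of how the keyword is URL-encoded before being appended to base_url:

from urllib.parse import quote

base_url = 'https://s.taobao.com/search?q='
print(base_url + quote('iPad'))       # https://s.taobao.com/search?q=iPad
print(base_url + quote('平板电脑'))    # https://s.taobao.com/search?q=%E5%B9%B3%E6%9D%BF%E7%94%B5%E8%84%91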
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


class ProductItem(Item):
    # collection is not a Field; the pipeline reads it as the MongoDB collection name
    collection = 'products'

    image = Field()
    price = Field()
    deal = Field()
    title = Field()
    shop = Field()
    location = Field()
middlewares.py (the key part of the integration)

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from logging import getLogger


class SeleniumMiddleware():
    def __init__(self, timeout=None, service_args=[]):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.PhantomJS(service_args=service_args)
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    def process_request(self, request, spider):
        """
        Fetch and render the page with PhantomJS
        :param request: the Request object
        :param spider: the Spider object
        :return: HtmlResponse
        """
        self.logger.debug('PhantomJS is Starting')
        page = request.meta.get('page', 1)
        try:
            self.browser.get(request.url)
            if page > 1:
                # jump to the requested page via the pager input at the bottom of the results
                input = self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
                submit = self.wait.until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
                input.clear()
                input.send_keys(page)
                submit.click()
            # wait until the active page number matches and the item list has loaded
            self.wait.until(
                EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   service_args=crawler.settings.get('PHANTOMJS_SERVICE_ARGS'))
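One practical note: PhantomJS is no longer maintained and its driver has been removed from recent Selenium releases, so webdriver.PhantomJS may not exist on a current install. A minimal sketch of swapping in headless Chrome instead, assuming chromedriver is installed and on PATH (only __init__ changes; process_request, __del__ and from_crawler stay the same):

from logging import getLogger

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware():
    def __init__(self, timeout=None, service_args=[]):
        # service_args is kept for signature compatibility but not used by Chrome
        self.logger = getLogger(__name__)
        self.timeout = timeout
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')          # run Chrome without a visible window
        self.browser = webdriver.Chrome(options=options)
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)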
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'), mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection ('products') is the target MongoDB collection
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
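A small caveat when running this against a newer PyMongo: Collection.insert() was deprecated in PyMongo 3 and removed in PyMongo 4, so on a current install process_item would need the newer call, roughly:

    def process_item(self, item, spider):
        # insert_one replaces the deprecated (and later removed) insert()
        self.db[item.collection].insert_one(dict(item))
        return item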
settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for scrapyseleniumtest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyseleniumtest'

SPIDER_MODULES = ['scrapyseleniumtest.spiders']
NEWSPIDER_MODULE = 'scrapyseleniumtest.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'scrapyseleniumtest (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'scrapyseleniumtest.middlewares.ScrapyseleniumtestSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapyseleniumtest.pipelines.MongoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Project-specific settings used by the spider, the Selenium middleware and the MongoDB pipeline
KEYWORDS = ['iPad']
MAX_PAGE = 100
SELENIUM_TIMEOUT = 20
PHANTOMJS_SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']
MONGO_URI = 'localhost'
MONGO_DB = 'taobao'
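With everything above in place, the crawl is started from the project root with `scrapy crawl taobao`. Afterwards the stored documents can be inspected from a Python shell; a quick check, assuming MongoDB is running locally and at least one item has been scraped:

import pymongo

client = pymongo.MongoClient('localhost')
collection = client['taobao']['products']   # MONGO_DB / ProductItem.collection
print(collection.find_one())                # one sample product document
client.close()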
Specifying the PhantomJS executable explicitly (when it is not on PATH):

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path=r"E:\Soft\soft\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get('http://news.sohu.com/scroll/')
print(driver.find_element_by_class_name('title').text)
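Note that find_element_by_class_name was removed in Selenium 4 (which also no longer supports PhantomJS), so on a newer environment the same quick test would look roughly like this with headless Chrome — a sketch, assuming chromedriver is on PATH and the page still uses the 'title' class:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://news.sohu.com/scroll/')
print(driver.find_element(By.CLASS_NAME, 'title').text)
driver.quit()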
Running the code and writing the results to the database gives the output shown below:

[Screenshot: the scraped product documents stored in the MongoDB 'products' collection]