Crawling Dynamic Pages with Scrapy

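Most websites today serve dynamic pages. Part of the content of a dynamic page is generated by the browser as it runs the JavaScript embedded in the page, which makes such pages relatively hard to scrape.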

Let's start with a very simple dynamic page. Open http://quotes.toscrape.com/js in a browser and it displays a list of famous quotes.

The page contains ten quotes in total, each wrapped in a <div class="quote"> element. Now let's try to scrape the quotes from the Scrapy shell:

$ scrapy shell http://quotes.toscrape.com/js/
...
>>> response.css('div.quote')
[]

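The empty result shows that the scrape failed: no <div class="quote"> elements containing quotes were found in the page. These <div class="quote"> elements are dynamic content. The page downloaded from the server does not contain them (which is why the scrape failed); they are only generated after the browser executes a piece of JavaScript in the page.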
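
The empty result can be reproduced offline with a minimal sketch that uses only Python's standard library. The HTML string below is a shortened, hypothetical stand-in for the server's response (not the real page), and QuoteDivCounter is an illustrative name. It counts the <div class="quote"> start tags that exist in the raw markup, which is all Scrapy sees, since Scrapy does not execute JavaScript:

```python
from html.parser import HTMLParser

# Shortened stand-in for the HTML the server returns (an assumption for
# illustration; the real page embeds a much longer data array).
RAW_HTML = """
<html><body>
<script>
var data = [{"text": "sample quote", "author": {"name": "Someone"}, "tags": ["demo"]}];
for (var i in data) {
    document.write("<div class='quote'>" + data[i]['text'] + "</div>");
}
</script>
</body></html>
"""


class QuoteDivCounter(HTMLParser):
    """Counts <div class="quote"> start tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Script bodies are treated as plain text by the parser, so markup
        # that only appears inside a JavaScript string is never seen here.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


counter = QuoteDivCounter()
counter.feed(RAW_HTML)
counter.close()
print(counter.count)            # quote divs present in the raw HTML
print("var data" in RAW_HTML)   # the data exists only inside the script
```

The markup for each quote appears only as a string inside the script, so a parser that does not run the script finds no matching elements.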

The JavaScript code in question is as follows:

var data = [
    {
        "tags": [
            "change",
            "deep-thoughts",
            "thinking",
            "world"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
    },
    {
        "tags": [
            "abilities",
            "choices"
        ],
        "author": {
            "name": "J.K. Rowling",
            "goodreads_link": "/author/show/1077326.J_K_Rowling",
            "slug": "J-K-Rowling"
        },
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
    },
    {
        "tags": [
            "inspirational",
            "life",
            "live",
            "miracle",
            "miracles"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
    },
    {
        "tags": [
            "aliteracy",
            "books",
            "classic",
            "humor"
        ],
        "author": {
            "name": "Jane Austen",
            "goodreads_link": "/author/show/1265.Jane_Austen",
            "slug": "Jane-Austen"
        },
        "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
    },
    {
        "tags": [
            "be-yourself",
            "inspirational"
        ],
        "author": {
            "name": "Marilyn Monroe",
            "goodreads_link": "/author/show/82952.Marilyn_Monroe",
            "slug": "Marilyn-Monroe"
        },
        "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
    },
    {
        "tags": [
            "adulthood",
            "success",
            "value"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"
    },
    {
        "tags": [
            "life",
            "love"
        ],
        "author": {
            "name": "Andr\u00e9 Gide",
            "goodreads_link": "/author/show/7617.Andr_Gide",
            "slug": "Andre-Gide"
        },
        "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
    },
    {
        "tags": [
            "edison",
            "failure",
            "inspirational",
            "paraphrased"
        ],
        "author": {
            "name": "Thomas A. Edison",
            "goodreads_link": "/author/show/3091287.Thomas_A_Edison",
            "slug": "Thomas-A-Edison"
        },
        "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
    },
    {
        "tags": [
            "misattributed-eleanor-roosevelt"
        ],
        "author": {
            "name": "Eleanor Roosevelt",
            "goodreads_link": "/author/show/44566.Eleanor_Roosevelt",
            "slug": "Eleanor-Roosevelt"
        },
        "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
    },
    {
        "tags": [
            "humor",
            "obvious",
            "simile"
        ],
        "author": {
            "name": "Steve Martin",
            "goodreads_link": "/author/show/7103.Steve_Martin",
            "slug": "Steve-Martin"
        },
        "text": "\u201cA day without sunshine is like, you know, night.\u201d"
    }
];

for (var i in data) {
    var d = data[i];
    var tags = $.map(d['tags'], function(t) {
        return "<a class='tag'>" + t + "</a>";
    }).join(" ");
    document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}

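Reading the code reveals how the page content is generated dynamically. All of the quote data is stored in the array data, and the final for loop iterates over each entry in data, using document.write to generate the <div class="quote"> element for each quote.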
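
Because the data sits in the page source as a literal, one possible way to scrape this particular page without running any JavaScript is to cut the array out of the script text and parse it as JSON. The following is only a sketch under stated assumptions: PAGE_SOURCE is a shortened stand-in for the downloaded page, extract_quotes is a hypothetical helper name, and the regular expression assumes the array is valid JSON assigned as var data = [...]; as in the code above.

```python
import json
import re

# Shortened stand-in for the downloaded page source (assumption: only the
# first two entries of the data array are reproduced here).
PAGE_SOURCE = r"""
<script>
var data = [
    {
        "tags": ["change", "deep-thoughts", "thinking", "world"],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
    },
    {
        "tags": ["abilities", "choices"],
        "author": {
            "name": "J.K. Rowling",
            "goodreads_link": "/author/show/1077326.J_K_Rowling",
            "slug": "J-K-Rowling"
        },
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
    }
];
for (var i in data) { /* document.write(...) */ }
</script>
"""


def extract_quotes(page_source):
    """Return the quotes embedded in the page's `var data = [...];` literal."""
    # Capture everything from the opening "[" up to the first "];".
    match = re.search(r"var\s+data\s*=\s*(\[.*?\]);", page_source, re.DOTALL)
    if match is None:
        return []
    quotes = []
    for entry in json.loads(match.group(1)):
        quotes.append({
            "text": entry["text"],
            "author": entry["author"]["name"],
            "tags": entry["tags"],
        })
    return quotes


quotes = extract_quotes(PAGE_SOURCE)
for quote in quotes:
    print(quote["author"], "-", ", ".join(quote["tags"]))
```

In the Scrapy shell or a spider callback, the same helper could be called as extract_quotes(response.text). This shortcut only applies when the data is hard-coded in the page source, as it is here.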

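The above is one of the simplest examples of a dynamic page: the data is hard-coded in the JavaScript. In practice it is more common for JavaScript to fetch data by exchanging HTTP requests with the website (AJAX) and then use that data to update the HTML page. To scrape such pages, the page first has to be rendered with a JavaScript rendering engine, and only then can it be scraped.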
