13.CrawlSpider类爬虫

1.CrawlSpider介绍

Scrapy框架中分两类爬虫，Spider类和CrawlSpider类。
此案例采用的是CrawlSpider类实现爬虫。

它是Spider的派生类，Spider类的设计原则是只爬取start_url列表中的网页，而CrawlSpider类定义了一些规则(rule)来提供跟进link的方便的机制，从爬取的网页中获取link并继续爬取的工作更适合。

创建项目指令：

scrapy startproject baidu

模版创建：

scrapy genspider -t crawl baidu 'tieba.baidu.com'

CrawlSpider继承于Spider类，除了继承过来的属性外（name、allow_domains），还提供了新的属性和方法:

LinkExtractors

class scrapy.linkextractors.LinkExtractor

Link Extractors 的目的很简单: 提取链接｡
每个LinkExtractor有唯一的公共方法是 extract_links()，它接收一个 Response 对象，并返回一个 scrapy.link.Link 对象。
Link Extractors要实例化一次，并且 extract_links 方法会根据不同的 response 调用多次提取链接｡

主要参数：

            allow：满足括号中“正则表达式”的值会被提取，如果为空，则全部匹配。

            deny：与这个正则表达式(或正则表达式列表)不匹配的URL一定不提取。

            allow_domains：会被提取的链接的domains。

            deny_domains：一定不会被提取链接的domains。

            restrict_xpaths：使用xpath表达式，和allow共同作用过滤链接。

rules

在rules中包含一个或多个Rule对象，每个Rule对爬取网站的动作定义了特定操作。如果多个rule匹配了相同的链接，则根据规则在本集合中被定义的顺序，第一个会被使用。

参数介绍：
link_extractor：是一个Link Extractor对象，用于定义需要提取的链接

callback： 从link_extractor中每获取到链接时，参数所指定的值作为回调函数，该回调函数接受一个response作为其第一个参数。

    注意：当编写爬虫规则时，避免使用parse作为回调函数。由于CrawlSpider使用parse方法来实现其逻辑，如果覆盖了 parse方法，crawl spider将会运行失败。

    follow：是一个布尔(boolean)值，指定了根据该规则从response提取的链接是否需要跟进。 如果callback为None，follow 默认设置为True ，否则默认为False。

    process_links：指定该spider中哪个的函数将会被调用，从link_extractor中获取到链接列表时将会调用该函数。该方法主要用来过滤。

    process_request：指定该spider中哪个的函数将会被调用， 该规则提取到每个request时都会调用该函数。 (用来过滤request)

2.创建案例

a.开始一个项目

scrapy startproject wxapp

b.创建模板

scrapy genspider -t crawl wxapp_spider "wxapp-union.com"

c.settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']

NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'wxapp (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

DEFAULT_REQUEST_HEADERS = {

  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

  'Accept-Language': 'en',

    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'wxapp.middlewares.WxappSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

   'wxapp.pipelines.WxappPipeline': 300,

}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

wxapp_spider.py

# -*- coding: utf-8 -*-

import scrapy

from scrapy.linkextractors import LinkExtractor

from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem

class WxappSpiderSpider(CrawlSpider):

    name = 'wxapp_spider'

    allowed_domains = ['wxapp-union.com']

    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (

        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),

        Rule(LinkExtractor(allow=r'.+article-.+\.html'),callback="parse_detail",follow=False)

    )

    def parse_detail(self, response):

        title=response.xpath('//h1[@class="ph"]/text()').get()

        author_p=response.xpath('//p[@class="authors"]')

        author=author_p.xpath('.//a/text()').get()

        pub_time=author_p.xpath('.//span/text()').get()

        content=response.xpath('//td[@id="article_content"]//text()').getall()

        item=WxappItem(title=title,author=author,pub_time=pub_time,content=content)

        yield item

        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()

        #i['name'] = response.xpath('//div[@id="name"]').extract()

        #i['description'] = response.xpath('//div[@id="description"]').extract()

items.py

import scrapy

class WxappItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title=scrapy.Field()

    author=scrapy.Field()

    pub_time=scrapy.Field()

    content=scrapy.Field()

pipelines.py

from scrapy.exporters import JsonLinesItemExporter

class WxappPipeline(object):

    def __init__(self):

        self.fp=open("wxapp.json","wb")

        self.exporter=JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding="utf-8")

    def process_item(self, item, spider):

        self.exporter.export_item(item)

        return item

    def close_spider(self,spider):

        self.fp.close()

在wxapp目录下创建：start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())

执行start.py即可

13.CrawlSpider类爬虫的更多相关文章

Scrapy框架——CrawlSpider类爬虫案例
Scrapy--CrawlSpider Scrapy框架中分两类爬虫,Spider类和CrawlSpider类. 此案例采用的是CrawlSpider类实现爬虫. 它是Spider的派生类,Spide ...
Scrapy的Spider类和CrawlSpider类
Scrapy shell 用来调试Scrapy 项目代码的命令行工具,启动的时候预定义了Scrapy的一些对象设置 shell Scrapy 的shell是基于运行环境中的python 解释器sh ...
python爬虫入门（八）Scrapy框架之CrawlSpider类
CrawlSpider类通过下面的命令可以快速创建 CrawlSpider模板的代码: scrapy genspider -t crawl tencent tencent.com CrawSpid ...
爬虫笔记之刷小怪练级：yymp3爬虫（音乐类爬虫）
一.目标爬取http://www.yymp3.com网站歌曲相关信息,包括歌曲名字.作者相关信息.歌曲的音频数据.歌曲的歌词数据. 二.分析 2.1 歌曲信息.歌曲音频数据下载地址的获取随便打开一 ...
scrapy的CrawlSpider类
了解CrawlSpider 踏实爬取一般网站的常用spider,其中定义了一些规则(rule)来提供跟进link的方便机制,也许该spider不适合你的目标网站,但是对于大多数情况是可以使用的.因此, ...
scrapy项目4：爬取当当网中机器学习的数据及价格（CrawlSpider类）
scrapy项目3中已经对网页规律作出解析,这里用crawlspider类对其内容进行爬取: 项目结构与项目3中相同如下图,唯一不同的为book.py文件 crawlspider类的爬虫文件book的 ...
C++ primer plus读书笔记——第13章类继承
第13章类继承 1. 如果购买厂商的C库,除非厂商提供库函数的源代码,否则您将无法根据自己的需求,对函数进行扩展或修改.但如果是类库,只要其提供了类方法的头文件和编译后的代码,仍可以使用库中的类派生 ...
【Spider】使用CrawlSpider进行爬虫时，无法爬取数据，运行后很快结束，但没有报错
在学习<python爬虫开发与项目实践>的时候有一个关于CrawlSpider的例子,当我在运行时发现,没有爬取到任何数据,以下是我敲的源代码:import scrapyfrom UseS ...
scrapy框架用CrawlSpider类爬取电影天堂.
本文使用CrawlSpider方法爬取电影天堂网站内国内电影分类下的所有电影的名称和下载地址 CrawlSpider其实就是Spider的一个子类. CrawlSpider功能更加强大(链接提取器,规 ...

随机推荐

Linux网络基础-总
目录 Linux网络基础一.网卡和数据包的转发 1.收包流程二.多网卡bonding 三.SR-IOV 四.DPDK 五.TUN/TAP 六.Linux bridge 和VLAN 七.TCP/IP ...
canvas路径剪切和判断是否在路径内
1.剪切路径 clip() var ctx=mycanvas.getContext('2d'); ctx.beginPath(); // 建一个矩形路径 ctx.moveTo(20,10) ctx.l ...
MYSQL主从复制制作配置方案
1. 主从复制机器配置操作系统:centos7 x64 基于vagrant下的virtual box的虚拟机两台 master ip:192.168.21.11, slave ip 192.168. ...
为什么分布式一定要有redis?
为什么分布式一定要有redis? 孤独烟架构师小秘圈昨天作者:孤独烟来自:http://rjzheng.cnblogs.com/ 1.为什么使用redis 分析:博主觉得在项目中使用red ...
Lisp经典算法
求平方根 SUCCESSIVE AVERAGING DUE TO HERON OF ALEXANDRIA ** TO FIND AN APPROXIMATION TO SQRT(X) ** MAKR ...
js小结
1,浏览器对json支持的方法: JSON.parse(jsonstr);将string转为json的对象. JSON.stringify(jsonobj);将json对象转为string. 2,js ...
BigInteger与BigDecimal
BigInteger与BigDecimal Java大数字运算(BigInteger类和BigDecimal类) 在 Java 中提供了用于大数字运算的类,即 java.math.BigInteger ...
【洛谷P1303 A*B Problem】
题目描述求两数的积. 输入输出格式输入格式: 两行,两个数. 输出格式: 积输入输出样例输入样例#1: 1 2 输出样例#1: 2 说明每个数字不超过10^2000,需用高精 emm,显然本 ...
正向选择（positive selection）、中性选择（neutral selection）、平衡选择（balancing selection）示意图
正向选择:某一位点逐渐积累,成优势的位点,具体表现为:随着时间延长,该位点的突变allele频率越来越高,远远超过野生型allele: 中性选择:随着时间的延长,总体频率没有改变太多: 平衡选择:位点 ...
C++基础知识－Day5
今天主要讲的是类的扩展 1.类成员函数的存储方式首先我们介绍类成员函数的存储方式,C++引入面向对象的概念之后,C语言中的一些比如static/const等原有语义,作一些升级,此时既要保持兼容,还 ...

13.CrawlSpider类爬虫

rules

13.CrawlSpider类爬虫的更多相关文章

随机推荐

热门专题