Introduction to CrawlSpider

  CrawlSpider is a subclass of Spider, so it inherits all of Spider's methods and adds its own, which makes crawls more efficient and concise. Its most notable feature is the LinkExtractor (link extractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls, whereas CrawlSpider is better suited to extracting URLs from pages and continuing the crawl from them.
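  To see what a link extractor does in isolation, here is a minimal sketch (it assumes response is a page object Scrapy has already fetched, e.g. inside a spider callback, and the /frim/ pattern is just a placeholder):

from scrapy.linkextractors import LinkExtractor

# Collect every link on the page whose URL matches the pattern.
# extract_links() returns Link objects carrying the absolute URL
# and the anchor text.
le = LinkExtractor(allow=r'/frim/')
for link in le.extract_links(response):
    print(link.url, link.text)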

Using CrawlSpider

  1. Create a Scrapy project:

scrapy startproject projectName

  2. Create the spider file:

scrapy genspider -t crawl SpiderName www.xxx.com
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

The spider file code generated by the command above

LinkExtractor (link extractor): extracts links according to a specified rule (a regular expression).

  Rule (rule parser): takes the links harvested by the link extractor, sends requests for them, and then parses the fetched page data with the specified rule (the callback).
   Each link extractor is paired with exactly one rule parser.
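  One detail that matters when writing rules: in a Rule, follow defaults to True when no callback is given and to False when one is. A small sketch (the patterns are placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # No callback: follow defaults to True, so these pages are used
    # only to discover more links.
    Rule(LinkExtractor(allow=r'/category/')),
    # With a callback: follow defaults to False, so matched pages are
    # parsed but their own links are not extracted further.
    Rule(LinkExtractor(allow=r'/detail-\d+\.html'), callback='parse_item'),
)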

Crawl the movie titles and actor names from the whole 4567tv.tv site and persist them:

spider/4567tv.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlProject.items import CrawlprojectItem

# Pagination pages look like "/frim/index1-2.html"
class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']
    # Link extractors: the rule is a regular expression; an empty pattern
    # matches every link on the page.
    link = LinkExtractor(allow=r'/frim/index1-\d+\.html')
    link1 = LinkExtractor(allow=r'/movie/index\d+\.html')  # the original was missing the backslash in \d+
    # rules: one Rule object per extraction rule. A Rule sends requests for
    # the links its extractor found and parses the responses with its callback.
    rules = (
        Rule(link, callback='parse_item', follow=True),  # follow=True: keep crawling every matched page
        Rule(link1, callback='parse_detail'),
    )

    def parse_item(self, response):
        first_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in first_list:
            title = li.xpath('./div/a/@title').extract_first()
            name = li.xpath('./div/div/p/text()').extract_first()
            item = CrawlprojectItem()
            item["title"] = title
            item["name"] = name
            yield item

    def parse_detail(self, response):
        # The second Rule names this callback but the original post never
        # defined it; this stub keeps the spider from erroring at runtime.
        pass

    # CrawlSpider's crawl flow:
    """The spider first fetches the pages at the start URLs.
    The link extractor pulls the links matching its rule out of that page content.
    The rule parser sends requests for those links and parses the resulting
    pages with the specified callback.
    The parsed data is packed into items and handed to the pipeline for
    persistent storage.
    """

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlprojectItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    name = scrapy.Field()
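A scrapy.Item behaves like a dict restricted to its declared fields, as this hypothetical snippet shows:

item = CrawlprojectItem(title='some movie', name='some actor')
print(item['title'])  # fields are read like dict keys
item['year'] = 2019   # raises KeyError: 'year' is not a declared field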

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlprojectPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print("Spider started!!!")
        self.fp = open("./movies.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item["title"] + ":" + item["name"] + "\n")
        return item

    def close_spider(self, spider):
        print("Spider finished!!!")
        self.fp.close()
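  Two things worth noting about this pipeline: open_spider and close_spider each run exactly once per crawl, so the file is opened a single time rather than once per item, and process_item must return the item so that any lower-priority pipelines still receive it. Also, extract_first() in the spider returns None when an XPath matches nothing, which would make the string concatenation here raise a TypeError; defaulting those fields to empty strings guards against that. The project's settings.py follows.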
# -*- coding: utf-8 -*-

# Scrapy settings for crawlProject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'crawlProject'

SPIDER_MODULES = ['crawlProject.spiders']
NEWSPIDER_MODULE = 'crawlProject.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawlProject (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'crawlProject.pipelines.CrawlprojectPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py
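  Only a few lines differ from the generated defaults: a browser User-Agent, ROBOTSTXT_OBEY = False, LOG_LEVEL = "ERROR" to silence everything but errors, and the ITEM_PIPELINES entry that activates CrawlprojectPipeline at priority 300 (lower numbers run first).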
