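14 - The Scrapy Framework (CrawlSpider)
Introduction to CrawlSpider
CrawlSpider is a subclass of Spider, so it inherits all of Spider's methods and adds its own, which makes link-following crawls more concise. Its most notable feature is the link extractor (LinkExtractor). Spider is the base class of all spiders, and by design it only crawls the pages listed in start_urls. CrawlSpider is a better fit when you need to extract URLs from the pages you fetch and keep crawling them.
Using CrawlSpider
1. Create a Scrapy project:
scrapy startproject projectName
2. Create the spider file:
scrapy genspider -t crawl SpiderName www.xxx.com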
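The difference between the two crawling styles can be sketched without Scrapy at all. The snippet below is only an illustration under stated assumptions: SITE is a made-up in-memory site, and crawl_start_urls_only and crawl_following_links are hypothetical helper names, not Scrapy APIs.

```python
import re
from collections import deque

# Hypothetical in-memory site: URL path -> links found on that page.
SITE = {
    "/frim/index1.html": ["/frim/index1-2.html", "/about.html"],
    "/frim/index1-2.html": ["/frim/index1-3.html", "/frim/index1.html"],
    "/frim/index1-3.html": ["/frim/index1-2.html"],
    "/about.html": [],
}


def crawl_start_urls_only(start_urls):
    """Visit only the listed pages and never follow a link."""
    return [url for url in start_urls if url in SITE]


def crawl_following_links(start_urls, allow):
    """Visit the start pages, then keep following every link matching `allow`."""
    pattern = re.compile(allow)
    queue = deque(start_urls)
    seen = set(start_urls)  # avoid requesting the same page twice
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in SITE.get(url, []):
            if pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


start = ["/frim/index1.html"]
only_start = crawl_start_urls_only(start)
followed = crawl_following_links(start, r"/frim/index1-\d+\.html")
print(only_start)
print(followed)
```

The second function reaches the paginated pages that the first one never sees, which is roughly the job CrawlSpider takes over once a rule is declared.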
The generated spider file:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
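LinkExtractor (link extractor): extracts links from a response according to the given rule (a regular expression).
Rule (rule parser): sends requests for the links that the link extractor found, then parses the returned pages with the specified callback.
Each link extractor is paired with exactly one Rule.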
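To make the link-extraction step concrete, here is a minimal standard-library sketch of the idea: collect every anchor in a page, resolve it against the page URL, and keep the ones that match a regular expression. It is not Scrapy's implementation; AnchorCollector, extract_links and SAMPLE_HTML are hypothetical names and data made up for the illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, allow):
    """Return absolute, de-duplicated links whose URL matches `allow`."""
    collector = AnchorCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    seen, links = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if pattern.search(url) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical page content, made up for this illustration.
SAMPLE_HTML = """
<a href="/frim/index1-2.html">page 2</a>
<a href="/frim/index1-3.html">page 3</a>
<a href="/frim/index1-2.html">page 2 again</a>
<a href="/about.html">about</a>
"""

links = extract_links(
    SAMPLE_HTML,
    "https://www.4567tv.tv/frim/index1.html",
    r"/frim/index1-\d+\.html",
)
print(links)
```

Only the two pagination links survive: the duplicate is dropped and /about.html does not match the pattern. A Rule would then request each surviving link and hand the response to its callback.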
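Example: crawl the movie titles and actor names across the whole 4567tv.tv site and persist them:
spiders/4567tv.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlProject.items import CrawlprojectItem


# Example of a pagination link: "/frim/index1-2.html"
class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']
    # Link extractor; `allow` is a regular expression.
    # If `allow` is empty, every link matches.
    link = LinkExtractor(allow=r'/frim/index1-\d+\.html')
    link1 = LinkExtractor(allow=r'/movie/index\d+\.html')
    # rules: a tuple of Rule objects, one per extraction rule.
    # A Rule requests the links its extractor found and passes each
    # response to the given callback.
    rules = (
        # follow=True: keep applying the rules to pages reached through this rule
        Rule(link, callback='parse_item', follow=True),
        # parse_detail is not shown in this post; it must be defined
        # on the spider for this rule to produce any items.
        Rule(link1, callback='parse_detail'),
    )

    def parse_item(self, response):
        first_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for url in first_list:
            title = url.xpath('./div/a/@title').extract_first()
            name = url.xpath('./div/div/p/text()').extract_first()
            item = CrawlprojectItem()
            item["title"] = title
            item["name"] = name
            yield item


# How a CrawlSpider crawl proceeds:
"""
1. The spider fetches the page content of each start URL.
2. The link extractors pull links out of the page content from step 1 according to their extraction rules.
3. Each Rule requests the extracted links and parses the returned pages with its callback.
4. The parsed data is packed into items and handed to the pipeline for persistent storage.
"""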
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlprojectItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    name = scrapy.Field()
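pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlprojectPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        print("Spider started!")
        self.fp = open("./movies.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # extract_first() returns None when nothing matches,
        # so fall back to an empty string before writing.
        title = item["title"] or ""
        name = item["name"] or ""
        self.fp.write(title + ":" + name + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file.
        print("Spider finished!")
        self.fp.close()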
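The three pipeline hooks can be exercised without running a crawl, which is a quick way to check the file format. This is only a sketch: MoviePipeline is a hypothetical stand-in with a configurable output path (an assumption, the pipeline above always writes ./movies.txt), the items are plain dicts with made-up values, and spider=None replaces the real spider object.

```python
import os
import tempfile


class MoviePipeline:
    """Hypothetical stand-in with the same three hooks as the pipeline above."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Missing or None fields are written as empty strings.
        title = item.get("title") or ""
        name = item.get("name") or ""
        self.fp.write(f"{title}:{name}\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


# Drive the hooks in the order Scrapy calls them: open, process each item, close.
out_path = os.path.join(tempfile.mkdtemp(), "movies.txt")
pipeline = MoviePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Movie A", "name": "Actor X"}, spider=None)
pipeline.process_item({"title": "Movie B", "name": None}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fh:
    written = fh.read()
print(written)
```

Each item becomes one "title:name" line, and an item whose name was not found still produces a line instead of raising a TypeError on string concatenation.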
settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for crawlProject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'crawlProject'

SPIDER_MODULES = ['crawlProject.spiders']
NEWSPIDER_MODULE = 'crawlProject.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawlProject (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'crawlProject.pipelines.CrawlprojectPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'