14 - The Scrapy Framework (CrawlSpider)

Introduction to CrawlSpider

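CrawlSpider is a subclass of Spider, so it inherits all of Spider's methods and adds its own, which makes link-following crawls more concise. Its most notable feature is the link extractor (LinkExtractor). Spider is the base class of all spiders; by default it only requests the pages listed in start_urls, and following further links means building the requests by hand. CrawlSpider is better suited to extracting URLs from the pages it fetches and continuing the crawl from them.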

Using CrawlSpider

1. Create a Scrapy project:

scrapy startproject projectName

2. Create the spider file:

scrapy genspider -t crawl SpiderName www.xxx.com

The generated spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

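LinkExtractor (link extractor): extracts links from a page according to the given rule (a regular expression).

Rule (rule parser): sends requests for the links that the link extractor found, then parses the returned pages with the specified callback. Each link extractor belongs to exactly one Rule.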
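To make the idea of an allow pattern concrete, here is a minimal standard-library sketch of its effect: collect every href on a page, resolve it against the page URL, and keep only the URLs that match the regular expression. It is not Scrapy's implementation. The helper name `extract_links` and the sample HTML are invented for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, allow):
    """Return absolute, de-duplicated links whose URL matches `allow`.

    Hypothetical helper: it mimics the effect of LinkExtractor(allow=...),
    not its actual code.
    """
    collector = _HrefCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    links = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # relative href -> absolute URL
        if pattern.search(url) and url not in links:
            links.append(url)
    return links


# Invented sample page: two pagination links (one repeated) and one other link.
SAMPLE_HTML = """
<a href="/frim/index1-2.html">2</a>
<a href="/frim/index1-3.html">3</a>
<a href="/frim/index1-2.html">next</a>
<a href="/about.html">about</a>
"""

links = extract_links(
    SAMPLE_HTML,
    "https://www.4567tv.tv/frim/index1.html",
    r"/frim/index1-\d+\.html",
)
print(links)
```

Only the two distinct pagination URLs survive: the repeated link is dropped and `/about.html` does not match the pattern.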

Example: crawl the movie titles and actor names across the whole 4567tv.tv site and persist them:

spiders/4567tv.py:

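# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from crawlProject.items import CrawlprojectItem


# Pagination links look like "/frim/index1-2.html".
class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    # Link extractor: `allow` is a regular expression matched against each URL.
    # If `allow` is empty, every link matches.
    link = LinkExtractor(allow=r'/frim/index1-\d+\.html')
    # The original pattern read 'indexd+', which matches a literal "d";
    # '\d+' (one or more digits) is presumably what was meant.
    link1 = LinkExtractor(allow=r'/movie/index\d+\.html')

    # rules: a tuple of Rule objects, one per extraction rule.
    # Rule: requests the links found by its link extractor and passes each
    # response to the named callback.
    rules = (
        # follow=True applies the rules again to every page reached through
        # this rule, so all pagination pages are eventually visited.
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_detail'),
    )

    def parse_item(self, response):
        first_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in first_list:
            title = li.xpath('./div/a/@title').extract_first()
            name = li.xpath('./div/div/p/text()').extract_first()
            item = CrawlprojectItem()
            item["title"] = title
            item["name"] = name
            yield item

    def parse_detail(self, response):
        # Placeholder: the second rule names this callback, but the original
        # post does not define it. Detail-page parsing would go here.
        pass


# CrawlSpider crawl flow:
# a. The spider requests the start URL and receives that page's content.
# b. The link extractors pull the links that match their patterns out of
#    the page from step a.
# c. Each Rule requests the links its extractor found and parses the
#    returned pages with its callback.
# d. The parsed data is wrapped in an item and handed to the pipeline for
#    persistent storage.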

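The crawl flow described above can be sketched without Scrapy. This is a simplified model, not how Scrapy schedules requests: an in-memory dict stands in for the network, pages are visited breadth-first, and each URL is requested at most once. The names `SITE`, `RULES`, and `crawl`, and the page URLs, are invented for illustration.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "/frim/index1.html": ["/frim/index1-2.html", "/movie/index101.html"],
    "/frim/index1-2.html": ["/frim/index1-3.html", "/movie/index102.html"],
    "/frim/index1-3.html": ["/frim/index1-2.html"],
    "/movie/index101.html": [],
    "/movie/index102.html": [],
}

# Each rule: (allow pattern, callback name, follow).
RULES = [
    (re.compile(r"/frim/index1-\d+\.html"), "parse_item", True),
    (re.compile(r"/movie/index\d+\.html"), "parse_detail", False),
]


def crawl(start_url):
    """Breadth-first crawl; returns (callback, url) pairs in call order."""
    seen = {start_url}
    calls = []
    # Queue entries: (url, callback name or None, whether to apply the rules).
    queue = deque([(start_url, None, True)])
    while queue:
        url, callback, follow = queue.popleft()
        links = SITE[url]  # "download" the page
        if callback:
            calls.append((callback, url))  # hand the page to its callback
        if not follow:
            continue
        for link in links:  # apply every rule's link extractor
            for pattern, rule_callback, rule_follow in RULES:
                if pattern.search(link) and link not in seen:
                    seen.add(link)
                    queue.append((link, rule_callback, rule_follow))
                    break  # the first matching rule claims the link
    return calls


calls = crawl("/frim/index1.html")
for callback, url in calls:
    print(callback, url)
```

Note that the start page itself never reaches `parse_item` in this model. CrawlSpider behaves the same way: responses for start_urls go to `parse_start_url`, which returns nothing by default, so the movies listed on the first page are only captured if that method is overridden or another page links back to a matching URL.
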
items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlprojectItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    name = scrapy.Field()

pipelines.py:

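# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlprojectPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        print("Spider started!")
        self.fp = open("./movies.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # extract_first() returns None when the XPath matches nothing,
        # so fall back to "" to avoid a TypeError on concatenation.
        title = item["title"] or ""
        name = item["name"] or ""
        self.fp.write(title + ":" + name + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle.
        print("Spider finished!")
        self.fp.close()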
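Scrapy calls the pipeline hooks in a fixed order: `open_spider` once, `process_item` for every item, then `close_spider` once. The sketch below drives that lifecycle by hand so the file output can be checked without running a crawl. `MoviePipeline`, the temporary path, and the sample items are assumptions made for the demo; plain dicts stand in for `CrawlprojectItem`.

```python
import os
import tempfile


class MoviePipeline:
    """Stand-in with the same three hooks as CrawlprojectPipeline."""

    def __init__(self, path):
        # Assumption: the path is injected so the demo can use a temp file.
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Fall back to "" so a missing value does not raise TypeError.
        title = item.get("title") or ""
        name = item.get("name") or ""
        self.fp.write(f"{title}:{name}\n")
        return item

    def close_spider(self, spider):
        if self.fp is not None:
            self.fp.close()


# Invented sample items; the second one simulates an XPath that matched nothing.
items = [
    {"title": "Movie A", "name": "Actor 1"},
    {"title": "Movie B", "name": None},
]

path = os.path.join(tempfile.mkdtemp(), "movies.txt")
pipeline = MoviePipeline(path)
pipeline.open_spider(spider=None)
for item in items:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fp:
    written = fp.read()
print(written)
```

Each item becomes one `title:name` line, and an item with a missing name still produces a line instead of crashing the pipeline.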

settings.py:

# -*- coding: utf-8 -*-
# Scrapy settings for crawlProject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'crawlProject'

SPIDER_MODULES = ['crawlProject.spiders']
NEWSPIDER_MODULE = 'crawlProject.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawlProject (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'crawlProject.pipelines.CrawlprojectPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
