scrapy快速入门

1. 什么是scrapy？

　　其官网是这样简述的，“A Fast & Powerful Scraping &Crawling Framework ”, 并且其底层以twisted作为网络架构( Python实现的基于事件驱动的网络引擎框架)，所以爬取效率及性能出色。

　　定义·：Scrapy是一个为了爬取网站数据，提取结构性数据而编写的应用框架。其可以应用在数据挖掘，信息处理或存储历史数据等一系列的程序中。其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的，也可以应用在获取API所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫。Scrapy用途广泛，可以用于数据挖掘、监测和自动化测试

2. scrapy模块执行与通信流程

　　2.1 流程图：

　　2.2 各大组件的作用：

　　　　引擎 (Scrapy Engine)：用来处理整个系统的数据流，触发事物（框架核心）

　　　　调度器 (Scheduler)：用来接收引擎发送的请求，压入请求队列中，并在引擎再次请求的时候返回，可以想象成一个url（待爬取网页的url）的优先队列，由他决定下一个要抓取的网址是什么，同时去除重复的网址

　　　　下载器 (Downloader)：顾名思义，用于下载网页代码，并将结果返回给spider

　　　　爬虫 (Spiders): 蜘蛛呢，就是执行者，用于解析网页中的信息，即实体（Item）

　　　　项目管道(Pipeline): 负责处理爬虫从网页中抽取的实体，主要功能是持久化实体，验证实体的有效性，清除不需要的信息。当爬虫被解析后，将被发送至项目管道，并经过几个特定次序来处理数据

　　　　下载中间件 (Downloader Middlewares): 位于Engine及Spider之间的框架，主要工作是处理scrapy engine与Downloader之间的请求及响应

　　　　爬虫中间件 (Spider Middlewares): 位于Engine及Spider之间的框架，主要工作是处理scrapy engine与spider之间的响应输入及请求输出

　　　　调度中间件 (Schedule Middlewares): 介于scrapy engine及scheduler之间的中间件，处理从scrapy engine发送到scheduler的请求响应

　　2.3 是不是感觉一头雾水，那我们直接上过程就很清楚啦

3. scrapy的基本使用

　　3.1 安装

# 这里我们用conda创建一个名为rawling_py36的虚拟环境

conda create --name crawling_py36 python=3.6

# 进入环境

activate crawling_py36

# 安装scrapy

conda install scrapy

# 退出当前环境(windows)

deactivate

　　说明：也可以使用pip install scrapy命令进行安装，但是在windows平台下， scrapy依赖pypiwin32模块，在执行pip install scrapy命令前，请先执行pip install pypiwin32

　　3.2 开始（这里以爬取糗事百科段子为例）

　　　　a. 新建项目

# 新建一个文件夹 并进入文件夹

mkdir spider_projects

cd ./spider_projects

# 进入虚拟环境

activate crawling_py36

# 新建爬虫项目,

scrapy startproject qsbk

#提示成功

　　　　b.新建爬虫

　　　　其域名为qiushibaike.com，我们这里根据域名新建一个spider

# 一定要在项目文件夹下

cd qsbk

scrapy genspider qsbk_spider qiushibaike.com

　　　　成功之后，其文件树结构为：

　　　　c. 分析网页结构（https://www.qiushibaike.com/text/）

　　　　　那么红圈里的div就是当前页的所有段子信息，我们使用xpath进行解析，那么每一个段子的xpath语法为 //div[@id='content-left']/div[contains(@class,'article')]

　　　　　提取下一页的地址的xpath语法为 //ul[@class='pagination']/li[last()]/a/@href

　　　　d. 开始

　　　　爬虫配置 (settings.py):

 # Obey robots.txt rules

 ROBOTSTXT_OBEY = False

 # Override the default request headers:

 DEFAULT_REQUEST_HEADERS = {

   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

   'Accept-Language': 'en',

   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',

 }

 # Configure item pipelines

 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

 ITEM_PIPELINES = {

    'qsbk.pipelines.QsbkPipeline': 300,

 }

　　　　解析数据（qsbk_spider.py）：

 # -*- coding: utf-8 -*-

 import scrapy

 from ..items import QsbkItem

 class QsbkSpiderSpider(scrapy.Spider):

     name = 'qsbk_spider'

     allowed_domains = ['qiushibaike.com']

     start_urls = ['https://www.qiushibaike.com/text/page/1/']

     base_domain = "https://www.qiushibaike.com"

     def parse(self, response):

         # 获取所有段子信息

         texts = response.xpath("//div[@id='content-left']/div[contains(@class, 'article')]")

         # 提取单个段子信息

         for text in texts:

             author_info = text.xpath("./div[contains(@class,'author')]")

             head_img = 'http:' + author_info.xpath(".//img/@src").get().strip()

             author_name = author_info.xpath(".//h2/text()").get().strip()

             content_list = text.xpath(".//div[@class='content']/span/text()").getall()

             content = ''.join(content_list).strip()

             article_item = QsbkItem(author_name=author_name, head_img=head_img, content=content)

             yield article_item

         # 下一页的地址

         next_page_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()

         if next_page_url:

             yield scrapy.Request(url=self.base_domain+next_page_url, callback=self.parse)

         else:

             return

　　　　结构化实体（items.py）：

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class QsbkItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    author_name = scrapy.Field()

    head_img = scrapy.Field()

    content = scrapy.Field()

　　　　持久化数据（pipelines.py）:

 # -*- coding: utf-8 -*-

 # Define your item pipelines here

 #

 # Don't forget to add your pipeline to the ITEM_PIPELINES setting

 # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

 # import json

 # class QsbkPipeline(object):

 #

 #     def open_spider(self, spider):

 #         print('start crawling....')

 #         self.file = open('qsbk.json','w' , encoding='utf-8')

 #

 #     def process_item(self, item, spider):

 #         item_json = json.dumps(dict(item), ensure_ascii=False)

 #         self.file.write(item_json+'\n')

 #         return item

 #

 #     def close_spider(self, spider):

 #         self.file.close()

 #         print('stop crawling.....')

 # 一次性写入

 # from scrapy.exporters import JsonItemExporter, JsonLinesItemExporter

 # class QsbkPipeline(object):

 #

 #     def open_spider(self, spider):

 #         print('start crawling....')

 #         self.file = open('qsbk.json', 'wb')

 #         self.exporter = JsonItemExporter(self.file, ensure_ascii=False)

 #         self.exporter.start_exporting()

 #

 #     def process_item(self, item, spider):

 #         self.exporter.export_item(item)

 #

 #     def close_spider(self, spider):

 #         self.exporter.finish_exporting()

 #         self.file.close()

 #         print('stop crawling.....')

 # 分行写入，（在数据量很大的时候推荐使用）

 from scrapy.exporters import JsonLinesItemExporter

 class QsbkPipeline(object):

     def open_spider(self, spider):

         print('start crawling....')

         self.file = open('qsbk.json', 'wb')

         self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False)

     def process_item(self, item, spider):

         self.exporter.export_item(item)

         return item

     def close_spider(self, spider):

         self.file.close()

         print('stop crawling.....')

　　说明：为了避免每次爬取使用命令行，可以在项目根目录下新建 run.py 代替命令行执行，而scrapy也同样提供了执行cmd命令这一模块

　　启动爬虫（run.py）

 from scrapy import cmdline

 cmdline.execute("scrapy crawl qsbk_spider".split())

　　e.爬取结束（qsbk.json）