scrapy框架安装配置

scrapy介绍

Scrapy一个开源和协作的框架，其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的，使用它可以以快速、简单、可扩展的方

式从网站中提取所需的数据。但目前Scrapy的用途十分广泛，可用于如数据挖掘、监测和自动化测试等领域，也可以应用在获取API

所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫。

Scrapy 是基于twisted框架开发而来，twisted是一个流行的事件驱动的python网络框架。因此Scrapy使用了一种非阻塞（又名异

步）的代码来实现并发。整体架构大致如下

1.引擎(EGINE)

引擎负责控制系统所有组件之间的数据流，并在某些动作发生时触发事件。有关详细信息，请参见上面的数据流部分。

2.调度器(SCHEDULER)

用来接受引擎发过来的请求, 压入队列中, 并在引擎再次请求的时候返回. 可以想像成一个URL的优先级队列, 由它来决定下一个

要抓取的网址是什么, 同时去除重复的网址

3.下载器(DOWLOADER)

用于下载网页内容, 并将网页内容返回给EGINE，下载器是建立在twisted这个高效的异步模型上的

4.爬虫(SPIDERS)

SPIDERS是开发人员自定义的类，用来解析responses，并且提取items，或者发送新的请求

5.项目管道(ITEM PIPLINES)

在items被提取后负责处理它们，主要包括清理、验证、持久化（比如存到数据库）等操作

6.下载器中间件(Downloader Middlewares)

位于Scrapy引擎和下载器之间，主要用来处理从EGINE传到DOWLOADER的请求request，已经从DOWNLOADER传到EGINE的

响应response，你可用该中间件做以下几件事

1. process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);

2. change received response before passing it to a spider;

3. send a new Request instead of passing received response to a spider;

4. pass response to a spider without fetching a web page;

5. silently drop some requests.

7.爬虫中间件(Spider Middlewares)

位于EGINE和SPIDERS之间，主要工作是处理SPIDERS的输入（即responses）和输出（即requests）

scrapy安装（win）

安装

1.pip insatll wheel #支持本地安装的模块

  pip install lxml

  pip install pyopenssl

2.下载合适的版本的twisted：http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

3.安装twisted,到同一个目录，然后pip install

4.pip install pywin32

5.pip intstall scrapy

如果：在终端输入scrapy没有问题就是安装成功了

命令行常用指令

#创建工程

scrapy startproject name

#创建爬虫文件

scrapy genspider spiderName www.xxx.com

#执行爬虫任务

scrapy crawl 工程名字

其他命令

#查看帮助

 scrapy -h

 scrapy <command> -h

#有两种命令：其中Project-only必须切到项目文件夹下才能执行，而Global的命令则不需要

 Global commands:

 	startproject #创建项目

 	genspider #创建爬虫程序

 	settings #如果是在项目目录下，则得到的是该项目的配置

 	runspider #运行一个独立的python文件，不必创建项目

 	shell #scrapy shell url地址 在交互式调试，如选择器规则正确与否

 	fetch #独立于程单纯地爬取一个页面，可以拿到请求头

 	view #下载完毕后直接弹出浏览器，以此可以分辨出哪些数据是ajax请求

 	version #scrapy version 查看scrapy的版本，scrapy version -v查看scrapy依赖库的版本

 	Project-only commands:

 	crawl #运行爬虫，必须创建项目才行，确保配置文件中ROBOTSTXT_OBEY = False

 	check #检测项目中有无语法错误

 	list #列出项目中所包含的爬虫名

爬虫文件信息

# -*- coding: utf-8 -*-

import scrapy

class ZxSpider(scrapy.Spider):

    #工程名称，唯一标志

    name = 'zx'

    #允许爬取的域名（一般不用）

    # allowed_domains = ['www.baidu.com']

    #起始爬取的url,可以是多个

    start_urls = ['http://www.baidu.com/',"https://docs.python.org/zh-cn/3/library/index.html#library-index"]

    #回调函数,返回请求回来的信息

    def parse(self, response):

        print(response)

配置文件修改(setting.py)

修改UA和是否遵守爬虫协议添加日志打印等级

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'zx_spider (+http://www.yourdomain.com)'

# Obey robots.txt rules，君子协议不遵守

ROBOTSTXT_OBEY = True

LOG_LEVEL='ERROR'

最后测试下配置成功没有

简单案例(爬段子)

# -*- coding: utf-8 -*-

import scrapy

class DuanziSpider(scrapy.Spider):

    name = 'duanzi'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['http://duanziwang.com/']

    def parse(self, response):

        div_list=response.xpath('//main/article')

        for i in div_list:

            title=i.xpath('.//h1/a/text()').extract_first()

            #xpath返回的是存放selector对象的列表，想要拿到数据需要调用extract()函数取出内容，如果列表长度为1可以使用extract_first()

            content=i.xpath('./div[@class="post-content"]/p/text()').extract_first()

            print(title)

            print(content)

执行流程

五大核心组件

引擎(Scrapy)

（创建对象，根据数据调度方法等）

用来处理整个系统的数据流处理, 触发事务(框架核心)

调度器(Scheduler)

用来接受引擎发过来的请求, 压入队列中, 并在引擎再次请求的时候返回. 可以想像成一个URL（抓取网页的网址或者说是链接）的优先队列, 由它来决定下一个要抓取的网址是什么, 同时去除重复的网址

下载器(Downloader)

用于下载网页内容, 并将网页内容返回给蜘蛛(Scrapy下载器是建立在twisted这个高效的异步模型上的)

爬虫(Spiders)

爬虫是主要干活的, 用于从特定的网页中提取自己需要的信息, 即所谓的实体(Item)。用户也可以从中提取出链接,让Scrapy继续抓取下一个页面

项目管道(Pipeline)

负责处理爬虫从网页中抽取的实体，主要的功能是持久化实体、验证实体的有效性、清除不需要的信息。当页面被爬虫解析后，将被发送到项目管道，并经过几个特定的次序处理数据。

执行流程

1.爬虫程序将url封装后发送给引擎

2.引擎拿到url后，把它给调度器

3.调度器首先过滤重复的url

4.将过滤好的url压入队列

5.将队列发给引擎

6.引擎将队列发给下载器

7.下载器向互联网请求数据

8.获取数据

9.将数据response发给引擎

10.引擎将数据发给爬虫程序的回调

11.数据处理好，在此发给引擎

12.引擎将数据发给管道，由管道进行数据的持久化存储

配置run文件启动项目

1.新建run.py

2.

from scrapy.cmdline import execute

execute(['scrapy','crawl','jd'])

高级设置

修改初始请求

#默认初始请求是这个

start_urls = ['https://www.jd.com']

#重写__init__()函数(qs)

    def __init__(self,qs=None,*args,**kwargs):

        super(JdSpider,self).__init__(*args,**kwargs)

        self.api = "http://list.tmall.com/search_product.htm?"

        self.qs = eval(qs)

#重写的start_requests函数

#初始化请求

    def start_requests(self):

        for q in self.qs:

            self.param = {

                "q": q,

                "totalPage": 1,

                'jumpto': 1,

            }

            url = self.api + urlencode(self.param)

            yield scrapy.Request(url=url,callback=self.gettotalpage,dont_filter=True)

#后续请求

    def gettotalpage(self, response):

        totalpage = response.css('[name="totalPage"]::attr(value)').extract_first()

        self.param['totalPage'] = int(totalpage)

        for i in range(1,self.param['totalPage']+1):

        # for i in range(1,3):

            self.param['jumpto'] = i

            url  = self.api + urlencode(self.param)

            yield scrapy.Request(url=url,callback=self.get_info,dont_filter=True)

自定义解析函数

    #即对应请求函数的callback函数

    def get_info(self,response):

        product_list = response.css('.product')

        for product in product_list:

            title = product.css('.productTitle a::attr(title)').extract_first()

            price = product.css('.productPrice em::attr(title)').extract_first()

            status = product.css('.productStatus em::text').extract_first()

            # print(title,price,status)

            item = items.MyxiaopapaItem()

            item['title'] = title

            item['price'] = price

            item['status'] = status

            yield item

item使用

1.items.py里面规定可以接收的参数

import scrapy

class MyxiaopapaItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title = scrapy.Field()

    price = scrapy.Field()

    status = scrapy.Field()

2.生成items对象，并返回

from myxiaopapa import items

#解析函数返回item对象

            item = items.MyxiaopapaItem()

            item['title'] = title

            item['price'] = price

            item['status'] = status

            yield item

pipelines

存储

#yield item之后就会执行pipelines里面的方法

#前提条件是settings里面有配置

#数字为优先级，越小越优先，可以配置多个，一般用于多个存储

#ITEM_PIPELINES = {

#    'zx.pipelines.ZxPipeline': 300,

#}

配置数据库

import pymongo

import  json

class MyxiaopapaPipeline(object):

    def __init__(self,host,port,db,table):

        self.host = host

        self.port = port

        self.db = db

        self.table = table

	#优先于__init__()执行

    @classmethod

    def from_crawler(cls,crawl):

        port = crawl.settings.get('PORT')

        host = crawl.settings.get('HOST')

        db = crawl.settings.get('DB')

        table = crawl.settings.get('TABLE')

        return cls(host,port,db,table)

	#爬虫启动执行，可以用来开启数据库连接

    def open_spider(self,crawl):

        self.client = pymongo.MongoClient(port=self.port,host=self.host)

        db_obj = self.client[self.db]

        self.table_obj = db_obj[self.table]

	#爬虫结束执行，可以用来关闭数据库连接

    def close_spider(self,crawl):

        self.client.close()

    def process_item(self, item, spider):

        self.table_obj.insert(dict(item))

        print(item['title'],'存储成功')

        return item

配置请求头

#settings里面默认有

#DEFAULT_REQUEST_HEADERS = {

#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#   'Accept-Language': 'en',

#}

#如果想要自定义,优先走自定义

    # custom_settings = {

    #     'NAME':"MAC",

    #     'DEFAULT_REQUEST_HEADERS':{

    #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

    #     'Accept-Language': 'en',

    #     "User-Agent": "XXXX"

    # }

    #

    #

    # }

DownloaderMiddleware

request

#None

执行下一个中间件的process_request

#Response

执行最后中间件的process_response在往前执行

#Request

请求放到队列重新开始

#异常

执行最后中间件process_exception，在往前执行

Response

#默认response

正常执行

#Response(url)

执行最后中间件的process_response在往前执行

#Request

请求放到队列重新开始

#异常

执行spider的错误执行

代理

    def process_request(self, request, spider):

        request.meta['Download_timeout'] = 10

        request.meta['proxy'] = "http://" + get_proxy()

        return None

参考链接

https://www.cnblogs.com/xiaoyuanqujing/protected/articles/11805810.html