Scrapy Study Notes 2

This chapter covers two spider topics:

Callbacks and following links
Using arguments

Callbacks and following links

Here is another spider for the site used in the previous post, this time scraping author information:

# -*- coding: utf-8 -*-
import scrapy


class MyspiderAuthorSpider(scrapy.Spider):
    name = 'myspider_author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Follow links to author pages
        for href in response.xpath('//div[@class="quote"]/span/a/@href'):
            yield response.follow(href, self.parse_author)

        # Follow the link to the next page
        for href in response.xpath('//li[@class="next"]/a/@href'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath('//h3[@class="author-title"]/text()').extract_first(),
            'birthdate': response.xpath('//span[@class="author-born-date"]/text()').extract_first(),
        }

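This spider starts from the main page. It follows every link to an author page with the parse_author callback, and follows the pagination links with the parse callback.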

Here the callback is passed to response.follow directly as an argument, which keeps the code short. A callback can be passed to scrapy.Request in the same way.
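One practical difference between the two: response.follow accepts a relative URL (or, as in the spider above, a selector for an href attribute), while scrapy.Request needs an absolute URL, which is why code built on scrapy.Request usually calls response.urljoin first. As a minimal sketch of what that resolution roughly does, the standard library's urljoin can stand in for it. The helper name `resolve` and the author path are illustrative assumptions, not part of the original spider.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# A site-relative href is resolved against the page it was found on.
author_url = resolve('http://quotes.toscrape.com/', '/author/Albert-Einstein')
next_url = resolve('http://quotes.toscrape.com/page/1/', '/page/2/')

print(author_url)
print(next_url)
```

An href that is already absolute passes through unchanged, so resolving every link before building a request is safe.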

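Another interesting thing this spider shows is that, even though one author may have many quotes, we do not need to worry about visiting the same author page more than once. By default, Scrapy filters out duplicate requests, which avoids hitting the server repeatedly because of a programming mistake. If you really do want the duplicates, change the call to:

yield response.follow(href, self.parse_author, dont_filter=True)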
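The idea behind duplicate filtering can be sketched in a few lines of plain Python. This is a simplified illustration only, not Scrapy's implementation (Scrapy compares request fingerprints rather than raw URL strings), and `SeenFilter` is a hypothetical name.

```python
class SeenFilter:
    """Toy duplicate filter: remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # dont_filter=True bypasses the filter entirely.
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


dupefilter = SeenFilter()
first = dupefilter.should_schedule('http://quotes.toscrape.com/author/Albert-Einstein')
second = dupefilter.should_schedule('http://quotes.toscrape.com/author/Albert-Einstein')
forced = dupefilter.should_schedule(
    'http://quotes.toscrape.com/author/Albert-Einstein', dont_filter=True)
print(first, second, forced)
```

The second request for the same author page is dropped, while the one marked dont_filter=True goes through regardless.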

In effect, this spider builds a map of the site's links, visits each one in turn, and extracts information from every page it reaches.

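The basic spider from the previous post only followed the "next page" link over and over, so the crawl could break off partway through. Note the difference between the two approaches.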

Passing arguments to a spider class

When running a spider, you can pass it command-line arguments with the -a option:

dahu@dahu-OptiPlex-:~/PycharmProjects/SpiderLearning/quotesbot$ scrapy crawl toscrape-xpath-tag -a tag=humor -o t1.jl

By default, these arguments are passed to the Spider's __init__ method and become attributes of the spider.
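As a rough sketch of that mechanism in plain Python (not Scrapy's actual code): each -a key=value pair is split into a name and a string value, and the resulting keyword arguments are stored as attributes. `parse_spider_args` and `SketchSpider` are hypothetical names used only for this illustration.

```python
def parse_spider_args(arglist):
    """Turn ['key=value', ...] into a dict, splitting on the first '=' only."""
    return dict(arg.split('=', 1) for arg in arglist)


class SketchSpider:
    """Stand-in for a spider: keyword arguments become instance attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


args = parse_spider_args(['tag=humor', 'myurl=http://example.com/?page=2'])
spider = SketchSpider(**args)
print(spider.tag)
print(spider.myurl)
```

Note that every value arrives as a string, so a numeric argument has to be converted explicitly inside the spider.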

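In this example, the value of the tag command-line argument is available as self.tag. You can build the URL from it, so that the spider only scrapes quotes with a specific tag: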

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath-tag'
    # Not used for the first request: start_requests() below overrides it.
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # self.tag only exists when the spider is run with -a tag=...
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tag': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract(),
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

Of course, the spider also runs normally without the -a option. This approach works by overriding the start_requests() method.
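The fallback logic in start_requests() can be checked in isolation. The sketch below reproduces only the URL-building step; `build_start_url` is a hypothetical helper, and SimpleNamespace stands in for a spider object with or without a tag attribute.

```python
from types import SimpleNamespace

BASE_URL = 'http://quotes.toscrape.com/'


def build_start_url(spider):
    """Return the first URL to crawl, narrowed to a tag when one is set."""
    url = BASE_URL
    # getattr with a default avoids AttributeError when -a tag=... is absent.
    tag = getattr(spider, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    return url


with_tag = build_start_url(SimpleNamespace(tag='humor'))
without_tag = build_start_url(SimpleNamespace())
print(with_tag)
print(without_tag)
```

Using getattr with a default is what lets the same spider run both with and without the -a option.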

Another example overrides the __init__() method directly:

# -*- coding: utf-8 -*-
import scrapy


class Dahu2Spider(scrapy.Spider):
    name = 'dahu2'
    allowed_domains = ['www.sina.com.cn']
    start_urls = ['http://slide.news.sina.com.cn/s/slide_1_2841_197495.html']

    def __init__(self, myurl=None, *args, **kwargs):
        super(Dahu2Spider, self).__init__(*args, **kwargs)
        # Fall back to the class-level default when -a myurl=... is not given
        if myurl is None:
            myurl = Dahu2Spider.start_urls[0]
        # The message reads "URL to crawl: ..."
        print("要爬取的网址为:%s" % myurl)
        self.start_urls = [myurl]

    def parse(self, response):
        title = response.xpath('//title/text()').extract_first()
        yield {'title': title}
        # The print() call form runs on both Python 2 and Python 3
        print(title)

Running it:

dahu@dahu-OptiPlex-:~/PycharmProjects/SpiderLearning/quotesbot$ scrapy crawl dahu2 --nolog

要爬取的网址为:http://slide.news.sina.com.cn/s/slide_1_2841_197495.html

沈阳：男子养猪养出新花样 天天逼“二师兄”跳水锻炼_高清图集_新浪网

dahu@dahu-OptiPlex-:~/PycharmProjects/SpiderLearning/quotesbot$ scrapy crawl dahu2 -a myurl=http://www.sina.com.cn --nolog

要爬取的网址为:http://www.sina.com.cn

新浪首页

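Note that what yield produces here is a dict. I tried yielding other types, and only the four kinds named in this error message are accepted: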

-- :: [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'unicode' in <GET http://www.sina.com.cn>

Printing with print is crude, of course. The item produced by yield looks like this:

{'title': u'\u65b0\u6d6a\u9996\u9875'}

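The escape sequences here are an encoding display issue. It can be solved with the json library by writing the content out to a file; I won't go into the details here.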
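A minimal sketch of the json approach, assuming Python 3: json.dumps escapes non-ASCII characters by default, and passing ensure_ascii=False keeps them readable. The helper names `to_json_line` and `dump_items` are hypothetical.

```python
import io
import json


def to_json_line(item, readable=True):
    """Serialize one item; readable=True keeps non-ASCII text unescaped."""
    return json.dumps(item, ensure_ascii=not readable)


def dump_items(items, fileobj):
    """Write one JSON object per line to an already opened text file."""
    for item in items:
        fileobj.write(to_json_line(item) + '\n')


item = {'title': '新浪首页'}
escaped = to_json_line(item, readable=False)
readable = to_json_line(item)

buffer = io.StringIO()
dump_items([item], buffer)
print(escaped)
print(readable)
```

When writing to a real file, open it with encoding='utf-8' so the unescaped characters are stored correctly.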

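Tip:

Scrapy can pass data between Requests at different crawl levels. In the example below, parse_item passes item to parse_details through meta, so that parse_details can return the item in one go once all of its data has been scraped.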

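import scrapy


class MyItem(scrapy.Item):
    # Hypothetical fields, for illustration only
    url = scrapy.Field()
    details = scrapy.Field()


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls` (placeholder selector, adapt to the real site)
        item_urls = response.xpath('//a[@class="item"]/@href').extract()
        for item_url in item_urls:
            yield scrapy.Request(url=response.urljoin(item_url), callback=self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        item['url'] = response.url
        # placeholder selector for the link to the details page
        details_href = response.xpath('//a[@class="details"]/@href').extract_first()
        if details_href is None:
            yield item
            return
        item_details_url = response.urljoin(details_href)
        yield scrapy.Request(url=item_details_url, meta={'item': item},
                             callback=self.parse_details)

    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        item['details'] = response.xpath('//title/text()').extract_first()
        yield item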
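To see how the meta dictionary travels from one callback to the next without running a crawl, here is a toy model in plain Python. ToyRequest, ToyResponse and crawl are hypothetical stand-ins written for this note, not Scrapy classes; the only behavior borrowed from Scrapy is that a response exposes the meta of the request that produced it.

```python
from collections import deque


class ToyRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class ToyResponse:
    def __init__(self, request):
        self.url = request.url
        self.request = request

    @property
    def meta(self):
        # The response hands back the meta dict of its originating request.
        return self.request.meta


def parse_item(response):
    item = {'url': response.url}
    # Attach the half-filled item to the follow-up request.
    yield ToyRequest(response.url + '/details', parse_details, meta={'item': item})


def parse_details(response):
    item = response.meta['item']
    item['details_url'] = response.url
    yield item


def crawl(start_request):
    """Process requests in order; no network access, responses are faked."""
    queue = deque([start_request])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(ToyResponse(request)):
            if isinstance(result, ToyRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl(ToyRequest('http://example.com/item', parse_item))
print(items)
```

The item is yielded only once, by the last callback in the chain, after both callbacks have filled in their fields.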