[Scrapy] Some things about Scrapy

1. Pause and resume a crawl

Scrapy supports this functionality out of the box by providing the following facilities:

a scheduler that persists scheduled requests on disk

a duplicates filter that persists visited requests on disk

an extension that keeps some spider state (key/value pairs) persistent between batches

Run a crawl with a job directory:

scrapy crawl somespider -s JOBDIR=crawls/somespider_dir

Press Ctrl+C once to stop the crawl gracefully, then resume it later by running the same command again.

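2. Issue a GET request

e.g.

Page A is a list of news items and contains a link to each article.

We want to issue a request to fetch the content of each article.

Values set on request.meta are carried into the callback function, where they can be read back from response.meta.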

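# Methods of a scrapy.Spider subclass. News is the project's Item class and
# headers is a dict of request headers; both are defined elsewhere.
def parse(self, response):
    newslist = response.xpath('//ul[@class="linkNews"]/li')
    for item in newslist:
        news = News()
        news['title'] = item.xpath('a/text()').extract_first(default='')
        contentUri = item.xpath('a/@href').extract_first(default='')
        # urljoin resolves relative links against the URL of the list page
        request = scrapy.Request(response.urljoin(contentUri),
                    callback=self.getContent_callback,
                    headers=headers)
        # attach the partly filled item so the callback can complete it
        request.meta['item'] = news
        yield request

def getContent_callback(self, response):
    # read back the item that parse() attached to the request
    news = response.meta['item']
    news['content'] = response.xpath('//article[@class="art_box"]').xpath('string(.)').extract_first(default='').strip()
    yield news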

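3. Interactive shell

Here you can inspect all kinds of information interactively, such as response.status.

I mainly use it to debug XPath expressions (warning: results obtained in the shell are not always reliable).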

PS C:\Users\patrick\Documents\Visual Studio 2017\Projects\ScrapyProjects> scrapy shell --nolog 'http://mil.news.sina.com.cn/2011-03-31/1342640379.html'

[s] Available Scrapy objects:

[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)

[s]   crawler    <scrapy.crawler.Crawler object at 0x0000026EA72752B0>

[s]   item       {}

[s]   request    <GET http://mil.news.sina.com.cn/2011-03-31/1342640379.html>

[s]   response   <200 http://mil.news.sina.com.cn/2011-03-31/1342640379.html>

[s]   settings   <scrapy.settings.Settings object at 0x0000026EA8586940>

[s]   spider     <DefaultSpider 'default' at 0x26ea884bb38>

[s] Useful shortcuts:

[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)

[s]   fetch(req)                  Fetch a scrapy.Request and update local objects

[s]   shelp()           Shell help (print this help)

[s]   view(response)    View response in a browser

In [1]: response.status

Out[1]: 200

Set custom headers in the interactive shell:

$ scrapy shell --nolog

...

...

>>> from scrapy import Request

>>> req = Request('https://www.douban.com', headers = {'User-Agent' : '...'})

>>> fetch(req)

If you only want to set the user agent:

scrapy shell -s USER_AGENT='useragent' 'https://movie.douban.com'

4. Pass arguments to a spider from the command line

scrapy crawl myspider -a category=electronics

Inside the spider, each argument is received by its name, like category in the code below.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):

        super(MySpider, self).__init__(*args, **kwargs)

        self.start_urls = ['http://www.example.com/categories/%s' % category]

        # ...

5. Remove \r\n from page text

Use the XPath function normalize-space.

Also, extract_first is handy: it accepts a default value for the case where nothing matches.

item['content'] = response.xpath('normalize-space(//div[@class="blkContainerSblkCon" and @id="artibody"])').extract_first(default = '')

6. Stop a spider programmatically

The way to do this is to raise the built-in CloseSpider exception.

exception scrapy.exceptions.CloseSpider(reason='cancelled')

This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

Parameters: reason (str) – the reason for closing

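from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')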

7. [mysql] Incorrect string value: '\xF0\x9F\x8C\xB9' for column 'title' at row 1

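Set the charset parameter to utf8mb4 when connecting to the database. The bytes in the error message are a 4-byte UTF-8 character (an emoji), which MySQL's 3-byte utf8 character set cannot store.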
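
A quick way to confirm what the error is complaining about is to decode the bytes from the message. This sketch uses only the standard library and does not touch the database:

```python
# The bytes quoted in the MySQL error message, decoded as UTF-8
rose = b"\xF0\x9F\x8C\xB9".decode("utf-8")

print(ascii(rose))                # '\U0001f339', a single emoji code point
print(len(rose))                  # 1 character
print(len(rose.encode("utf-8")))  # 4 bytes in UTF-8
```

With a driver such as PyMySQL the fix would be charset='utf8mb4' in the connect call. The target column or table presumably has to use utf8mb4 as well, otherwise the insert can still fail.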

8. Exported file contains \uXXXX escape sequences instead of Chinese characters

Add FEED_EXPORT_ENCODING = 'utf-8' at the end of settings.py.
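
The effect can be reproduced with the standard json module. This is only an analogy, assuming the output in question is a JSON feed: as far as I know, Scrapy's JSON exporter escapes non-ASCII characters when FEED_EXPORT_ENCODING is unset, much like json.dumps does by default. The item below is made up.

```python
import json

item = {"title": "新闻标题"}  # made-up item with a Chinese title

# Default behaviour: ASCII-only output, Chinese becomes \uXXXX escapes
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters, ready to be written as UTF-8
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable == '{"title": "新闻标题"}')  # True
```

Both strings decode back to the same data, so the escaped file is not corrupt, only hard to read.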

9. Some things about Item

>>> import scrapy

>>> class A(scrapy.Item):

...     post_id = scrapy.Field()

...     user_id = scrapy.Field()

...     content = scrapy.Field()

...

>>> type(A)

<class 'scrapy.item.ItemMeta'>

The post_id and user_id fields here can hold data of any type.

Values can be read in the same way as with a dict.

>>> a = A(post_id = '12312312', user_id = '2342_author_id')

>>> a['post_id']

'12312312'

>>> a['user_id']

'2342_author_id'

If a field has not been assigned a value, reading it with item['key'] raises a KeyError. The fix is to use the get method instead.

>>> a.get('content', default = 'empty')

'empty'

>>> a.get('content', 'empty')

'empty'

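Check whether a field is declared on the Item and whether it has been assigned a value:

>>> 'name' in a   # has name been assigned a value?

False

>>> 'name' in a.fields  # is 'name' one of a's declared fields?

False

>>> 'content' in a  # has content been assigned a value?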

False

>>> 'content' in a.fields

True

I recommend replacing every item['key'] lookup with item.get('key', '').

10. Write logs to a file

Add the following to settings.py:

LOG_STDOUT = True

LOG_FILE = 'scrapy_log.txt'

Or set it on the command line:

scrapy crawl MyCrawler -s LOG_FILE=/var/log/crawler_mycrawler.log
