[Scrapy] Crawling the Tencent Recruitment Site

Analyzing the crawl target

The starting URL:

http://hr.tencent.com/position.php?@start=0&start=0#a

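(Optional) Because the listing spans several pages, we can compare the page URLs to see how they are related:

Page 2: http://hr.tencent.com/position.php?@start=0&start=10#a

Page 3: http://hr.tencent.com/position.php?@start=0&start=20#a

In other words, the trailing start value increases by 10 per page (#a has no practical effect, and entering start=0 also opens the first page).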
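The pagination rule can be sketched in a few lines. This is a minimal illustration, assuming the URL pattern shown above and 10 postings per page; the helper name page_url and the constants are hypothetical, not part of the project.

```python
# Hypothetical helper: builds the listing URL for a 1-based page number,
# assuming the observed pattern (start grows by 10 per page).
BASE = "http://hr.tencent.com/position.php?@start=0&start="
PAGE_SIZE = 10


def page_url(page: int) -> str:
    """Return the listing URL for the given 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE + str((page - 1) * PAGE_SIZE) + "#a"


for n in (1, 2, 3):
    print(n, page_url(n))
```

Keeping the page size in one constant makes it easy to adjust if the site changes how many rows it shows per page.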

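Decide what to scrape:

We scrape the five columns of the table plus the URL of each posting's detail page, six fields in total. While reading the page source, the F12 developer tools help locate the elements.

Rows (tr) with class=even have a white background and rows with class=odd have a gray one; expanding a row in the developer tools shows its fields.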
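To illustrate the row selection without a live page, here is a small sketch that uses only the standard library. The markup in SAMPLE is a hypothetical, simplified stand-in for the real table, and data_rows is an illustrative name. Scrapy's selectors accept full XPath, including the | union used later in the spider; ElementTree supports only a subset, so the class filter is done in Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page has more columns and attributes.
SAMPLE = """
<table>
  <tr class="h"><td>Position</td><td>Type</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Engineer</a></td><td>Tech</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Designer</a></td><td>Design</td></tr>
</table>
"""


def data_rows(table):
    """Return only the tr elements whose class is 'even' or 'odd'."""
    return [tr for tr in table.iter("tr") if tr.get("class") in ("even", "odd")]


root = ET.fromstring(SAMPLE.strip())
rows = data_rows(root)
for tr in rows:
    link = tr.find("./td[1]/a")  # first cell holds the link to the detail page
    print(link.text, link.get("href"))
```

The header row is skipped because its class is neither even nor odd, which is the same reason the spider filters on those two classes.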

Writing the spider

Initialize the project with the framework:

scrapy startproject Tencent

Edit items.py to declare the six fields listed above:

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Position name
    positionName = scrapy.Field()
    # Link to the position's detail page
    positionLink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNumber = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishtime = scrapy.Field()

Generate the initial spider, which is created as tencent.py:

scrapy genspider tencent "tencent.com"

Edit tencent.py. Note that the parse function must hand the items back (here by yielding them):

import scrapy

from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    baseURL = "http://hr.tencent.com/position.php?@start="
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Data rows alternate between class "even" and class "odd"
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            item = TencentItem()
            # Position name
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0]
            print(node.xpath("./td[1]/a/text()").extract())
            # Link to the position's detail page
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0]
            # Position category (the cell may be empty)
            if len(node.xpath("./td[2]/text()")):
                item['positionType'] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item['positionType'] = ''
            # Number of openings
            item['peopleNumber'] = node.xpath("./td[3]/text()").extract()[0]
            # Work location
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0]
            # Publication date
            item['publishtime'] = node.xpath("./td[5]/text()").extract()[0]
            yield item

        # Pagination, method 1: build the URL directly
        # (stops once the hardcoded upper bound is reached)
        if self.offset < 2190:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            # The callback can be changed, so different pages can be handled by different methods
            yield scrapy.Request(url, callback=self.parse)

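Using two yield statements lets one function emit different kinds of objects as it is iterated: first the items, then the request for the next page. This is a handy technique. The second yield cannot simply be swapped for return, though: parse is a generator, so a value passed to return is not delivered to the engine (and is a syntax error in Python 2). Once the new request has been yielded, the engine calls parse again to handle its response.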
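How a generator treats yield versus return with a value can be checked in plain Python, with no Scrapy involved. The function names and the string placeholders below are purely illustrative.

```python
def with_yield():
    yield "item"
    yield "next-request"   # emitted to whoever iterates the generator


def with_return():
    yield "item"
    return "next-request"  # becomes StopIteration.value, not an emitted element


print(list(with_yield()))   # ['item', 'next-request']
print(list(with_return()))  # ['item']
```

Anything that consumes the generator by iterating over it, as a for loop or list() does, only sees the yielded values, so a request handed back with return never reaches the consumer.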

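Here the next-page URL is built by string concatenation, which is a somewhat clumsy approach; normally the next-page link is extracted from the page itself. Concatenation is still useful when no such link can be extracted.

As an update, here is the approach that extracts the next-page link from the page: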

        # Pagination, method 2: extract the next-page link
        # (on the last page the "next" link carries class "noactive")
        if not len(response.xpath("//a[@class='noactive' and @id='next']")):
            url = 'http://hr.tencent.com/' + response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request(url, callback=self.parse)
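Instead of prepending the host by hand, a relative href can be resolved against the current page URL. The sketch below uses the standard library's urljoin; the href value is a hypothetical example of what the next link might contain. Inside a spider, Scrapy's response.urljoin(href) performs the same resolution against the response URL.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed
page = "http://hr.tencent.com/position.php?@start=0&start=0"
# Hypothetical relative href taken from the "next" link
href = "position.php?@start=0&start=10#a"

next_url = urljoin(page, href)
print(next_url)
```

This keeps working if the site later serves the listing from a subdirectory, where a hardcoded host prefix would produce the wrong path.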

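This is easy for static pages; dynamic pages may need additional tools. In addition, settings.py has a section for request-header settings, which can be edited if needed.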
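As a rough illustration of the header settings, the fragment below sets the two relevant Scrapy options, USER_AGENT and DEFAULT_REQUEST_HEADERS. The values are placeholders chosen for this sketch, not values the site is known to require.

```python
# settings.py (fragment); the values below are placeholders, not recommendations
USER_AGENT = "Mozilla/5.0 (compatible; TencentSpider-demo)"

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
```

These defaults apply to every request the project sends; an individual scrapy.Request can still override them through its headers argument.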

Uncomment the item pipeline setting in settings.py.
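After uncommenting, the setting should look roughly like the fragment below. It assumes the project name Tencent used earlier and the pipeline class name TencentPipeline from the generated template; adjust the dotted path if your names differ.

```python
# settings.py (fragment); assumes project "Tencent" and class "TencentPipeline"
ITEM_PIPELINES = {
    "Tencent.pipelines.TencentPipeline": 300,
}
```

The number sets the order when several pipelines are enabled: lower values run first, and values are conventionally kept in the 0 to 1000 range.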

Edit pipelines.py:

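import json


class TencentPipeline(object):
    def __init__(self):
        # UTF-8 so that non-ASCII text written with ensure_ascii=False is stored correctly
        self.f = open('tencent.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line, each followed by a comma
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.f.close()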

That completes a basic crawler project.

Test and run:

scrapy check tencent

scrapy crawl tencent

Open the saved JSON file; the output looks like the following, with one job posting per line:

{"workLocation": "深圳", "positionType": "技术类", "positionName": "24111-安全架构师", "peopleNumber": "1", "publishtime": "2017-08-26", "positionLink": "position_detail.php?id=32378&keywords=&tid=0&lid=0"},
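Because every line ends with a comma, the file as a whole is not a single valid JSON document. One way to read it back is line by line, as in this minimal sketch; the function names are illustrative, and the sample line is shortened from the output above.

```python
import json


def parse_line(line: str) -> dict:
    """Parse one line written by the pipeline, dropping the trailing comma."""
    return json.loads(line.rstrip().rstrip(","))


def load_records(path: str) -> list:
    """Read every non-empty line of the output file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]


sample = '{"workLocation": "深圳", "peopleNumber": "1", "publishtime": "2017-08-26"},\n'
record = parse_line(sample)
print(record["workLocation"], record["publishtime"])
```

Calling load_records("tencent.json") from the directory where the crawl was run would return all postings as a list of dictionaries.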

After the run, the project folder also contains the output file tencent.json.

The actual crawl takes some time to finish.