The Tencent recruitment site

Create the project

scrapy startproject Tencent

Create the spider

cd Tencent

scrapy genspider -t crawl tencent hr.tencent.com

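1. Start URL: start_url = 'https://hr.tencent.com/position.php'

On the start page, we need the URL of each job's detail page. We also need the URL of the next page, so that the same steps can be repeated there.

URL extraction from the start page therefore falls into two categories:

    1. Extracting the URL of each job's detail page

    2. Extracting the next-page URL; the page it returns is processed exactly like the start page.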
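The two kinds of links can be pictured with a small simulation that needs no network access. This is only a sketch: the PAGES dictionary and its URLs are made up for illustration, and the loop is a simplified stand-in for the crawl, not CrawlSpider's actual implementation.

```python
from collections import deque

# Hypothetical in-memory "site": every listing page exposes detail links
# and, optionally, a link to the next listing page.
PAGES = {
    "position.php?start=0": {
        "details": ["detail?id=1", "detail?id=2"],
        "next": "position.php?start=10",
    },
    "position.php?start=10": {
        "details": ["detail?id=3"],
        "next": None,
    },
}


def crawl(start):
    """Collect detail links from the start page and from every following page."""
    seen = set()
    queue = deque([start])
    details = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = PAGES[url]
        # Category 1: detail-page links, which would be handed to a parse callback.
        details.extend(page["details"])
        # Category 2: the next-page link, which is treated like the start page.
        if page["next"]:
            queue.append(page["next"])
    return details


print(crawl("position.php?start=0"))
```

In the real spider these two categories become two Rule objects: one with a callback, one with follow=True.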

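Extracting URLs

1. Extract the detail-page URLs. On the listing page, each one sits in a td with class "l square" inside the table with class "tablelist".

The rule for extracting detail-page links: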

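rules = (
    # Extract the detail-page URLs. Their responses need data extraction, so a callback is required to parse them.
    # The trailing comma keeps rules a tuple even with a single Rule.
    Rule(LinkExtractor(restrict_xpaths=("//table[@class='tablelist']//td[@class='l square']")), callback='parse_item'),
)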
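To see what restricting link extraction to a region means, the idea can be imitated with the standard library alone. This is a sketch under assumptions: the LISTING markup is a made-up, simplified stand-in for the real page, links_in_region is a hypothetical helper, and ElementTree's limited XPath support replaces the parser Scrapy actually uses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified listing markup (not the real page source).
LISTING = """
<html><body>
  <a href="index.php">Home</a>
  <table class="tablelist">
    <tr><td class="l square"><a href="position_detail.php?id=1">Job A</a></td><td>Shenzhen</td></tr>
    <tr><td class="l square"><a href="position_detail.php?id=2">Job B</a></td><td>Beijing</td></tr>
  </table>
  <a id="next" href="position.php?start=10">Next</a>
</body></html>
""".strip()

BASE = "https://hr.tencent.com/position.php"


def links_in_region(markup, region_path, base_url):
    """Return absolute URLs of all <a href> found inside the selected regions."""
    root = ET.fromstring(markup)
    links = []
    for region in root.findall(region_path):
        # iter("a") also yields the region itself when the region is an <a>.
        for anchor in region.iter("a"):
            href = anchor.get("href")
            if href:
                links.append(urljoin(base_url, href))
    return links


detail_links = links_in_region(
    LISTING, ".//table[@class='tablelist']//td[@class='l square']", BASE
)
next_links = links_in_region(LISTING, ".//a[@id='next']", BASE)
print(detail_links)
print(next_links)
```

The "Home" link is ignored in both cases because it lies outside the selected regions, which is the point of restricting by XPath instead of matching every link on the page.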

Location of the next-page link in the HTML: it is the a element with id "next".

2. The rule for extracting the next-page URL:

rules = (
    # Extract the detail-page URLs
    # Rule(LinkExtractor(allow=r'position_detail.php?id=\d+\&keywords=&tid=0&lid=0'), callback='parse_item'),  # this expression is wrong (the unescaped . and ? are regex metacharacters), so no regex is used here
    Rule(LinkExtractor(restrict_xpaths=("//table[@class='tablelist']//td[@class='l square']")), callback='parse_item'),
    # Pagination
    Rule(LinkExtractor(restrict_xpaths=("//a[@id='next']")), follow=True),
)
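The commented-out allow pattern fails because allow takes regular expressions, and an unescaped ? makes the preceding character optional instead of matching a literal question mark. The check below illustrates this with the re module; the job id in the sample URL is made up, and the escaped pattern is one possible correction, not something taken from the original spider.

```python
import re

# The pattern from the commented-out rule, unchanged.
broken = r'position_detail.php?id=\d+\&keywords=&tid=0&lid=0'

# A possible correction (assumption): escape the literal "." and "?".
fixed = r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0'

# Sample detail URL; the id value is hypothetical.
url = 'https://hr.tencent.com/position_detail.php?id=12345&keywords=&tid=0&lid=0'


def matches(pattern, candidate):
    """Return True when the pattern is found anywhere in the candidate URL."""
    return re.search(pattern, candidate) is not None


print(matches(broken, url))
print(matches(fixed, url))
```

In the broken pattern, "php?" means "ph followed by an optional p", so the regex then expects "id=" where the URL actually has "?id=".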

Extracting detail-page data

1. Detail data extraction (spider logic)

1. Get the title

xpath:

item['title'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()

2. Get the work location, position category, and number of openings

xpath:

item['addr'] = response.xpath('//tr[@class="c bottomline"]/td[1]//text()').extract()[1]

item['position'] = response.xpath('//tr[@class="c bottomline"]/td[2]//text()').extract()[1]

item['num'] = response.xpath('//tr[@class="c bottomline"]/td[3]//text()').extract()[1]

3. Scrape the job requirements

xpath:

item['skill'] = response.xpath('//ul[@class="squareli"]/li/text()').extract()
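The [1] index in step 2 makes sense if each cell holds a label element followed by the value, so that //text() yields two text nodes. The sketch below illustrates this with the standard library. The cell markup is an assumption (the real page source is not shown here), and nth is a hypothetical helper: extract_first() returns None when nothing matches, whereas extract()[1] raises IndexError when fewer than two text nodes are found.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one cell of the "c bottomline" row:
# a label inside a span, followed by the actual value.
cell = ET.fromstring('<td><span class="lightblue l2">Location:</span>Shenzhen</td>')

# itertext() walks all text nodes below the cell, similar in spirit to td//text().
texts = list(cell.itertext())


def nth(values, index, default=None):
    """Return values[index], or default when the list is too short."""
    return values[index] if -len(values) <= index < len(values) else default


addr = nth(texts, 1)   # the value after the label
missing = nth([], 1)   # no text nodes at all: falls back to the default
print(texts, addr, missing)
```

A guard like this keeps one malformed detail page from aborting the callback with an exception.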

The spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import TencentItem


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    rules = (
        # Extract the detail-page URLs
        # Rule(LinkExtractor(allow=r'position_detail.php?id=\d+\&keywords=&tid=0&lid=0'), callback='parse_item'),  # this expression is wrong
        Rule(LinkExtractor(restrict_xpaths=("//table[@class='tablelist']//td[@class='l square']")), callback='parse_item'),
        # Pagination
        Rule(LinkExtractor(restrict_xpaths=("//a[@id='next']")), follow=True),
    )

    def parse_item(self, response):
        item = TencentItem()
        item['title'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()
        # Index 1 matches the step-by-step snippets above; index 0 is presumably the label text
        item['addr'] = response.xpath('//tr[@class="c bottomline"]/td[1]//text()').extract()[1]
        item['position'] = response.xpath('//tr[@class="c bottomline"]/td[2]//text()').extract()[1]
        item['num'] = response.xpath('//tr[@class="c bottomline"]/td[3]//text()').extract()[1]
        item['skill'] = response.xpath('//ul[@class="squareli"]/li/text()').extract()
        print(dict(item))
        return item

tencent.py

2. Data storage

1. In the settings.py configuration file, set the following:

ROBOTSTXT_OBEY = False

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

ITEM_PIPELINES = {
    # The dotted path starts with the project package, which is Tencent here
    'Tencent.pipelines.TencentPipeline': 300,
}
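The number attached to each pipeline is its priority: items pass through pipelines from the lowest value to the highest, and values are customarily kept in the 0 to 1000 range. The short sketch below shows the resulting order, assuming the project package is named Tencent (as created by startproject); CleanupPipeline is a hypothetical second pipeline added only for illustration.

```python
# Pipeline settings as a plain dict; only TencentPipeline exists in this project.
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.CleanupPipeline': 100,  # hypothetical, for illustration
}

# Lower numbers run first, so sort the dotted paths by their priority value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With a single pipeline the exact number does not matter, but it decides the order as soon as a second one is added.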

2. In items.py:

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    addr = scrapy.Field()
    position = scrapy.Field()
    num = scrapy.Field()
    skill = scrapy.Field()

3. In pipelines.py:

import pymongo


class TencentPipeline(object):
    def open_spider(self, spider):
        # Connect to the database when the spider opens
        self.client = pymongo.MongoClient()
        self.collection = self.client.tencent.ten

    def process_item(self, item, spider):
        # Save the item to MongoDB (insert_one replaces the deprecated insert)
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Close the database connection when the spider finishes
        self.client.close()
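Scrapy calls the pipeline hooks by name, so open_spider, process_item, and close_spider must be spelled exactly; a misspelled method is simply never called. The lifecycle can be tried without a running MongoDB using in-memory stand-ins. Everything below is a sketch: FakeClient, FakeCollection, DemoPipeline, and run are hypothetical names, and run only approximates the order in which the hooks are invoked.

```python
class FakeCollection:
    """In-memory stand-in for a MongoDB collection (hypothetical)."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FakeClient:
    """In-memory stand-in for a database client (hypothetical)."""

    def __init__(self):
        self.closed = False
        self.collection = FakeCollection()

    def close(self):
        self.closed = True


class DemoPipeline:
    def open_spider(self, spider):
        # Acquire the connection once, when the spider starts.
        self.client = FakeClient()
        self.collection = self.client.collection

    def process_item(self, item, spider):
        # Store a plain dict copy and pass the item on unchanged.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Release the connection once, when the spider finishes.
        self.client.close()


def run(pipeline, items, spider=None):
    """Invoke the hooks in lifecycle order; optional hooks are skipped if absent."""
    if hasattr(pipeline, "open_spider"):
        pipeline.open_spider(spider)
    results = [pipeline.process_item(item, spider) for item in items]
    if hasattr(pipeline, "close_spider"):
        pipeline.close_spider(spider)
    return results


pipeline = DemoPipeline()
run(pipeline, [{"title": "Job A"}, {"title": "Job B"}])
print(pipeline.collection.docs, pipeline.client.closed)
```

Renaming close_spider to anything else in DemoPipeline leaves client.closed as False, which is how a typo in a hook name shows up in practice.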

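Running the project

1. Start the MongoDB server first.

2. Run the crawl command:

scrapy crawl tencent

3. After the run, each scraped item is printed to the console and stored in the ten collection of the tencent database.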
