<Scrapy crawler> Scraping Tencent job postings

1. Create the Scrapy project

In a command prompt (DOS window), run:

scrapy startproject tencent

cd tencent

2. Write items.py (this works like a template: the data to be scraped is defined here)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Job title
    positionname = scrapy.Field()
    # Link to the posting
    positionlink = scrapy.Field()
    # Job category
    positionType = scrapy.Field()
    # Number of openings
    positionNum = scrapy.Field()
    # Work location
    positioncation = scrapy.Field()
    # Publish date
    positionTime = scrapy.Field()

3. Create the spider file

In a command prompt (DOS window), run:

scrapy genspider myspider tencent.com

4. Write myspider.py (receives the responses and processes the data)

# -*- coding: utf-8 -*-
import scrapy

from tencent.items import TencentItem


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['tencent.com']
    url = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Job rows alternate between class="even" and class="odd"
        for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
            # Initialize the item object
            item = TencentItem()
            # Job title
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # Link to the posting
            item['positionlink'] = 'http://hr.tencent.com/' + each.xpath("./td[1]/a/@href").extract()[0]
            # Job category
            item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            # Number of openings
            item['positionNum'] = each.xpath("./td[3]/text()").extract()[0]
            # Work location
            item['positioncation'] = each.xpath("./td[4]/text()").extract()[0]
            # Publish date
            item['positionTime'] = each.xpath("./td[5]/text()").extract()[0]
            yield item

        # Request the next page until the last offset has been reached
        if self.offset < 2820:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
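The pagination in parse() amounts to stepping the start parameter by 10 until it reaches 2820. As a rough sketch that does not need Scrapy, the URLs visited can be listed like this. The helper name page_urls is made up for illustration, and 2820 is simply the value hard-coded in the spider, not something read from the site:

```python
BASE_URL = 'https://hr.tencent.com/position.php?&start='


def page_urls(last_offset=2820, step=10):
    """Return the listing URLs the spider would request, in order."""
    # range() excludes its stop value, so add one step to include last_offset
    return [BASE_URL + str(offset) for offset in range(0, last_offset + step, step)]


urls = page_urls()
print(len(urls))  # number of listing pages
print(urls[0])
print(urls[-1])
```

If the number of postings changes, last_offset would have to be adjusted by hand, which is one reason a fixed limit like this is fragile.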

5. Write pipelines.py (stores the data)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    def __init__(self):
        # Opened in binary mode, so text is encoded as UTF-8 before writing
        self.filename = open('tencent.json', 'wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        # Scrapy passes the spider instance to this hook
        self.filename.close()
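Note that every line the pipeline writes ends with a comma, so tencent.json as a whole is not a valid JSON document. One possible alternative, sketched here with only the standard library, is to write one JSON object per line so the file can be read back line by line. The function names and the sample record are hypothetical and only illustrate the idea:

```python
import io
import json


def write_items(fp, items):
    """Write each item as one JSON object per line (no trailing comma)."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + '\n')


def read_items(fp):
    """Parse the lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# In-memory text buffer standing in for the output file
buffer = io.StringIO()
sample = [{'positionname': '示例职位', 'positionNum': '2'}]  # made-up record
write_items(buffer, sample)
buffer.seek(0)
restored = read_items(buffer)
print(restored == sample)
```

In the pipeline itself this would mean dropping the ',' from the string appended in process_item.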

6. Edit settings.py (headers, pipelines, and so on)

robots.txt protocol

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

headers

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

pipelines

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

7. Run the crawler

In a command prompt (DOS window), run:

scrapy crawl myspider

Result:

Checking the debug output:

2019-02-18 16:02:22 [scrapy.core.scraper] ERROR: Spider error processing <GET https://hr.tencent.com/position.php?&start=520> (referer: https://hr.tencent.com/position.php?&start=510)
Traceback (most recent call last):
  File "E:\software\ANACONDA\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\123\tencent\tencent\spiders\myspider.py", line 22, in parse
    item['positionType'] = each.xpath("./td[2]/text()").extract()[0]

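Checking the web page:

This posting is missing one attribute (so many tricks!). Its td[2] cell has no text, so extract() returns an empty list and indexing it with [0] fails.

So change this one line in myspider.py:

item['positionType'] = each.xpath("./td[2]/text()").extract()[0]

Add a check, changing it to:

if len(each.xpath("./td[2]/text()").extract()) > 0:
    item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
else:
    item['positionType'] = "None"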
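The check above evaluates the same XPath twice. A small helper can express "first match, or a fallback" in one place. The sketch below uses plain lists in place of extract() results, and first_or_default is a name invented here for illustration. Scrapy's selector lists also provide extract_first(default=...), which may serve the same purpose without a helper:

```python
def first_or_default(values, default="None"):
    """Return the first element of values, or default if values is empty."""
    return values[0] if values else default


# Plain lists standing in for what extract() would return
print(first_or_default(['技术类']))  # a row with a category
print(first_or_default([]))          # a row whose category cell is empty
```

Applied to every field rather than only positionType, the same pattern would keep a single malformed row from aborting the rest of the page.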

Result:

Compare with the last page on the site:

The crawl succeeded!
