Tencent social recruitment Scrapy spider --- solved
1. Create a new tencent project with scrapy.
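From the shell:

scrapy startproject tencent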

2. In items.py, define the fields we want to crawl.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # position name
    position_name = scrapy.Field()
    # detail page link
    position_link = scrapy.Field()
    # position category
    position_type = scrapy.Field()
    # number of openings
    people_number = scrapy.Field()
    # work location
    work_location = scrapy.Field()
    # publish date
    publish_time = scrapy.Field()
3. From inside the project directory, create a spider named tencent_spider and specify the domain it is allowed to crawl.
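The genspider command takes the spider name and the allowed domain:

scrapy genspider tencent_spider tencent.com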

4. Open tencent_spider.py; the scaffold has already been generated, so it only needs a few modifications:
# -*- coding: utf-8 -*-
import scrapy

from tencent.items import TencentItem


class TencentSpiderSpider(scrapy.Spider):
    name = "tencent_spider"
    allowed_domains = ["tencent.com"]

    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [
        url + str(offset),
    ]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # initialize an item object (field names must match items.py)
            item = TencentItem()
            # position name
            item['position_name'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail page link
            item['position_link'] = each.xpath("./td[1]/a/@href").extract()[0]
            # position category
            item['position_type'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['people_number'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['work_location'] = each.xpath("./td[4]/text()").extract()[0]
            # publish date
            item['publish_time'] = each.xpath("./td[5]/text()").extract()[0]
            yield item

        # after finishing one page, bump self.offset by 10, join it to the
        # base url, and send the next-page request with self.parse as the
        # callback to handle the Response
        if self.offset < 1000:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
5. In pipelines.py, write the items to a file.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentPipeline(object):
    def open_spider(self, spider):
        self.filename = open("tencent.json", "w")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # (this line is missing its closing parenthesis -- the bug resolved below)
        self.filename.write(text.encode("utf-8")
        return item

    def close_spider(self, spider):
        self.filename.close()
6. In settings.py, set the request headers and enable the pipeline.
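A minimal sketch of the relevant settings.py entries (the DOWNLOAD_DELAY of 2 matches the log below; the User-Agent string and the priority value 300 are illustrative choices):

DOWNLOAD_DELAY = 2

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
}

ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}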
7. Run the spider from the command line with the following command.
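From the project root:

scrapy crawl tencent_spider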

The following error showed up... at that point still unsolved...
2017-10-03 20:03:20 [scrapy] INFO: Scrapy 1.1.1 started (bot: tencent)
2017-10-03 20:03:20 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tencent.spiders', 'SPIDER_MODULES': ['tencent.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'tencent'}
2017-10-03 20:03:20 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-10-03 20:03:20 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-03 20:03:20 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2017-10-03 20:03:20 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 163, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 167, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 90, in crawl
six.reraise(*exc_info)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
self.scraper = Scraper(crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
exceptions.SyntaxError: invalid syntax (pipelines.py, line 17)
2017-10-03 20:03:20 [twisted] CRITICAL:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 90, in crawl
six.reraise(*exc_info)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
self.scraper = Scraper(crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/home/python/Desktop/tencent/tencent/pipelines.py", line 17
return item
^
SyntaxError: invalid syntax
It came down to my own carelessness: pipelines.py from step 5 was missing a parenthesis... thanks to sun shine for the help.
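For reference, the corrected line in process_item, with the closing parenthesis restored:

self.filename.write(text.encode("utf-8"))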
thank you !!!