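Scrapy in Practice 2: Distributed Crawling of Lagou Job Postings (uses fake-useragent, a free library that supplies random User-Agent strings dynamically; for usage see https://github.com/hellysmile/fake-useragent)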

items.py

 # -*- coding: utf-8 -*-
 # Define here the models for your scraped items
 #
 # See documentation in:
 # http://doc.scrapy.org/en/latest/topics/items.html

 import scrapy


 class LagouItem(scrapy.Item):
     # ID (MD5 of the page URL)
     obj_id = scrapy.Field()
     # Position name
     positon_name = scrapy.Field()
     # Work location
     work_place = scrapy.Field()
     # Publish date
     publish_time = scrapy.Field()
     # Salary
     salary = scrapy.Field()
     # Work experience
     work_experience = scrapy.Field()
     # Education
     education = scrapy.Field()
     # Full-time or part-time
     full_time = scrapy.Field()
     # Tags
     tags = scrapy.Field()
     # Company name
     company_name = scrapy.Field()
     # Industry (disabled)
     # industry = scrapy.Field()
     # Job perks
     job_temptation = scrapy.Field()
     # Job description
     job_desc = scrapy.Field()
     # Company logo URL
     logo_image = scrapy.Field()
     # Business field
     field = scrapy.Field()
     # Development stage
     stage = scrapy.Field()
     # Company size
     company_size = scrapy.Field()
     # Company home page
     home = scrapy.Field()
     # Job publisher
     job_publisher = scrapy.Field()
     # Investors
     financeOrg = scrapy.Field()
     # Crawl time
     crawl_time = scrapy.Field()

spiders>lagou.py

 # -*- coding: utf-8 -*-
 import scrapy
 from scrapy.linkextractors import LinkExtractor
 from scrapy.spiders import CrawlSpider, Rule
 from LaGou.items import LagouItem
 from LaGou.utils.MD5 import get_md5
 from datetime import datetime


 class LagouSpider(CrawlSpider):
     name = 'lagou'
     allowed_domains = ['lagou.com']
     start_urls = ['https://www.lagou.com/zhaopin/']

     # Job detail pages; dots are escaped so they match literally
     content_links = LinkExtractor(allow=(r"https://www\.lagou\.com/jobs/\d+\.html"))
     # Paginated listing pages
     page_links = LinkExtractor(allow=(r"https://www\.lagou\.com/zhaopin/\d+"))

     rules = (
         Rule(content_links, callback="parse_item", follow=False),
         Rule(page_links, follow=True)
     )

     def parse_item(self, response):
         item = LagouItem()
         # Use the MD5 of the page URL as the ID
         item["obj_id"] = get_md5(response.url)
         # Company name
         item["company_name"] = response.xpath('//dl[@class="job_company"]//a/img/@alt').extract()[0]
         # Position name
         item["positon_name"] = response.xpath('//div[@class="job-name"]//span[@class="name"]/text()').extract()[0]
         # Salary
         item["salary"] = response.xpath('//dd[@class="job_request"]//span[1]/text()').extract()[0]
         # Work location
         work_place = response.xpath('//dd[@class="job_request"]//span[2]/text()').extract()[0]
         item["work_place"] = work_place.replace("/", "")
         # Work experience
         work_experience = response.xpath('//dd[@class="job_request"]//span[3]/text()').extract()[0]
         item["work_experience"] = work_experience.replace("/", "")
         # Education
         education = response.xpath('//dd[@class="job_request"]//span[4]/text()').extract()[0]
         item["education"] = education.replace("/", "")
         # Full-time or part-time
         item['full_time'] = response.xpath('//dd[@class="job_request"]//span[5]/text()').extract()[0]
         # Tags
         tags = response.xpath('//dd[@class="job_request"]//li[@class="labels"]/text()').extract()
         item["tags"] = ",".join(tags)
         # Publish time
         item["publish_time"] = response.xpath('//dd[@class="job_request"]//p[@class="publish_time"]/text()').extract()[0]
         # Job perks
         job_temptation = response.xpath('//dd[@class="job-advantage"]/p/text()').extract()
         item["job_temptation"] = ",".join(job_temptation)
         # Job description
         job_desc = response.xpath('//dd[@class="job_bt"]/div//p/text()').extract()
         item["job_desc"] = ",".join(job_desc).replace("\xa0", "").strip()
         # Job publisher
         item["job_publisher"] = response.xpath('//div[@class="publisher_name"]//span[@class="name"]/text()').extract()[0]
         # Company logo URL (strip the leading "//" of the protocol-relative URL)
         logo_image = response.xpath('//dl[@class="job_company"]//a/img/@src').extract()[0]
         item["logo_image"] = logo_image.replace("//", "")
         # Business field
         field = response.xpath('//ul[@class="c_feature"]//li[1]/text()').extract()
         item["field"] = "".join(field).strip()
         # Development stage
         stage = response.xpath('//ul[@class="c_feature"]//li[2]/text()').extract()
         item["stage"] = "".join(stage).strip()
         # Investors (this list entry is optional)
         financeOrg = response.xpath('//ul[@class="c_feature"]//li[3]/p/text()').extract()
         if financeOrg:
             item["financeOrg"] = "".join(financeOrg)
         else:
             item["financeOrg"] = ""
         # Company size: the 4th entry when investors are listed, otherwise the 3rd
         if financeOrg:
             company_size = response.xpath('//ul[@class="c_feature"]//li[4]/text()').extract()
             item["company_size"] = "".join(company_size).strip()
         else:
             company_size = response.xpath('//ul[@class="c_feature"]//li[3]/text()').extract()
             item["company_size"] = "".join(company_size).strip()
         # Company home page
         item["home"] = response.xpath('//ul[@class="c_feature"]//li/a/@href').extract()[0]
         # Crawl time
         item["crawl_time"] = datetime.now()
         yield item
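
The spider imports `get_md5` from `LaGou.utils.MD5`, a helper that this post does not list. Below is a minimal sketch of what such a helper might look like, assuming it simply hashes the URL with `hashlib` and returns the hex digest. The real implementation in the repository may differ.

```python
import hashlib


def get_md5(url):
    """Return the 32-character hex MD5 digest of a URL.

    Assumed stand-in for LaGou.utils.MD5.get_md5, which the post does not show.
    """
    # hashlib works on bytes, so encode str input first.
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()
```

Because the digest has a fixed length and the same URL always yields the same value, it is convenient as a primary key for the crawled record.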

pipelines.py

 # -*- coding: utf-8 -*-
 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

 import pymysql


 class LagouPipeline(object):
     def process_item(self, item, spider):
         con = pymysql.connect(host="127.0.0.1", user="root", passwd="", db="lagou", charset="utf8")
         cur = con.cursor()
         sql = ("insert into lagouwang(obj_id,company_name,positon_name,salary,work_place,"
                "work_experience,education,full_time,tags,publish_time,job_temptation,job_desc,"
                "job_publisher,logo_image,field,stage,financeOrg,company_size,home,crawl_time)"
                "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
         lis = (item["obj_id"], item["company_name"], item["positon_name"], item["salary"],
                item["work_place"], item["work_experience"], item["education"], item['full_time'],
                item["tags"], item["publish_time"], item["job_temptation"], item["job_desc"],
                item["job_publisher"], item["logo_image"], item["field"], item["stage"],
                item["financeOrg"], item["company_size"], item["home"], item["crawl_time"])
         cur.execute(sql, lis)
         con.commit()
         cur.close()
         con.close()
         return item
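
The pipeline above opens and closes a MySQL connection for every item. Scrapy also calls `open_spider` and `close_spider` on a pipeline, so the connection can be created once per crawl instead. The following is a minimal sketch of that pattern. To keep it runnable without a MySQL server it uses the standard-library `sqlite3` module as a stand-in (an assumption, not what the original uses), a shortened three-column table, and `?` placeholders where pymysql would use `%s`. The class name is made up for illustration.

```python
import sqlite3


class SingleConnectionPipeline:
    """Keeps one database connection open for the whole crawl."""

    # Shortened column list for illustration; the real table has 20 columns.
    COLUMNS = ("obj_id", "company_name", "salary")

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.con = sqlite3.connect(":memory:")
        self.con.execute(
            "CREATE TABLE lagouwang ("
            "obj_id TEXT PRIMARY KEY, company_name TEXT, salary TEXT)"
        )

    def process_item(self, item, spider):
        sql = "INSERT INTO lagouwang (obj_id, company_name, salary) VALUES (?, ?, ?)"
        self.con.execute(sql, tuple(item[col] for col in self.COLUMNS))
        self.con.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.con.close()
```

With pymysql the shape stays the same: call `pymysql.connect(...)` in `open_spider`, reuse a cursor in `process_item`, and close the connection in `close_spider`.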

middlewares.py

 from scrapy import signals
 import random
 #from LaGou.settings import USER_AGENTS
 from fake_useragent import UserAgent


 class RandomUserAgent(object):
     # def __init__(self,crawl):
     #     super(RandomUserAgent,self).__init__()
     #     self.ua=UserAgent()

     def process_request(self, request, spider):
         #useragent = random.choice(USER_AGENTS)
         # A new UserAgent object is created for every request;
         # ua.random returns a randomly chosen browser User-Agent string
         ua = UserAgent()
         request.headers.setdefault("User-Agent", ua.random)
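
Creating a `UserAgent` object on every request repeats the same setup work each time. A minimal sketch of a variant that builds it once and reuses it is shown below. The class name and the `ua_source` parameter are made up for illustration; `ua_source` only exists so the class can be tried out without fake-useragent installed, and any object with a `random` attribute will do.

```python
class RandomUserAgentMiddleware:
    """Downloader middleware that reuses one User-Agent source for all requests."""

    def __init__(self, ua_source=None):
        if ua_source is None:
            # Imported lazily so the class can be exercised without the package.
            from fake_useragent import UserAgent
            ua_source = UserAgent()
        self.ua = ua_source

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy uses this hook to build the middleware.
        return cls()

    def process_request(self, request, spider):
        # setdefault leaves a User-Agent alone if the request already carries one.
        request.headers.setdefault("User-Agent", self.ua.random)
        return None
```

To use it, the entry in `DOWNLOADER_MIDDLEWARES` would point at this class instead of `RandomUserAgent`.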

settings.py

 # -*- coding: utf-8 -*-
 # Scrapy settings for LaGou project
 #
 # For simplicity, this file contains only settings considered important or
 # commonly used. You can find more settings consulting the documentation:
 #
 #     http://doc.scrapy.org/en/latest/topics/settings.html
 #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

 BOT_NAME = 'LaGou'

 SPIDER_MODULES = ['LaGou.spiders']
 NEWSPIDER_MODULE = 'LaGou.spiders'

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 #USER_AGENT = 'LaGou (+http://www.yourdomain.com)'

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = False

 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 #CONCURRENT_REQUESTS = 32

 # Configure a delay for requests for the same website (default: 0)
 # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
 # See also autothrottle settings and docs
 DOWNLOAD_DELAY = 5
 # The download delay setting will honor only one of:
 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 #CONCURRENT_REQUESTS_PER_IP = 16

 # Disable cookies (enabled by default)
 COOKIES_ENABLED = False

 # USER_AGENTS = [
 #     "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
 #     "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
 #     "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
 #     "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
 #     "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
 #     "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
 #     "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
 #     "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
 #    ]

 # Disable Telnet Console (enabled by default)
 #TELNETCONSOLE_ENABLED = False

 # Override the default request headers:
 #DEFAULT_REQUEST_HEADERS = {
 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 #   'Accept-Language': 'en',
 #}

 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
 SCHEDULER_PERSIST = True

 # Enable or disable spider middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
 #SPIDER_MIDDLEWARES = {
 #    'LaGou.middlewares.LagouSpiderMiddleware': 543,
 #}

 # Enable or disable downloader middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 DOWNLOADER_MIDDLEWARES = {
     'LaGou.middlewares.RandomUserAgent': 1,
 #    'LaGou.middlewares.MyCustomDownloaderMiddleware': 543,
 }

 # Enable or disable extensions
 # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
 #EXTENSIONS = {
 #    'scrapy.extensions.telnet.TelnetConsole': None,
 #}

 # Configure item pipelines
 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
 ITEM_PIPELINES = {
     'scrapy_redis.pipelines.RedisPipeline': 300,
     #'LaGou.pipelines.LagouPipeline': 300,
 }

 # Enable and configure the AutoThrottle extension (disabled by default)
 # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
 #AUTOTHROTTLE_ENABLED = True
 # The initial download delay
 #AUTOTHROTTLE_START_DELAY = 5
 # The maximum download delay to be set in case of high latencies
 #AUTOTHROTTLE_MAX_DELAY = 60
 # The average number of requests Scrapy should be sending in parallel to
 # each remote server
 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

 # Enable and configure HTTP caching (disabled by default)
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 #HTTPCACHE_ENABLED = True
 #HTTPCACHE_EXPIRATION_SECS = 0
 #HTTPCACHE_DIR = 'httpcache'
 #HTTPCACHE_IGNORE_HTTP_CODES = []
 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
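
The settings above switch the scheduler and duplicate filter to scrapy-redis but do not say where Redis runs, so scrapy-redis falls back to a server on the local machine. For a crawl spread over several machines, each worker would need to point at one shared Redis instance. A sketch of the extra lines, assuming a hypothetical Redis host at `192.168.1.100` (this address is an illustration, not from the original project):

```python
# Hypothetical address of the machine running Redis; replace with your own.
REDIS_HOST = "192.168.1.100"
REDIS_PORT = 6379
```

Every worker that shares these two values pulls requests from the same queue and checks the same fingerprint set, which is what makes the crawl distributed.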

Data in Redis:

Data in MySQL:
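
With `RedisPipeline` enabled, scrapy-redis pushes each item as JSON onto a Redis list, whose default key for this spider is `lagou:items`. The post does not show how the data then reaches MySQL, so the following is only a sketch of one possible approach: pop items from the list and turn each into a row in the column order used by `pipelines.py`. The function names `item_to_row` and `drain` are made up for illustration, and `client` is assumed to be a redis-py client (any object with an `lpop` method works).

```python
import json

# Column order matches the INSERT statement in pipelines.py.
COLUMNS = (
    "obj_id", "company_name", "positon_name", "salary", "work_place",
    "work_experience", "education", "full_time", "tags", "publish_time",
    "job_temptation", "job_desc", "job_publisher", "logo_image", "field",
    "stage", "financeOrg", "company_size", "home", "crawl_time",
)


def item_to_row(raw):
    """Decode one JSON-encoded item and return its values in column order."""
    item = json.loads(raw)
    # Missing fields become empty strings so the row always has 20 values.
    return tuple(item.get(col, "") for col in COLUMNS)


def drain(client, key, handle_row):
    """Pop items from the Redis list `key` until it is empty.

    Calls handle_row(row) for each item and returns how many were processed.
    """
    count = 0
    while True:
        raw = client.lpop(key)  # returns None once the list is empty
        if raw is None:
            break
        handle_row(item_to_row(raw))
        count += 1
    return count
```

For example, `drain(redis.Redis(host="127.0.0.1", port=6379), "lagou:items", rows.append)` would collect the rows in a list, while passing a function that calls `cur.execute(sql, row)` would write them to MySQL.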

Disclaimer: the above is intended only for reference, study, and discussion. More: https://github.com/huwei86/spiderlagou
