Scrapy in Practice 2: Distributed Crawling of Lagou Job Postings (now with fake-useragent, a free library that supplies random User-Agent strings; for usage see https://github.com/hellysmile/fake-useragent)

items.py

 # -*- coding: utf-8 -*-
 # Define here the models for your scraped items
 #
 # See documentation in:
 # http://doc.scrapy.org/en/latest/topics/items.html

 import scrapy


 class LagouItem(scrapy.Item):
     # define the fields for your item here like:
     # name = scrapy.Field()

     # ID (MD5 of the job page URL)
     obj_id = scrapy.Field()
     # Position name (the "positon" spelling matches the MySQL column)
     positon_name = scrapy.Field()
     # Work location
     work_place = scrapy.Field()
     # Publish date
     publish_time = scrapy.Field()
     # Salary
     salary = scrapy.Field()
     # Work experience
     work_experience = scrapy.Field()
     # Education
     education = scrapy.Field()
     # Full-time or part-time
     full_time = scrapy.Field()
     # Tags
     tags = scrapy.Field()
     # Company name
     company_name = scrapy.Field()
     # # Industry
     # industry = scrapy.Field()
     # Job perks
     job_temptation = scrapy.Field()
     # Job description
     job_desc = scrapy.Field()
     # Company logo URL
     logo_image = scrapy.Field()
     # Business field
     field = scrapy.Field()
     # Development stage
     stage = scrapy.Field()
     # Company size
     company_size = scrapy.Field()
     # Company homepage
     home = scrapy.Field()
     # Job publisher
     job_publisher = scrapy.Field()
     # Investors
     financeOrg = scrapy.Field()
     # Crawl time
     crawl_time = scrapy.Field()

spiders>lagou.py

 # -*- coding: utf-8 -*-
 from datetime import datetime

 from scrapy.linkextractors import LinkExtractor
 from scrapy.spiders import CrawlSpider, Rule

 from LaGou.items import LagouItem
 from LaGou.utils.MD5 import get_md5


 class LagouSpider(CrawlSpider):
     name = 'lagou'
     allowed_domains = ['lagou.com']
     start_urls = ['https://www.lagou.com/zhaopin/']

     # Job detail pages; the dots are escaped so they only match a literal "."
     content_links = LinkExtractor(allow=r"https://www\.lagou\.com/jobs/\d+\.html")
     # Paginated listing pages
     page_links = LinkExtractor(allow=r"https://www\.lagou\.com/zhaopin/\d+")

     rules = (
         Rule(content_links, callback="parse_item", follow=False),
         Rule(page_links, follow=True),
     )

     def parse_item(self, response):
         # extract_first("") returns "" instead of raising IndexError
         # when a node is missing from the page.
         item = LagouItem()
         # Use the MD5 of the job page URL as the ID
         item["obj_id"] = get_md5(response.url)
         # Company name
         item["company_name"] = response.xpath('//dl[@class="job_company"]//a/img/@alt').extract_first("")
         # Position name
         item["positon_name"] = response.xpath('//div[@class="job-name"]//span[@class="name"]/text()').extract_first("")
         # Salary
         item["salary"] = response.xpath('//dd[@class="job_request"]//span[1]/text()').extract_first("")
         # Work location
         work_place = response.xpath('//dd[@class="job_request"]//span[2]/text()').extract_first("")
         item["work_place"] = work_place.replace("/", "")
         # Work experience
         work_experience = response.xpath('//dd[@class="job_request"]//span[3]/text()').extract_first("")
         item["work_experience"] = work_experience.replace("/", "")
         # Education
         education = response.xpath('//dd[@class="job_request"]//span[4]/text()').extract_first("")
         item["education"] = education.replace("/", "")
         # Full-time or part-time
         item["full_time"] = response.xpath('//dd[@class="job_request"]//span[5]/text()').extract_first("")
         # Tags
         tags = response.xpath('//dd[@class="job_request"]//li[@class="labels"]/text()').extract()
         item["tags"] = ",".join(tags)
         # Publish time
         item["publish_time"] = response.xpath('//dd[@class="job_request"]//p[@class="publish_time"]/text()').extract_first("")
         # Job perks
         job_temptation = response.xpath('//dd[@class="job-advantage"]/p/text()').extract()
         item["job_temptation"] = ",".join(job_temptation)
         # Job description
         job_desc = response.xpath('//dd[@class="job_bt"]/div//p/text()').extract()
         item["job_desc"] = ",".join(job_desc).replace("\xa0", "").strip()
         # Job publisher
         item["job_publisher"] = response.xpath('//div[@class="publisher_name"]//span[@class="name"]/text()').extract_first("")
         # Company logo URL (strip the protocol-relative "//" prefix)
         logo_image = response.xpath('//dl[@class="job_company"]//a/img/@src').extract_first("")
         item["logo_image"] = logo_image.replace("//", "")
         # Business field
         field = response.xpath('//ul[@class="c_feature"]//li[1]/text()').extract()
         item["field"] = "".join(field).strip()
         # Development stage
         stage = response.xpath('//ul[@class="c_feature"]//li[2]/text()').extract()
         item["stage"] = "".join(stage).strip()
         # Investors: the third <li> holds them only when the company lists any,
         # which shifts the company size from the third <li> to the fourth.
         financeOrg = response.xpath('//ul[@class="c_feature"]//li[3]/p/text()').extract()
         if financeOrg:
             item["financeOrg"] = "".join(financeOrg)
             company_size = response.xpath('//ul[@class="c_feature"]//li[4]/text()').extract()
         else:
             item["financeOrg"] = ""
             company_size = response.xpath('//ul[@class="c_feature"]//li[3]/text()').extract()
         # Company size
         item["company_size"] = "".join(company_size).strip()
         # Company homepage
         item["home"] = response.xpath('//ul[@class="c_feature"]//li/a/@href').extract_first("")
         # Crawl time
         item["crawl_time"] = datetime.now()
         yield item
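
The spider imports `get_md5` from `LaGou/utils/MD5.py`, which is not listed in this post. Below is a minimal sketch of what such a helper might look like, assuming it simply returns the hexadecimal MD5 digest of the URL. The function body is an assumption for illustration, not the author's actual code.

```python
import hashlib


def get_md5(url):
    """Return the 32-character hexadecimal MD5 digest of a URL."""
    # hashlib works on bytes, so encode text input first.
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


if __name__ == "__main__":
    # Hypothetical job URL, used only to show the call.
    print(get_md5("https://www.lagou.com/jobs/123456.html"))
```

Hashing the URL gives a fixed-length key, which is convenient as a primary key column regardless of how long the URL is.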

pipelines.py

 # -*- coding: utf-8 -*-
 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

 import pymysql


 class LagouPipeline(object):

     def open_spider(self, spider):
         # Open one connection for the whole crawl instead of one per item.
         self.con = pymysql.connect(host="127.0.0.1", user="root", passwd="", db="lagou", charset="utf8")

     def close_spider(self, spider):
         self.con.close()

     def process_item(self, item, spider):
         sql = ("insert into lagouwang(obj_id,company_name,positon_name,salary,work_place,work_experience,education,full_time,tags,publish_time,job_temptation,job_desc,job_publisher,logo_image,field,stage,financeOrg,company_size,home,crawl_time) "
                "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
         # Values in the same order as the column list above.
         lis = (item["obj_id"], item["company_name"], item["positon_name"], item["salary"], item["work_place"], item["work_experience"], item["education"], item["full_time"], item["tags"], item["publish_time"], item["job_temptation"], item["job_desc"], item["job_publisher"], item["logo_image"], item["field"], item["stage"], item["financeOrg"], item["company_size"], item["home"], item["crawl_time"])
         cur = self.con.cursor()
         try:
             # Parameterized query: the driver escapes the values.
             cur.execute(sql, lis)
             self.con.commit()
         finally:
             cur.close()
         return item
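
The INSERT statement in the pipeline lists 20 columns, 20 placeholders, and 20 values by hand, so the three can easily drift apart. One way to keep them in sync is to derive all of them from a single column tuple. This is a rough sketch only: the names `COLUMNS`, `build_insert`, and `item_to_row` are made up for illustration, while the table and column names are copied from the pipeline above.

```python
# Column names copied from the INSERT statement in pipelines.py.
COLUMNS = (
    "obj_id", "company_name", "positon_name", "salary", "work_place",
    "work_experience", "education", "full_time", "tags", "publish_time",
    "job_temptation", "job_desc", "job_publisher", "logo_image", "field",
    "stage", "financeOrg", "company_size", "home", "crawl_time",
)


def build_insert(table, columns):
    """Return an INSERT statement with one %s placeholder per column."""
    placeholders = ",".join(["%s"] * len(columns))
    return "insert into {}({}) VALUES ({})".format(
        table, ",".join(columns), placeholders)


def item_to_row(item, columns):
    """Return the item's values as a tuple, in column order."""
    return tuple(item[name] for name in columns)


if __name__ == "__main__":
    print(build_insert("lagouwang", COLUMNS))
```

With these helpers, the pipeline body could shrink to something like `cur.execute(build_insert("lagouwang", COLUMNS), item_to_row(item, COLUMNS))`. Only the values go through placeholders; the table and column names are formatted into the string, so they must come from trusted code, never from scraped data.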

middlewares.py

 #from LaGou.settings import USER_AGENTS
 from fake_useragent import UserAgent


 class RandomUserAgent(object):

     def __init__(self):
         # Load the User-Agent data once instead of on every request.
         self.ua = UserAgent()

     def process_request(self, request, spider):
         #useragent = random.choice(USER_AGENTS)
         # setdefault only sets the header if the request does not have one yet.
         request.headers.setdefault("User-Agent", self.ua.random)
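
If fake-useragent is unavailable or fails to load its data, the middleware has nothing to fall back on. Below is a sketch of a variant that falls back to a small static list, in the spirit of the commented-out `USER_AGENTS` list in settings.py. The class name `FallbackUserAgent`, its `ua_source` parameter, and `FALLBACK_USER_AGENTS` are hypothetical; the two User-Agent strings are copied from that list.

```python
import random
from types import SimpleNamespace

# Two entries copied from the commented-out USER_AGENTS list in settings.py.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]


class FallbackUserAgent(object):
    """Hypothetical middleware: random User-Agent with a static fallback."""

    def __init__(self, ua_source=None):
        # ua_source is any zero-argument callable returning a User-Agent string.
        self.ua_source = ua_source

    @classmethod
    def from_crawler(cls, crawler):
        # Use fake-useragent when it can be imported and initialised.
        try:
            from fake_useragent import UserAgent
            ua = UserAgent()
        except Exception:
            return cls()
        return cls(ua_source=lambda: ua.random)

    def pick(self):
        # Prefer the dynamic source; fall back to the static list on any error.
        if self.ua_source is not None:
            try:
                return self.ua_source()
            except Exception:
                pass
        return random.choice(FALLBACK_USER_AGENTS)

    def process_request(self, request, spider):
        request.headers.setdefault("User-Agent", self.pick())


if __name__ == "__main__":
    # A stand-in object with a headers dict, just to exercise the method.
    request = SimpleNamespace(headers={})
    FallbackUserAgent().process_request(request, spider=None)
    print(request.headers["User-Agent"])
```

To try it, the entry in `DOWNLOADER_MIDDLEWARES` would point at this class instead of `RandomUserAgent`.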

settings.py

 # -*- coding: utf-8 -*-
 # Scrapy settings for LaGou project
 #
 # For simplicity, this file contains only settings considered important or
 # commonly used. You can find more settings consulting the documentation:
 #
 #     http://doc.scrapy.org/en/latest/topics/settings.html
 #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

 BOT_NAME = 'LaGou'

 SPIDER_MODULES = ['LaGou.spiders']
 NEWSPIDER_MODULE = 'LaGou.spiders'

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 #USER_AGENT = 'LaGou (+http://www.yourdomain.com)'

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = False

 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 #CONCURRENT_REQUESTS = 32

 # Configure a delay for requests for the same website (default: 0)
 # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
 # See also autothrottle settings and docs
 DOWNLOAD_DELAY = 5
 # The download delay setting will honor only one of:
 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 #CONCURRENT_REQUESTS_PER_IP = 16

 # Disable cookies (enabled by default)
 COOKIES_ENABLED = False

 # USER_AGENTS = [
 #     "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
 #     "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
 #     "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
 #     "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
 #     "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
 #     "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
 #     "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
 #     "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
 #    ]

 # Disable Telnet Console (enabled by default)
 #TELNETCONSOLE_ENABLED = False

 # Override the default request headers:
 #DEFAULT_REQUEST_HEADERS = {
 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 #   'Accept-Language': 'en',
 #}

 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
 SCHEDULER_PERSIST = True

 # Enable or disable spider middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
 #SPIDER_MIDDLEWARES = {
 #    'LaGou.middlewares.LagouSpiderMiddleware': 543,
 #}

 # Enable or disable downloader middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 DOWNLOADER_MIDDLEWARES = {
     'LaGou.middlewares.RandomUserAgent': 1,
 #    'LaGou.middlewares.MyCustomDownloaderMiddleware': 543,
 }

 # Enable or disable extensions
 # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
 #EXTENSIONS = {
 #    'scrapy.extensions.telnet.TelnetConsole': None,
 #}

 # Configure item pipelines
 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
 ITEM_PIPELINES = {
     'scrapy_redis.pipelines.RedisPipeline': 300,
     #'LaGou.pipelines.LagouPipeline': 300,
 }

 # Enable and configure the AutoThrottle extension (disabled by default)
 # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
 #AUTOTHROTTLE_ENABLED = True
 # The initial download delay
 #AUTOTHROTTLE_START_DELAY = 5
 # The maximum download delay to be set in case of high latencies
 #AUTOTHROTTLE_MAX_DELAY = 60
 # The average number of requests Scrapy should be sending in parallel to
 # each remote server
 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

 # Enable and configure HTTP caching (disabled by default)
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 #HTTPCACHE_ENABLED = True
 #HTTPCACHE_EXPIRATION_SECS = 0
 #HTTPCACHE_DIR = 'httpcache'
 #HTTPCACHE_IGNORE_HTTP_CODES = []
 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
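
The settings above enable the scrapy-redis scheduler and dupefilter but never say where Redis runs, so the crawl presumably talks to a Redis server on the local machine. For several machines to share one queue, each worker's settings.py would need to point at the same server. A minimal fragment, assuming the `REDIS_HOST` and `REDIS_PORT` settings of scrapy-redis; the address shown is a placeholder, not a value from the original project.

```python
# Assumed values: replace with the address of the Redis server
# that all crawler machines can reach.
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```

Every worker then pulls requests from, and pushes items to, the same Redis instance, which is what makes the crawl distributed.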

Data in Redis:

Data in MySQL:
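
With `RedisPipeline` enabled, items land in Redis first. Getting them into the MySQL table means mapping each stored item to a row in the column order used by pipelines.py. Here is a rough sketch of that mapping step, assuming each item is stored as a JSON object whose keys are the item field names. The helper name `redis_item_to_row` and the sample values are hypothetical, and the Redis and MySQL calls themselves are left out.

```python
import json

# Column order copied from the INSERT statement in pipelines.py.
COLUMNS = (
    "obj_id", "company_name", "positon_name", "salary", "work_place",
    "work_experience", "education", "full_time", "tags", "publish_time",
    "job_temptation", "job_desc", "job_publisher", "logo_image", "field",
    "stage", "financeOrg", "company_size", "home", "crawl_time",
)


def redis_item_to_row(raw):
    """Decode one JSON-encoded item and return its values in column order."""
    # Redis clients usually return bytes, so decode before parsing.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    # Missing fields become empty strings so the row always has 20 values.
    return tuple(data.get(name, "") for name in COLUMNS)


if __name__ == "__main__":
    # Made-up sample data, not real crawl output.
    sample = b'{"obj_id": "abc123", "salary": "10k-20k"}'
    print(redis_item_to_row(sample))
```

The resulting tuple can be passed as the parameter list of the same INSERT statement that `LagouPipeline` uses.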

Disclaimer: the code above is for reference, study, and discussion only. More: https://github.com/huwei86/spiderlagou
