What is scrapy-redis

Although the Scrapy framework is asynchronous and highly concurrent, a plain Scrapy project runs on a single host, so crawl throughput is limited. The scrapy-redis library builds on Scrapy and provides a Redis-backed distributed request queue, scheduler, and duplicate filter, and an existing single-machine Scrapy spider needs only very small changes to use it. With scrapy-redis, several hosts can be combined to work on one crawl job together, which raises overall throughput. Combined with Scrapyd and Gerapy, distributed deployment and operation of the spiders also become straightforward.
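To get a feel for how small the change is, here is a rough sketch (illustrative names only, not this project's final code) of turning a single-machine CrawlSpider into a scrapy-redis spider: swap the base class and replace start_urls with a redis_key.

    # Minimal sketch: converting a CrawlSpider into a distributed scrapy-redis spider.
    # All names below are placeholders for illustration.
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy_redis.spiders import RedisCrawlSpider


    class MySpider(RedisCrawlSpider):              # was: class MySpider(CrawlSpider)
        name = 'myspider'
        # start_urls = ['https://example.com/list?page=0']   # no longer hard-coded
        redis_key = 'myspider:start_urls'          # start URLs are popped from this Redis list
        rules = (
            Rule(LinkExtractor(allow=r"page=\d+"), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            yield {'url': response.url}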

Target task

Use scrapy-redis to crawl the job listings at https://hr.tencent.com/position.php?&start= . The fields to extract are: position name, detail link, position category, number of openings, work location, publish time, and the detailed job requirements.

Installing the crawler

    pip install scrapy
    pip install scrapy-redis
  • Python version 3.7, Scrapy version 1.6.0, scrapy-redis version 0.6.8
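To confirm the versions in your own environment, a quick check from Python is (a minimal sketch; the version numbers in the comments are just the ones used in this article):

    # Print the installed versions of the packages used in this article.
    import scrapy
    import scrapy_redis
    import redis  # redis-py client, installed as a dependency of scrapy-redis

    print(scrapy.__version__)        # e.g. 1.6.0
    print(scrapy_redis.__version__)  # e.g. 0.6.8
    print(redis.__version__)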

Creating the crawler

    # Create the project
    scrapy startproject TencentSpider
    # Create the spider
    cd TencentSpider
    scrapy genspider -t crawl tencent tencent.com
  • Spider name: tencent; allowed domain: tencent.com; spider template: crawl

Writing items.py

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    import scrapy


    class TencentspiderItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # position name
        positionname = scrapy.Field()
        # detail link
        positionlink = scrapy.Field()
        # position category
        positionType = scrapy.Field()
        # number of openings
        peopleNum = scrapy.Field()
        # work location
        workLocation = scrapy.Field()
        # publish time
        publishTime = scrapy.Field()
        # job details
        positiondetail = scrapy.Field()
  • Defines the item fields that need to be scraped
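For reference, a scrapy.Item behaves like a dict, so filling and serializing one of these items looks like this (a standalone sketch; run it from the project root so the import resolves):

    # Illustrative use of TencentspiderItem outside the spider.
    from TencentSpider.items import TencentspiderItem

    item = TencentspiderItem()
    item['positionname'] = 'example position'
    item['workLocation'] = 'Shenzhen'
    print(dict(item))  # {'positionname': 'example position', 'workLocation': 'Shenzhen'}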

Writing spiders/tencent.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_redis.spiders import RedisCrawlSpider
    # import the CrawlSpider class and Rule
    from scrapy.spiders import CrawlSpider, Rule
    # import the link extractor class used to pull out the links that match a rule
    from scrapy.linkextractors import LinkExtractor
    from TencentSpider.items import TencentspiderItem


    class TencentSpider(RedisCrawlSpider):  # a plain scrapy crawler would inherit from CrawlSpider
        name = 'tencent'
        # allowed_domains = ['tencent.com']
        allowed_domains = ['hr.tencent.com']
        # a plain scrapy crawler would define start_urls here and would have no redis_key variable
        # start_urls = ['https://hr.tencent.com/position.php?&start=0#a']
        redis_key = 'tencent:start_urls'
        # extraction rule for links inside a response; returns the list of matching link objects
        pagelink = LinkExtractor(allow=r"start=\d+")
        rules = (
            # request every extracted link, keep following new pages, and handle each response with the given callback
            Rule(pagelink, callback='parse_item', follow=True),
        )
        # The rules attribute of CrawlSpider extracts URLs straight from the response text and creates new requests automatically.
        # Unlike Spider, CrawlSpider has already overridden the parse method.
        # When `scrapy crawl spidername` starts, requests are built from start_urls and sent;
        # parse handles each response, the rules pull matching links out of the HTML (or XML) text,
        # new requests are generated from those links, and the loop continues until no more matching
        # links are found or the scheduler runs out of Request objects.
        # If the start URLs need different handling, you can override parse_start_url(self, response),
        # which parses the response of the first URL, but that is optional.

        def parse_item(self, response):
            # print(response.request.headers)
            items = []
            url1 = "https://hr.tencent.com/"
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                # initialize the item object
                item = TencentspiderItem()
                # position name
                try:
                    item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0].strip()
                except BaseException:
                    item['positionname'] = ""
                # detail link
                try:
                    item['positionlink'] = "{0}{1}".format(url1, each.xpath("./td[1]/a/@href").extract()[0].strip())
                except BaseException:
                    item['positionlink'] = ""
                # position category
                try:
                    item['positionType'] = each.xpath("./td[2]/text()").extract()[0].strip()
                except BaseException:
                    item['positionType'] = ""
                # number of openings
                try:
                    item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0].strip()
                except BaseException:
                    item['peopleNum'] = ""
                # work location
                try:
                    item['workLocation'] = each.xpath("./td[4]/text()").extract()[0].strip()
                except BaseException:
                    item['workLocation'] = ""
                # publish time
                try:
                    item['publishTime'] = each.xpath("./td[5]/text()").extract()[0].strip()
                except BaseException:
                    item['publishTime'] = ""
                items.append(item)
                # yield item
            for item in items:
                yield scrapy.Request(url=item['positionlink'], meta={'meta_1': item}, callback=self.second_parseTencent)

        def second_parseTencent(self, response):
            item = TencentspiderItem()
            meta_1 = response.meta['meta_1']
            item['positionname'] = meta_1['positionname']
            item['positionlink'] = meta_1['positionlink']
            item['positionType'] = meta_1['positionType']
            item['peopleNum'] = meta_1['peopleNum']
            item['workLocation'] = meta_1['workLocation']
            item['publishTime'] = meta_1['publishTime']
            tmp = []
            tmp.append(response.xpath("//tr[@class='c']")[0])
            tmp.append(response.xpath("//tr[@class='c']")[1])
            positiondetail = ''
            for i in tmp:
                positiondetail_title = i.xpath("./td[1]/div[@class='lightblue']/text()").extract()[0].strip()
                positiondetail = positiondetail + positiondetail_title
                positiondetail_detail = i.xpath("./td[1]/ul[@class='squareli']/li/text()").extract()
                positiondetail = positiondetail + ' '.join(positiondetail_detail) + ' '
            # positiondetail_title = response.xpath("//div[@class='lightblue']").extract()
            # positiondetail_detail = response.xpath("//ul[@class='squareli']").extract()
            # positiondetail = positiondetail_title[0] + '\n' + positiondetail_detail[0] + '\n' + positiondetail_title[1] + '\n' + positiondetail_detail[1]
            item['positiondetail'] = positiondetail.strip()
            yield item
  • The main logic of the spider
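To see what the pagination rule actually matches, the LinkExtractor can be exercised on a hand-built response (a standalone sketch; the HTML body below is made up for illustration):

    # Demonstrating the rule allow=r"start=\d+" on a fabricated response.
    from scrapy.http import HtmlResponse
    from scrapy.linkextractors import LinkExtractor

    body = b'<a href="position.php?&start=10#a">page 2</a> <a href="position_detail.php?id=1">detail</a>'
    response = HtmlResponse(url="https://hr.tencent.com/position.php?&start=0#a",
                            body=body, encoding="utf-8")
    for link in LinkExtractor(allow=r"start=\d+").extract_links(response):
        print(link.url)  # only the ?start=10 pagination link matches the rule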

Writing pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json


    class TencentspiderPipeline(object):
        """
        Purpose: save the scraped item data
        """
        def __init__(self):
            self.filename = open("tencent.json", "w", encoding='utf-8')

        def process_item(self, item, spider):
            try:
                text = json.dumps(dict(item), ensure_ascii=False) + "\n"
                self.filename.write(text)
            except BaseException as e:
                print(e)
            return item

        def close_spider(self, spider):
            self.filename.close()
  • Processes the items scraped from each page
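Because the pipeline writes one JSON object per line (JSON Lines), the output file can be read back with a few lines of Python (a sketch; tencent.json exists only after a crawl has produced items):

    # Reading the JSON Lines file produced by TencentspiderPipeline.
    import json

    with open("tencent.json", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(record.get("positionname"), record.get("workLocation"))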

Writing middlewares.py

    # -*- coding: utf-8 -*-

    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    import scrapy
    from scrapy import signals
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    import random


    class TencentspiderSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.
            # Must return only requests (not items).
            for r in start_requests:
                yield r

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)


    class TencentspiderDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)


    class MyUserAgentMiddleware(UserAgentMiddleware):
        """
        Set a random User-Agent on each request
        """
        def __init__(self, user_agent):
            self.user_agent = user_agent

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                user_agent=crawler.settings.get('MY_USER_AGENT')
            )

        def process_request(self, request, spider):
            agent = random.choice(self.user_agent)
            request.headers['User-Agent'] = agent
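To spot-check the random User-Agent middleware outside a full crawl, it can be exercised on a bare Request (a sketch; run it from the project root so TencentSpider.middlewares is importable, and the UA strings below are just examples):

    # Spot-checking MyUserAgentMiddleware on a plain Request object.
    from scrapy import Request
    from TencentSpider.middlewares import MyUserAgentMiddleware

    mw = MyUserAgentMiddleware(user_agent=[
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ])
    request = Request("https://hr.tencent.com/position.php?&start=0#a")
    mw.process_request(request, spider=None)
    print(request.headers['User-Agent'])  # one of the two strings above (as bytes), chosen at random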

Writing settings.py

    # -*- coding: utf-8 -*-

    # Scrapy settings for TencentSpider project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    BOT_NAME = 'TencentSpider'
    SPIDER_MODULES = ['TencentSpider.spiders']
    NEWSPIDER_MODULE = 'TencentSpider.spiders'

    # A plain Scrapy project has none of the following five Redis-related settings.
    # Use the scrapy_redis duplicate filter, which deduplicates requests in the Redis database (required)
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scrapy_redis scheduler, which hands out requests through Redis (required)
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Keep the scrapy-redis queues in Redis instead of clearing them, so a crawl can be paused and resumed (optional)
    SCHEDULER_PERSIST = True
    # Connection parameters for the Redis database (required)
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    DUPEFILTER_DEBUG = True
    # scrapy-redis stores everything in Redis as key-value pairs; the common keys are:
    # 1. "<spider name>:items"      -> list, the scraped items as JSON strings
    # 2. "<spider name>:dupefilter" -> set, 40-character URL hash strings used for deduplication
    # 3. "<spider name>:start_urls" -> list, the first URL(s) the spider crawls when it starts
    # 4. "<spider name>:requests"   -> zset, serialized Request objects handled by the scheduler

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'TencentSpider (+http://www.yourdomain.com)'
    # List of User-Agent strings to choose from at random
    MY_USER_AGENT = [
  33. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
  34. "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4",
  35. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21",
  36. "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
  37. "Mozilla/5.0 (Windows NT 6.2; rv:30.0) Gecko/20150101 Firefox/32.0",
  38. "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
  39. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
  40. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
  41. "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
  42. "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2)",
  43. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
  44. "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
  45. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
  46. "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
  47. "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
  48. "Mozilla/4.0 (compatib1e; MSIE 6.1; Windows NT)",
  49. "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
  50. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618)",
  51. "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
  52. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
  53. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
  54. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
  55. "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
  56. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; Media Center PC 6.0)",
  57. "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36",
  58. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
  59. "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0",
  60. "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)",
  61. "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
  62. "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
  63. "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20100101 Firefox/17.0",
  64. "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36",
  65. "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)",
  66. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
  67. "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
  68. "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
  69. "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)",
  70. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",
  71. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
  72. "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)",
  73. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
  74. "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
  75. "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
  76. "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
  77. "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
  78. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763",
  79. "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.10 Safari/537.36",
  80. "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
  81. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
  82. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
  83. "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
  84. "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
  85. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
  86. "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
  87. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
  88. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
  89. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
  90. "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.21 (KHTML, like Gecko) Version/9.2 Safari/602.1.21",
  91. "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
  92. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)",
  93. "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
  94. ]
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip,deflate,br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'pragma': 'no-cache',
        'upgrade-insecure-requests': '1',
        'host': 'hr.tencent.com'
    }

    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'TencentSpider.middlewares.TencentspiderSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'TencentSpider.middlewares.TencentspiderDownloaderMiddleware': None,
        'TencentSpider.middlewares.MyUserAgentMiddleware': 543,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
    }

    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'TencentSpider.pipelines.TencentspiderPipeline': 300,
        # RedisPipeline writes each item into the Redis list "<spider name>:items" so it can be
        # processed later in a distributed fashion; it is provided by scrapy-redis, no extra code needed.
        'scrapy_redis.pipelines.RedisPipeline': 100
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    LOG_LEVEL = 'DEBUG'
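Once a crawl has run, the Redis keys described in the comments above can be inspected with the redis-py client (a small sketch, assuming Redis on 127.0.0.1:6379 and the spider name tencent):

    # Inspecting the keys that scrapy-redis maintains for the 'tencent' spider.
    import redis

    conn = redis.Redis(host='127.0.0.1', port=6379)
    for key in ('tencent:items', 'tencent:dupefilter', 'tencent:requests', 'tencent:start_urls'):
        print(key, conn.type(key))      # b'list', b'set', b'zset', or b'none' if the key is absent
    print(conn.llen('tencent:items'))   # number of scraped items stored so far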

Setting up Redis

Here we set up a single-node Redis on Windows; if you need the Linux version, look it up yourself. Download page: https://github.com/rgl/redis/downloads . Pick the latest release that matches your machine; here I chose redis-2.4.6-setup-64-bit.exe. Double-click to install, then add C:\Program Files\Redis to the system environment variables. The configuration file is C:\Program Files\Redis\conf\redis.conf. Start the Redis server with: redis-server. Start the Redis client with: redis-cli.
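Before starting the spiders it is worth confirming that Redis is reachable, either with redis-cli ping or from Python via redis-py (the client scrapy-redis itself uses); a minimal check:

    # Verify that the local Redis server is up before launching any spiders.
    import redis

    conn = redis.Redis(host='127.0.0.1', port=6379)
    print(conn.ping())  # True if the server answers; raises ConnectionError otherwise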

Running the crawler

Starting the spiders

    cd TencentSpider
    scrapy crawl tencent
  • TencentSpider is the project folder; tencent is the spider name
  • At this point the spider sits idle, waiting for start URLs.
  • You can start multiple spider instances on this machine or on other hosts; the only requirement is that every host can connect to Redis.

Setting start_urls

    # redis-cli
    redis 127.0.0.1:6379> lpush tencent:start_urls https://hr.tencent.com/position.php?&start=0#a
    (integer) 1
    redis 127.0.0.1:6379>

Or run the following script:

    # -*- coding: utf-8 -*-
    # Push the start URL into Redis so that the waiting spiders can begin crawling.
    import redis

    if __name__ == '__main__':
        conn = redis.Redis(host='127.0.0.1', port=6379)
        # REDIS_START_URLS_AS_SET in settings defaults to False:
        # False means the start URLs are stored in a Redis list, True means they are stored in a set.
        # list
        conn.lpush('tencent:start_urls', 'https://hr.tencent.com/position.php?&start=0#a')
        # set
        # conn.sadd('tencent:start_urls', 'https://hr.tencent.com/position.php?&start=0#a')
        # conn.close()  # closing the connection is not required
  • tencent:start_urls is the value of the redis_key variable in spiders/tencent.py
  • After a short wait all the spiders start crawling; once the crawl is finished, stop them with Ctrl+C.

The results are saved both in the Redis key tencent:items and in the tencent.json file in the project root directory; the content looks like this:

  1. {"positionname": "29302-服务采购商务岗", "positionlink": "https://hr.tencent.com/position_detail.php?id=49345&keywords=&tid=0&lid=0", "positionType": "职能类", "peopleNum": "1", "workLocation": "深圳", "publishTime": "2019-04-12", "positiondetail": "工作职责:• 负责相关产品和品类采购策略的制订及实施; • 负责相关产品及品类的采购运作管理,包括但不限于需求理解、供应商开发及选择、供应资源有效管理、商务谈判、成本控制、交付管理、组织验收等 • 支持业务部门的采购需求; • 收集、分析市场及行业相关信息,为采购决策提供依据。 工作要求:• 认同腾讯企业文化理念,正直、进取、尽责; • 本科或以上学历,管理、传媒、经济或其他相关专业,市场营销及内容类产品运营工作背景者优先; • 五年以上工作经验,对采购理念和采购过程管理具有清晰的认知和深刻的理解;拥有二年以上营销/设计采购、招标相关类管理经验; • 熟悉采购运作及管理,具有独立管理重大采购项目的经验,具有较深厚的采购专业知识; • 具备良好的组织协调和沟通能力、学习能力和团队合作精神强,具有敬业精神,具备较强的分析问题和解决问题的能力; • 了解IP及新文创行业现状及发展,熟悉市场营销相关行业知识和行业运作特点; • 具有良好的英语听说读写能力,英语可作为工作语言;同时有日语听说读写能力的优先; • 具备良好的文档撰写能力。计算机操作能力强,熟练使用MS OFFICE办公软件和 ERP 等软件的熟练使用。"}
  2. {"positionname": "CSIG16-自动驾驶高精地图(地图编译)", "positionlink": "https://hr.tencent.com/position_detail.php?id=49346&keywords=&tid=0&lid=0", "positionType": "技术类", "peopleNum": "1", "workLocation": "北京", "publishTime": "2019-04-12", "positiondetail": "工作职责:地图数据编译工具软件开发 工作要求: 硕士以上学历,2年以上工作经验,计算机、测绘、GIS、数学等相关专业;  精通C++编程,编程基础扎实;  熟悉常见数据结构,有较复杂算法设计经验;  精通数据库编程,如MySQL、sqlite等;  有实际的地图项目经验,如地图tile、大地坐标系、OSM等;  至少熟悉一种地图数据规格,如MIF、NDS、OpenDrive等;  有较好的数学基础,熟悉几何和图形学基本算法,;  具备较好的沟通表达能力和团队合作意识。"}
  3. {"positionname": "32032-资深特效美术设计师(上海)", "positionlink": "https://hr.tencent.com/position_detail.php?id=49353&keywords=&tid=0&lid=0", "positionType": "设计类", "peopleNum": "1", "workLocation": "上海", "publishTime": "2019-04-12", "positiondetail": "工作职责:负责游戏3D和2D特效制作,制作规范和技术标准的制定; 与项目组开发人员深入沟通,准确实现项目开发需求。 工作要求:5年以上端游、手游特效制作经验,熟悉UE4引擎; 能熟练使用相关软件和引擎工具制作高品质的3D特效; 善于使用第三方软件制作高品质序列资源,用于引擎特效; 可以总结自己的方法论和经验用于新人和带领团队; 对游戏开发和技术有热情和追求,有责任心,善于团队合作,沟通能力良好,应聘简历须附带作品。"}
  4. ......
  5. ......
  6. ......
  • This crawler is not guaranteed to keep working: if the source site changes, it will break.
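Since RedisPipeline stores every item as a JSON string in the tencent:items list, the results can also be pulled back out of Redis for further processing (a sketch, assuming the same local Redis instance):

    # Reading the scraped items back out of the Redis list written by RedisPipeline.
    import json
    import redis

    conn = redis.Redis(host='127.0.0.1', port=6379)
    for raw in conn.lrange('tencent:items', 0, -1):
        item = json.loads(raw)
        print(item['positionname'], item['workLocation'])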
