Python3 crawler (1): crawling the Lagou company list

Analysis before crawling:

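The target site is Lagou (拉勾网), and the goal is to collect information on every company listed there. Inspecting the requests the browser sends when paging through the company list shows that all the data is delivered as JSON, so as long as we send the POST request correctly, we get the company list data back.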
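As a minimal sketch of that idea, the request for one page and the handling of its JSON reply could look like the following. The URL pattern, the parameter names, and the page size of 16 are taken from the script in this post; the helper names `build_request`, `parse_page`, and `fetch_page` are hypothetical and only serve the illustration.

```python
import json

BASE_URL = 'http://www.lagou.com/gongsi/'
PAGE_SIZE = 16  # the script in this post treats 16 companies as a full page


def build_request(area_id, page_num):
    """Return the (url, params) pair for one page of a city's company list."""
    url = BASE_URL + str(area_id) + '-0-0.json'
    params = {'first': 'false', 'pn': str(page_num), 'sortField': '0', 'havemark': '0'}
    return url, params


def parse_page(text):
    """Return (companies, has_more) from the JSON text of one response."""
    companies = json.loads(text).get('result', [])
    # A full page suggests that another page may follow
    return companies, len(companies) == PAGE_SIZE


def fetch_page(area_id, page_num):
    """Send the POST request for one page (needs network access and requests)."""
    import requests  # imported here so the helpers above work without it
    url, params = build_request(area_id, page_num)
    r = requests.post(url, params=params, timeout=10)
    r.raise_for_status()
    return parse_page(r.text)
```

Keeping the request building and the parsing apart from the network call makes both easy to check offline, for example by passing a saved response body to `parse_page`.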

Without further ado, here is the code:

import os
import json
import requests
import datetime
from pyquery import PyQuery as pq
from openpyxl import Workbook
from openpyxl import load_workbook

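# Get the list of city IDs
def get_cityId_list(url):
    city_list = []
    html = pq(url=url)
    for areaId in html.find('#filterCollapse').find('div[class="has-more workcity"]').eq(0).find('div[class="more more-positions"]').find("a[data-lg-tj-cid='idnull']"):
        # The city ID is what remains of the link after removing the site prefix and the '-0-0#filterBox' suffix
        aId = pq(areaId).attr('href').replace('http://www.lagou.com/gongsi/', '').replace('-0-0#filterBox', '')
        if aId == '0':  # skip ID '0', which appears to be the nationwide ("全国") entry
            continue
        city_list.append(aId)
    return city_list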

# Get the list of city names
def get_city_name_list(url):
    city_name_list = []
    html = pq(url=url)
    for areaId in html.find('#filterCollapse').find('div[class="has-more workcity"]').eq(0).find('div[class="more more-positions"]').find("a[data-lg-tj-cid='idnull']"):
        area_name = pq(areaId).html()
        if area_name == "全国":  # skip the nationwide entry
            continue
        city_name_list.append(area_name)
    return city_name_list

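# Get the number of pages for a city.
# On the normal path the return value is one past the last page, so it can be used directly as the end of range().
def get_city_page(areaId, page_num):
    try:
        param = {'first': 'false', 'pn': page_num, 'sortField': '0', 'havemark': '0'}  # request parameters
        r = requests.post('http://www.lagou.com/gongsi/' + areaId + '-0-0.json', params=param)  # send the request with requests
        page_num += 1
        # A full page holds 16 companies, so a full page means there may be another one
        if len(r.json()['result']) == 16:
            return get_city_page(areaId, page_num)
        else:
            return page_num
    except Exception:
        return page_num - 1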

# Get all company information for a city ID
def get_company_list(areaId):
    company_list = []
    city_page_total = get_city_page(areaId, 1)
    for pageIndex in range(1, city_page_total):
        print('Crawling page ' + str(pageIndex))
        json_url = 'http://www.lagou.com/gongsi/' + areaId + '-0-0.json'
        param = {'first': 'false', 'pn': str(pageIndex), 'sortField': '0', 'havemark': '0'}  # request parameters
        r = requests.post(json_url, params=param)  # send the request with requests
        try:
            # Parse inside the try block so that a non-JSON response is reported instead of crashing the crawl
            msg = json.loads(r.text)
            for company in msg['result']:
                company_list.append([company['city'],company['cityScore'],company['companyFeatures'],company['companyId'],company['companyLabels'],company['companyLogo'],company['companyName'],str(company['companyPositions']),company['companyShortName'],company['countryScore'],company['createTime'],company['finaceStage'],company['industryField'],company['interviewRemarkNum'],company['otherLabels'], company['positionNum'],company['processRate'],str(datetime.datetime.now())])
        except Exception:
            print('Error on page ' + str(pageIndex) + ' of city ID ' + str(areaId) + '; the response was: ' + r.text)
            continue
    return company_list

# Write the data to an Excel file
def write_file(fileName):
    file_name = fileName + '.xlsx'
    wb = Workbook()
    url = 'http://www.lagou.com/gongsi/'
    # Create one sheet per city, then save the empty workbook once
    for area_name in get_city_name_list(url):
        wb.create_sheet(title=area_name)
    wb.save(file_name)
    areaId_list = get_cityId_list(url)
    for areaId in areaId_list:
        company_list = get_company_list(areaId)
        if not company_list:  # nothing came back for this city
            continue
        print('Crawling ----->****' + company_list[0][0] + '**** company list')
        wb1 = load_workbook(file_name)
        ws = wb1[company_list[0][0]]
        # Header row, in the same order as the 16 fields written below
        ws.append(['City', 'City score', 'Company features', 'Company ID', 'Company labels', 'Company logo', 'Company name', 'Company positions', 'Company short name', 'Country score', 'Created at', 'Financing stage', 'Industry', 'Interview remark count', 'Other labels', 'Position count'])
        # Write this city's rows inside the city loop, otherwise only the last city is written
        for company in company_list:
            ws.append([company[0],str(company[1]),company[2],str(company[3]),company[4],company[5],company[6],company[7],company[8],company[9],company[10],company[11],company[12],company[13],company[14],company[15]])
        wb1.save(file_name)  # save once per city instead of once per row

file_name = input('Please enter the file name: ')
print(str(datetime.datetime.now()))  # start time
write_file(file_name)
print(str(datetime.datetime.now()))  # end time
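The script above first walks the pages recursively to count them and then requests every page a second time to read the companies. One possible alternative, sketched here and not part of the original script, is a single loop that stops at the first page that is not full. `iter_companies` and its `fetch_page` argument are hypothetical names: `fetch_page(page_num)` is assumed to return the list of companies on that page, and `max_pages` is an assumed safety limit.

```python
def iter_companies(fetch_page, page_size=16, max_pages=1000):
    """Yield companies page by page until a page comes back short.

    fetch_page is any callable that takes a 1-based page number and
    returns the list of companies on that page.
    """
    for page_num in range(1, max_pages + 1):
        companies = fetch_page(page_num)
        yield from companies
        # A page with fewer than page_size entries is taken to be the last one
        if len(companies) < page_size:
            break
```

With this shape every page is requested only once, and there is no recursion depth to worry about for cities with many pages.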

A few closing remarks:

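Recruitment sites like this one are aimed at everyone, and in my run the crawler was not restricted, so the crawl went through without trouble.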

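Crawling all the company data took me 45 minutes. Since the data set is fairly small, I did not consider a multi-process crawler. The Excel file ends up with about 27k companies, far fewer than the number advertised on the official site. I am not sure whether that is because many companies are not verified.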
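For a larger data set, the per-city crawl could be spread over several workers. The sketch below is not what was used for this post: it swaps in a thread pool from the standard library instead of multiple processes, on the assumption that the work is mostly waiting on the network. `crawl_cities` and `crawl_one` are hypothetical names, and `max_workers=4` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_cities(area_ids, crawl_one, max_workers=4):
    """Run crawl_one(area_id) for every city on a small thread pool.

    Returns a dict mapping each area ID to whatever crawl_one returned.
    """
    area_ids = list(area_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as area_ids
        results = list(pool.map(crawl_one, area_ids))
    return dict(zip(area_ids, results))
```

With the functions from the script, a call might look like `crawl_cities(get_cityId_list(url), get_company_list)`, with the Excel writing left to a single thread afterwards.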
Finally, here is a screenshot of the crawled Excel file:
