Python Recruitment Data Analysis

Import packages

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

Read the file

df=pd.read_csv(r'C:\Users\MSI\Desktop\1.csv')

Preview the data

df.head()

View basic information

df.info()

There are nine fields and 22,739 records. Every value is a string and no field is null, so no missing-value handling is needed.
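
The "no nulls" claim can be checked directly rather than read off df.info(). A minimal sketch on a made-up frame (the sample values below are assumptions, not rows from the real CSV):

```python
import pandas as pd

# Toy frame standing in for the recruitment data (values are made up).
sample = pd.DataFrame({
    "PositionName": ["数据分析师", "数据分析师", None],
    "Salary": ["10-15万", "面议", "8-12万"],
})

# Count missing values per column; all zeros would mean no imputation is needed.
null_counts = sample.isnull().sum()
print(null_counts)

# dtypes shows how each column was read (text columns are not numeric).
print(sample.dtypes)
```

On the real data, df.isnull().sum() returning zero for all nine fields would confirm the statement above.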

Handle duplicates: drop rows that repeat the same position and company.

df.drop_duplicates(['PositionName','CompanyName'],keep='first', inplace=True)

View the information after deduplication

df.info()

21,851 records remain.
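
It may help to see what deduplicating on a subset of columns actually removes. The sketch below uses invented company names and salaries; only the column names come from the post:

```python
import pandas as pd

# Made-up rows: the first two share position and company but differ in salary.
sample = pd.DataFrame({
    "PositionName": ["数据分析师", "数据分析师", "数据分析师"],
    "CompanyName": ["A公司", "A公司", "B公司"],
    "Salary": ["10-15万", "12-18万", "面议"],
})

# Flag rows that repeat an earlier (PositionName, CompanyName) pair.
dup_mask = sample.duplicated(["PositionName", "CompanyName"], keep="first")
n_dups = int(dup_mask.sum())

# Keep only the first occurrence of each pair.
deduped = sample.drop_duplicates(["PositionName", "CompanyName"], keep="first")
print(n_dups, len(deduped))
```

Note that the second row is dropped even though its salary differs, because only the two subset columns are compared.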

Check the relative frequency of each salary value. 面议 (negotiable) accounts for a large share.

df['Salary'].str[0:].value_counts(normalize = True)

Define a helper, drops, that removes the rows whose salary is 面议.

def drops(col, tag):

    df.drop(df[df[col].str.contains(tag)].index, inplace=True)

drops('Salary', '面议')
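
The helper above modifies the global df in place. As an alternative sketch (not the author's method), a boolean mask can return a filtered copy instead; drop_rows_containing is a hypothetical name, and regex=False is used so the tag is matched literally:

```python
import pandas as pd

def drop_rows_containing(frame, col, tag):
    """Return a copy of frame without rows whose `col` contains `tag` literally."""
    mask = frame[col].str.contains(tag, regex=False, na=False)
    return frame[~mask].copy()

# Made-up salaries for illustration.
sample = pd.DataFrame({"Salary": ["10-15万", "面议", "8-12万"]})
kept = drop_rows_containing(sample, "Salary", "面议")
print(kept)
```

The original frame is left untouched, which makes the step easier to rerun in a notebook.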

Define cutWord to split each salary range into its lower and upper bound, then compute the average salary.

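def cutWord(word,method):

    # Salary strings are expected to look like "low-high" followed by one unit character

    position=word.find("-")

    length = len(word)

    if position != -1:

        bottomSalary = word[:position]

        topSalary = word[position + 1:length - 1]

    else:

        # No "-" found: use the single value for both bounds (assumes one trailing unit character)

        bottomSalary = topSalary = word[:length - 1]

    if method == 'bottom':

        return bottomSalary

    else:

        return topSalary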

df['topSalary']=df.Salary.apply(cutWord,method='top')

df['bottomSalary']=df.Salary.apply(cutWord,method='bottom')

df.topSalary=df.topSalary.astype("int")

df.bottomSalary=df.bottomSalary.astype("int")

df['avgSalary']=df.apply(lambda x:(x.bottomSalary+x.topSalary)/2,axis=1)

df['avgSalary'].value_counts()
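
The same parsing can be sketched without a row-by-row apply, using a regular expression. This assumes every salary looks like "low-high" plus a unit, which is an assumption about the data; the sample values are made up:

```python
import pandas as pd

# Made-up salary strings in the assumed "low-high<unit>" format.
sample = pd.DataFrame({"Salary": ["10-15万", "8-12万", "20-30万"]})

# Capture the two numbers around the dash; each group becomes one column.
parts = sample["Salary"].str.extract(r"(\d+)-(\d+)")
parts.columns = ["bottomSalary", "topSalary"]
parts = parts.astype(int)

# Vectorized midpoint of the range.
sample["avgSalary"] = (parts["bottomSalary"] + parts["topSalary"]) / 2
print(sample)
```

Rows that do not match the pattern would come back as missing values from str.extract, so they would need to be handled before the astype(int) step.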

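The analysis only compares provinces, but the data also contains prefecture-level cities and districts, so the City field is trimmed to keep only the province or municipality.

Define the helper newCity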

def newCity(city):

    if(len(str(city))>2):

        newcity = city[:2]

    else:

        newcity=city

    return newcity

df['newcity']=df.City.apply(newCity)
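
Keeping the first two characters can also be written with the vectorized string accessor. The city strings below are invented, and their "province-city" layout is an assumption about the City field:

```python
import pandas as pd

# Made-up City values; the real field's layout is not shown in the post.
cities = pd.Series(["北京", "上海-浦东新区", "广东-深圳", "黑龙江-哈尔滨"])

# Same effect as newCity: keep at most the first two characters.
short = cities.str[:2]
print(short.tolist())
```

One caveat of any two-character cut is that three-character names such as 黑龙江 or 内蒙古 are truncated, so they would need special handling if they occur in the data.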

Basic cleaning is done; save the result as df_clean.

df_clean = df[["PositionName", "CompanyName", "newcity", "Experience", "JobWords", "avgSalary"]]

df_clean.head()

View descriptive statistics

print(df_clean.describe())

Mean salary: 21.85W; median: 19W; maximum: 177.5W (W = 万, i.e. 10,000 yuan).

Salary distribution plot

plt.rcParams['font.sans-serif']=['SimHei']

df_clean.avgSalary.hist(bins=20)

plt.show()

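Split the Experience field. The split produced eight columns rather than the four I expected (education, experience, language, age), so I named all eight and then dropped the extras. This probably happens because some rows contain more than four space-separated pieces, and expand=True creates as many columns as the longest row needs.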

info_split=df_clean['Experience'].str.split(' ',expand=True)

info_split.columns=['education','experience','language','age','1','2','3','4']

newExp=info_split.drop(['1','2','3','4'],axis=1)
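
The extra columns can be reproduced on a toy Series. In this sketch the second row has a double space and a trailing note, both invented to mimic what may be in the real Experience field:

```python
import pandas as pd

# Made-up Experience strings; row 2 has a double space and an extra piece.
exp = pd.Series([
    "本科及以上 3年以上 英语 25-35岁",
    "大专及以上 1年以上  语言不限 年龄不限 备注",
])

# Splitting on a single space keeps the empty piece, so row 2 yields 6 parts.
wide = exp.str.split(" ", expand=True)
print(wide.shape)

# Splitting on any whitespace run drops empty pieces; keep the first four columns.
first_four = exp.str.split(expand=True).iloc[:, :4]
first_four.columns = ["education", "experience", "language", "age"]
print(first_four)
```

Slicing with iloc avoids hard-coding how many surplus columns appear, which the eight-name assignment above depends on.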

display(newExp)

display(df_clean)

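Next, concatenate the two tables and save the result as new_df. At first I concatenated them and then dropped the Experience column, but for some reason the newcity column then turned back into City and the earlier city processing was lost. So I selected the columns I wanted directly and saved that.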

newDF=pd.concat([df_clean, newExp], axis=1)

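# .copy() makes new_df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
new_df = newDF[["PositionName", "CompanyName", "newcity",'education','experience','language','age' , "JobWords", "avgSalary"]].copy()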

display(new_df)
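
When two tables are concatenated side by side, pandas matches rows by index label, not by position. A small sketch with invented values shows why both frames should share the same index, as df_clean and newExp do here:

```python
import pandas as pd

# Both frames keep the original row labels 0 and 2 (row 1 was dropped earlier).
left = pd.DataFrame({"newcity": ["北京", "上海"]}, index=[0, 2])
right = pd.DataFrame({"education": ["本科", "硕士"]}, index=[0, 2])

# Matching labels: rows line up one to one.
joined = pd.concat([left, right], axis=1)
print(joined)

# If one side had its index reset, labels no longer match and gaps appear.
misaligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
print(misaligned)
```

This matters in this analysis because drop_duplicates and drops remove rows without renumbering the index.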

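Convert the column to categorical data. Here 本科 (bachelor's) shows up under two different labels and the other values are not very intuitive, so the column is cleaned up in a later step.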

new_df.education.astype('category')

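Define newEdu to normalize the education levels. It is written rather verbosely; an earlier version somehow ended up with only 本科 and 硕士 in the result.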

def newEdu(education):

    if education == "硕士及以上":

        new_edu = "硕士"

    elif education == "统招本科":

        new_edu = "本科"

    elif education == "本科及以上":

        new_edu = "本科"

    elif education== "学历不限":

        new_edu = "不限"

    elif education== "大专及以上":

        new_edu = "大专"

    elif education == "中专/中技及以上":

        new_edu = "中专"

    else:

        new_edu="博士"

    return new_edu

new_df['new_edu'] = new_df.education.apply(newEdu)

new_df.new_edu.astype('category')
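
The if/elif chain can also be expressed as a lookup table. This is an alternative sketch, not the author's code: EDU_MAP is a hypothetical name, and unlike newEdu it leaves unknown labels as missing values instead of treating them as 博士, so unexpected labels become visible:

```python
import pandas as pd

# Raw label -> normalized label, taken from the branches of newEdu.
EDU_MAP = {
    "硕士及以上": "硕士",
    "统招本科": "本科",
    "本科及以上": "本科",
    "学历不限": "不限",
    "大专及以上": "大专",
    "中专/中技及以上": "中专",
}

# Made-up sample; the last label is deliberately not in the map.
edu = pd.Series(["硕士及以上", "统招本科", "本科及以上", "学历不限", "博士后"])

# Series.map looks each value up in the dict; misses become NaN.
mapped = edu.map(EDU_MAP)

# List the raw labels that were not covered by the map.
unmapped = edu[mapped.isna()].unique()
print(mapped.tolist(), list(unmapped))
```

Checking the unmapped labels first makes it safer to decide which of them really mean 博士.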

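Use box plots for the comparison. Their main advantage is that they are robust to outliers: they describe the typical level, the spread, and the distribution of outliers in a relatively stable way.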

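new_df['new_edu']=new_df['new_edu'].astype('category')

# set_categories returns a new Series; its inplace argument is not available in recent pandas versions
new_df['new_edu']=new_df['new_edu'].cat.set_categories(["中专", "博士", "大专", "不限", "本科", "硕士"])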

ax=new_df.boxplot(column='avgSalary',by='new_edu',figsize=(9,6))

print(new_df.groupby(new_df.new_edu).avgSalary.mean().sort_values(ascending=False))

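In Figure 1 (the box plot), the median salary for 本科 is higher than for 硕士, which could easily be misread as bachelor's graduates earning more. Figure 2 (the group means) shows, however, that the mean salary for 硕士 is far higher than for 本科. On average, then, the higher the degree, the higher the salary: knowledge changes fate.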
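
A group can have the lower median and still the higher mean when a few very large values are present. The numbers below are invented purely to illustrate that effect and are not taken from the data set:

```python
import pandas as pd

# Made-up salaries: the 硕士 group contains one very high value.
toy = pd.DataFrame({
    "new_edu": ["本科"] * 5 + ["硕士"] * 5,
    "avgSalary": [18, 19, 20, 21, 22, 15, 16, 17, 18, 100],
})

# Compare the two summary statistics side by side for each group.
stats = toy.groupby("new_edu")["avgSalary"].agg(["median", "mean"])
print(stats)
```

Printing both statistics for the real new_df would show whether the gap between Figure 1 and Figure 2 comes from a small number of high-paying postings.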

Convert the experience (years of work) column to a categorical type and draw box plots for comparison

new_df.experience.astype('category')

new_df.boxplot(column='avgSalary',by='experience',figsize=(9,6))

Compare years of experience with salary

print(new_df.groupby(new_df.experience).avgSalary.mean().sort_values(ascending=False))

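Salary is strongly related to years of experience, but the salaries of outstanding employees clearly exceed what their years of experience alone would suggest.

The effect of education on salary in Beijing and Shanghai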

df_sz_bj=new_df[new_df['newcity'].isin(['上海','北京'])]

df_sz_bj.boxplot(column='avgSalary',by=['new_edu','newcity'],figsize=[14,6])

plt.show()

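Salary is also strongly related to location: at every education level, salaries in Beijing are higher than those for the same level in Shanghai.

Top five companies by number of data-analysis postings in each city

I defined a function, topN, that counts the values passed in and returns the five most frequent in descending order. The data is then grouped by newcity; since the goal is the top five companies, topN is applied to CompanyName.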

new_df.groupby('CompanyName').avgSalary.agg(lambda x:max(x)-min(x))

def topN(df,n=5):

    counts=df.value_counts()

    return counts.sort_values(ascending=False)[:n]

print(new_df.groupby('newcity').CompanyName.apply(topN))
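
The shape of the grouped result is easier to see on a tiny example. This sketch uses invented company names and a hypothetical top_n helper that mirrors topN:

```python
import pandas as pd

# Made-up postings: two cities, three companies.
toy = pd.DataFrame({
    "newcity": ["北京", "北京", "北京", "上海", "上海"],
    "CompanyName": ["A公司", "A公司", "B公司", "C公司", "C公司"],
})

def top_n(series, n=5):
    # value_counts already sorts counts in descending order.
    return series.value_counts().head(n)

# One block of counts per city, indexed by (city, company).
result = toy.groupby("newcity")["CompanyName"].apply(top_n, n=1)
print(result)
```

The result is a Series with a two-level index, so a single count can be read with result.loc[(city, company)].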

Top five positions in demand in each city; they are mostly in the computer industry

print(new_df.groupby('newcity').PositionName.apply(topN))

Compare the salary data of Shanghai and Beijing as histograms

plt.hist(x=new_df[new_df.newcity=='上海'].avgSalary,

         bins=15,

         density=1,

         facecolor='blue',

         alpha=0.5)

plt.hist(x=new_df[new_df.newcity=='北京'].avgSalary,

         bins=15,

         density=1,

         facecolor='red',

         alpha=0.5)

plt.show()

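Build a word cloud of the job requirements. First install the wordcloud library.

Installing third-party libraries under Anaconda is rather troublesome; the mirror could not be used, so I had to download the package manually and then import it.

Inspect the data before processing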

print(new_df.JobWords)

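Reset the index and then build the word cloud. Note that df_word is never defined in this post; it is presumably the JobWords column split into one word per column.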

df_word_counts=df_word.unstack().dropna().reset_index().groupby('level_0').count()

from wordcloud import WordCloud

df_word_counts.index=df_word_counts.index.str.replace("'","")

wc=WordCloud(font_path=r'C:\Windows\Fonts\FZSTK.TTF',width=900,height=400,background_color='white')

fig,ax=plt.subplots(figsize=(20,15))

wc.fit_words(df_word_counts.level_1)

ax=plt.imshow(wc)

plt.axis('off')

plt.show()

The figure above shows that statistical analysis, mathematics, English, and Office skills are all in some demand.
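
Since the step that builds the word table is missing, here is one possible way to get word frequencies from a column like JobWords. Everything about the input format (quoted words separated by commas) is an assumption, and count_words is a hypothetical helper:

```python
from collections import Counter

import pandas as pd

# Made-up JobWords values in an assumed "'word','word'" layout.
job_words = pd.Series(["'统计分析','英语','office'", "'数学','统计分析'"])

def count_words(series, sep=","):
    """Count how often each word appears across all rows."""
    counter = Counter()
    for text in series.dropna():
        for word in text.split(sep):
            word = word.strip().strip("'")  # remove spaces and quote marks
            if word:
                counter[word] += 1
    return dict(counter)

frequencies = count_words(job_words)
print(frequencies)
```

A word-to-count dict like this is the kind of input that WordCloud.generate_from_frequencies accepts.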

Complete code

#!/usr/bin/env python

# coding: utf-8

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

df=pd.read_csv(r'C:\Users\MSI\Desktop\1.csv')

df.head()

df.info()

df.drop_duplicates(['PositionName','CompanyName'],keep='first', inplace=True)

df.info()

df['Salary'].str[0:].value_counts(normalize = True)

def drops(col, tag):

    df.drop(df[df[col].str.contains(tag)].index, inplace=True)

drops('Salary', '面议')

df['Salary'].str[0:].value_counts(normalize = True)

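def cutWord(word,method):

    # Salary strings are expected to look like "low-high" followed by one unit character

    position=word.find("-")

    length = len(word)

    if position != -1:

        bottomSalary = word[:position]

        topSalary = word[position + 1:length - 1]

    else:

        # No "-" found: use the single value for both bounds (assumes one trailing unit character)

        bottomSalary = topSalary = word[:length - 1]

    if method == 'bottom':

        return bottomSalary

    else:

        return topSalary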

df['topSalary']=df.Salary.apply(cutWord,method='top')

df['bottomSalary']=df.Salary.apply(cutWord,method='bottom')

df.topSalary=df.topSalary.astype("int")

df.bottomSalary=df.bottomSalary.astype("int")

df['avgSalary']=df.apply(lambda x:(x.bottomSalary+x.topSalary)/2,axis=1)

df['avgSalary'].value_counts()

def newCity(city):

    if(len(str(city))>2):

        newcity = city[:2]

    else:

        newcity=city

    return newcity

df['newcity']=df.City.apply(newCity)

df_clean = df[["PositionName", "CompanyName", "newcity", "Experience", "JobWords", "avgSalary"]]

df_clean.head()

print(df_clean.describe())

plt.rcParams['font.sans-serif']=['SimHei']

df_clean.avgSalary.hist(bins=20)

plt.show()

info_split=df_clean['Experience'].str.split(' ',expand=True)

info_split.columns=['education','experience','language','age','1','2','3','4']

newExp=info_split.drop(['1','2','3','4'],axis=1)

display(newExp)

display(df_clean)

newDF=pd.concat([df_clean, newExp], axis=1)

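# .copy() makes new_df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
new_df = newDF[["PositionName", "CompanyName", "newcity",'education','experience','language','age' , "JobWords", "avgSalary"]].copy()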

display(new_df)

new_df.education.astype('category')

def newEdu(education):

    if education == "硕士及以上":

        new_edu = "硕士"

    elif education == "统招本科":

        new_edu = "本科"

    elif education == "本科及以上":

        new_edu = "本科"

    elif education== "学历不限":

        new_edu = "不限"

    elif education== "大专及以上":

        new_edu = "大专"

    elif education == "中专/中技及以上":

        new_edu = "中专"

    else:

        new_edu="博士"

    return new_edu

new_df['new_edu'] = new_df.education.apply(newEdu)

new_df.new_edu.astype('category')

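new_df['new_edu']=new_df['new_edu'].astype('category')

# set_categories returns a new Series; its inplace argument is not available in recent pandas versions
new_df['new_edu']=new_df['new_edu'].cat.set_categories(["中专", "博士", "大专", "不限", "本科", "硕士"])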

ax=new_df.boxplot(column='avgSalary',by='new_edu',figsize=(9,6))

print(new_df.groupby(new_df.new_edu).avgSalary.mean().sort_values(ascending=False))

new_df.experience.astype('category')

new_df.boxplot(column='avgSalary',by='experience',figsize=(9,6))

print(new_df.groupby(new_df.experience).avgSalary.mean().sort_values(ascending=False))

df_sz_bj=new_df[new_df['newcity'].isin(['上海','北京'])]

df_sz_bj.boxplot(column='avgSalary',by=['new_edu','newcity'],figsize=[14,6])

plt.show()

new_df.groupby('CompanyName').avgSalary.agg(lambda x:max(x)-min(x))

def topN(df,n=5):

    counts=df.value_counts()

    return counts.sort_values(ascending=False)[:n]

print(new_df.groupby('newcity').CompanyName.apply(topN))

print(new_df.groupby('newcity').PositionName.apply(topN))

plt.hist(x=new_df[new_df.newcity=='上海'].avgSalary,

         bins=15,

         density=1,

         facecolor='blue',

         alpha=0.5)

plt.hist(x=new_df[new_df.newcity=='北京'].avgSalary,

         bins=15,

         density=1,

         facecolor='red',

         alpha=0.5)

plt.show()

print(new_df.JobWords)

df_word_counts=df_word.unstack().dropna().reset_index().groupby('level_0').count()

from wordcloud import WordCloud

df_word_counts.index=df_word_counts.index.str.replace("'","")

wc=WordCloud(font_path=r'C:\Windows\Fonts\FZSTK.TTF',width=900,height=400,background_color='white')

fig,ax=plt.subplots(figsize=(20,15))

wc.fit_words(df_word_counts.level_1)

ax=plt.imshow(wc)

plt.axis('off')

plt.show()

References:

https://www.jianshu.com/p/1e1081ca13b5
