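Introduction
An example of NaN
Missing values in integer types
Missing values in datetime types
Converting between None and np.nan
Calculations with missing values
Filling NaN data with fillna
Dropping data that contains NA with dropna
Interpolation
Replacing values with replace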

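Introduction

When processing data, Pandas represents values that cannot be parsed, or that are missing altogether, as NaN. This gives every cell some representation, but NaN cannot take part in ordinary arithmetic: any calculation that involves it yields NaN again.

This article explains how Pandas handles NaN data.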

An example of NaN

As mentioned above, missing data is represented as NaN. Let's look at a concrete example.

First, build a DataFrame (DF). The sessions in this article assume that numpy and pandas have been imported as np and pd:

In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

The DF above only has the index labels a, c, e, f and h. Let's reindex the data:

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

Every label that did not exist in the original data now holds NaN.

To detect NaN values, use the isna() or notna() methods:

In [7]: df2['one']
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool
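
Because isna() returns a boolean mask, it can also be used to count or select missing entries. The following is a minimal sketch on a small made-up frame (the name frame and its values are assumptions for illustration, not data from the article):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, loosely mirroring df2 above
frame = pd.DataFrame(
    {"one": [0.5, np.nan, -1.1, np.nan], "four": ["bar", None, "bar", "bar"]},
    index=["a", "b", "c", "d"],
)

# Summing the boolean mask counts the missing values in each column
missing_per_column = frame.isna().sum()

# Select the rows that contain at least one missing value
rows_with_gaps = frame[frame.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps.index.tolist())
```

A quick count like this is often a useful first step before deciding whether to fill or drop the gaps.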

Note that in Python, None compares equal to None:

In [11]: None == None                                                 # noqa: E711
Out[11]: True

But np.nan does not compare equal to itself:

In [12]: np.nan == np.nan
Out[12]: False
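
One practical consequence is that an equality test cannot find NaN values. A short sketch with an assumed example Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Comparing with == never matches, because NaN is not equal to itself
matches_by_equality = int((s == np.nan).sum())

# isna() is the reliable test for missing values
matches_by_isna = int(s.isna().sum())

print(matches_by_equality, matches_by_isna)
```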

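Missing values in integer types

NaN is a float, so an integer column that contains NaN is normally upcast to float64. To keep integer values, explicitly request the nullable integer dtype, which displays the missing value as <NA>:

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64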
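
To make the difference concrete, the sketch below builds the same values with and without the nullable dtype (the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Default inference: the NaN forces the whole Series to float64
as_float = pd.Series([1, 2, np.nan, 4])

# Nullable integer dtype: values stay integers, the gap is marked as missing
as_int = pd.Series([1, 2, np.nan, 4], dtype="Int64")

print(as_float.dtype)
print(as_int.dtype)
print(as_int.isna().tolist())
```

The string alias "Int64" (capital I) selects the same dtype as pd.Int64Dtype().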

Missing values in datetime types

Missing values in datetime data are represented by NaT. Assigning np.nan to a datetime column stores NaT, and the column keeps its datetime64[ns] dtype:

In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64
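
NaT is detected and filled with the same tools as NaN. A small sketch, using made-up timestamps as an assumption:

```python
import pandas as pd

# Hypothetical timestamps; the None entry becomes NaT after parsing
stamps = pd.Series(pd.to_datetime(["2012-01-01", None, "2012-01-03"]))

# isna() recognizes NaT just as it recognizes NaN
missing_mask = stamps.isna()

# A missing timestamp can be filled like any other missing value
filled = stamps.fillna(pd.Timestamp("2012-01-02"))

print(missing_mask.tolist())
print(filled.tolist())
```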

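Converting between None and np.nan

For numeric data, an assigned None is converted to the corresponding NaN, which here also turns the integer Series into float64:

In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]:
0    NaN
1    2.0
2    3.0
dtype: float64

For object data, the assigned value is kept as given: None stays None and np.nan stays NaN:

In [24]: s = pd.Series(["a", "b", "c"])

In [25]: s.loc[0] = None

In [26]: s.loc[1] = np.nan

In [27]: s
Out[27]:
0    None
1     NaN
2       c
dtype: object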
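
Although the two placeholders are stored differently in an object Series, pandas treats both as missing. The sketch below sets dtype=object explicitly, which is an assumption made so that the example does not depend on how a given pandas version infers string dtypes:

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly so the values are stored as plain Python objects
s = pd.Series(["a", "b", "c"], dtype=object)
s.loc[0] = None
s.loc[1] = np.nan

# Both placeholders count as missing
missing_mask = s.isna()

# dropna() removes both of them
remaining = s.dropna()

print(missing_mask.tolist())
print(remaining.tolist())
```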

Calculations with missing values

Arithmetic on a missing value yields a missing value. Operands are also aligned by label, so a column that exists in only one of them (three below) is NaN throughout the result:

In [28]: a
Out[28]:
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]:
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542
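
When a gap should count as zero rather than poison the result, the add() method with fill_value is one option. This is a sketch on two small made-up frames (left and right are assumed names, not the a and b shown above):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"one": [np.nan, 1.0], "two": [2.0, 3.0]}, index=["x", "y"])
right = pd.DataFrame(
    {"one": [10.0, 20.0], "two": [np.nan, 30.0], "three": [5.0, np.nan]},
    index=["x", "y"],
)

# The + operator propagates NaN, including for the column missing from left
plain = left + right

# add() with fill_value substitutes 0 when only one side is missing;
# a cell that is missing on both sides stays NaN
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```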

Descriptive statistics, however, skip NaN by default: sum() effectively treats NaN as 0, and mean() averages only the values that are present.

In [31]: df
Out[31]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.9853605075978744

In [33]: df.mean(1)
Out[33]:
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64
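
The skipping behavior can be controlled with the skipna and min_count arguments. A minimal sketch with assumed example values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()               # NaN contributes nothing: 1.0 + 3.0
average = s.mean()            # mean of the two present values, not of three
strict = s.sum(skipna=False)  # NaN propagates when skipping is disabled

empty = pd.Series([np.nan, np.nan])
zero_sum = empty.sum()            # sum over no valid values is 0.0
nan_sum = empty.sum(min_count=1)  # require at least one valid value, else NaN

print(total, average, strict, zero_sum, nan_sum)
```

min_count is worth considering when an all-NaN column should be reported as missing instead of as a zero total.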

Cumulative methods such as cumsum and cumprod also skip NaN by default, leaving NaN only at the positions where the input is missing. To make NaN propagate to every later position instead, pass skipna=False:

In [34]: df.cumsum()
Out[34]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  0.929249 -1.682273
e  NaN -0.114987 -2.544122
f  NaN -0.609917 -1.472318
h  NaN -1.316688 -2.511893
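
The same contrast on a three-element Series, as a small sketch with assumed values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# Default: the NaN position stays NaN, but the running total continues after it
skipping = s.cumsum()

# skipna=False: once a NaN is met, every later position is NaN as well
propagating = s.cumsum(skipna=False)

print(skipping.tolist())
print(propagating.tolist())
```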

Filling NaN data with fillna

NaN values usually have to be dealt with before analysis. One approach is to fill them using fillna.

The following fills with a constant:

In [42]: df2
Out[42]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]:
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0

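You can also specify a fill method, such as pad, which propagates the last valid value forward. Leading NaN values have nothing before them and stay NaN:

In [45]: df
Out[45]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575

Recent pandas releases deprecate the method argument of fillna in favor of the ffill() and bfill() methods.

The limit argument caps how many consecutive NaN values are filled:

In [48]: df.fillna(method='pad', limit=1)

Available fill methods:

Method	Description
pad / ffill	Fill values forward
bfill / backfill	Fill values backward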
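
The two fill directions and the limit argument can be compared side by side. The sketch below uses the ffill() and bfill() methods on an assumed example Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # carry the last valid value forward
backward = s.bfill()         # pull the next valid value backward
limited = s.ffill(limit=1)   # fill at most one consecutive NaN

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```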

You can also fill with a pandas object. For example, filling with the column means replaces each NaN with the mean of its own column:

In [53]: dff
Out[53]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

Passing only part of the means fills only those columns, so column A keeps its NaN values:

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

Filling with the full set of means, as in In [54], is equivalent to:

In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
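
The equivalence between fillna with column means and the where() form can be checked directly. The sketch below uses a small made-up frame (the name small and its values are assumptions) and also shows a per-column dict, which fillna accepts as well:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 4.0]})

# Fill each column with its own mean
by_mean = small.fillna(small.mean())

# Same result: keep existing values, take the column mean elsewhere
by_where = small.where(pd.notna(small), small.mean(), axis="columns")

# A dict gives a different fill value to each column
per_column = small.fillna({"A": 0.0, "B": small["B"].median()})

print(by_mean.equals(by_where))
print(per_column)
```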

Dropping data that contains NA with dropna

Besides filling with fillna, you can use dropna to remove data that contains NA values. With axis=0 it drops every row that contains an NA, and with axis=1 every column. In the DF below, column one is entirely NaN, so dropping rows leaves nothing:

In [57]: df
Out[57]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]:
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
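
dropna also takes how, thresh and subset arguments for less drastic filtering. A sketch on an assumed example frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "one": [np.nan, np.nan, np.nan],
        "two": [1.0, np.nan, 3.0],
        "three": [4.0, np.nan, np.nan],
    },
    index=["a", "b", "c"],
)

# Drop only the rows in which every value is missing
not_all_missing = frame.dropna(how="all")

# Drop only the columns in which every value is missing
useful_columns = frame.dropna(axis=1, how="all")

# Keep rows that have at least two non-missing values
at_least_two = frame.dropna(thresh=2)

# Only look at column "two" when deciding which rows to drop
has_two = frame.dropna(subset=["two"])

print(not_all_missing.index.tolist())
print(useful_columns.columns.tolist())
print(at_least_two.index.tolist())
print(has_two.index.tolist())
```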

Interpolation

To smooth over gaps in the data, missing values can be estimated from their neighbours with interpolate(). By default it interpolates linearly, and it is simple to use:

In [61]: ts
Out[61]:
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]:
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

interpolate() also accepts a method argument. For example, method='time' weights the interpolation by the time between index values, whereas the default treats the values as equally spaced:

In [67]: ts2
Out[67]:
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]:
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64
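
The effect of time weighting is easiest to see with round numbers. In this sketch the dates and values are assumptions chosen so that the gap lies one day after the first point and nine days before the last:

```python
import numpy as np
import pandas as pd

index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-11"])
series = pd.Series([0.0, np.nan, 10.0], index=index)

# Default linear method: the gap is treated as the midpoint of its neighbours
linear = series.interpolate()

# Time method: the gap is 1 day into a 10-day span, so it gets 1/10 of the rise
by_time = series.interpolate(method="time")

print(linear.iloc[1], by_time.iloc[1])
```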

With a float index, method='values' interpolates according to the numeric values of the index:

In [70]: ser
Out[70]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

Besides a Series, you can also interpolate a DF, in which case each column is interpolated separately:

In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

In [74]: df
Out[74]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

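interpolate() also accepts a limit argument, which caps the number of consecutive NaN values that are filled. Here ser is a different, longer Series than the one above:

In [95]: ser.interpolate(limit=1)
Out[95]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64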
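
By default the limit is applied in the forward direction. The limit_direction and limit_area arguments change that, as in this sketch. The input values are an assumption, chosen to be consistent with the output shown above:

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

# Fill one NaN after each valid value (the default direction)
forward = ser.interpolate(limit=1)

# Fill one NaN on each side of a valid value, but only between valid values
inside_both = ser.interpolate(limit=1, limit_direction="both", limit_area="inside")

print(forward.tolist())
print(inside_both.tolist())
```

With limit_area="inside", the leading and trailing NaN values are left untouched, which avoids extending the edge values outward.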

Replacing values with replace

replace can replace a single value, or map a list of values to a list of replacements:

In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

In [103]: ser.replace(0, 5)
Out[103]:
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]:
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

In a DF, a dict selects which value to replace in each column:

In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]:
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9
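
In the context of missing data, replace is often used to turn sentinel values into NaN so that the other tools in this article can handle them. A sketch, where the frame and the sentinels -999 and "N/A" are assumptions for illustration:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {"temp": [21.5, -999.0, 19.0], "city": ["Oslo", "N/A", "Rome"]}
)

# A nested dict maps, per column, each sentinel to its replacement
clean = raw.replace({"temp": {-999.0: np.nan}, "city": {"N/A": np.nan}})

# The sentinel no longer distorts the statistics
print(clean.isna().sum())
print(clean["temp"].mean())
```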

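Instead of a replacement value, you can give a fill method, here pad, so that each matched value takes the last preceding value that was not matched:

In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]:
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64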
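
Newer pandas releases are moving away from the method argument of replace. One way to get the same result without it, offered here as a sketch and not as the article's method, is to mask the matched values and forward fill:

```python
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# mask() turns the matched values into NaN, ffill() then pads them
padded = ser.mask(ser.isin([1, 2, 3])).ffill()

print(padded.tolist())
```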

This article is also available at http://www.flydean.com/07-python-pandas-missingdata/


Advanced Pandas Tutorial: Handling Missing Data