Handling Missing Data in Pandas

Handling missing data

There are two kinds of missing values:

None
np.nan(NaN)

import numpy as np

import pandas

from pandas import DataFrame

1. None

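None is built into Python; its type is NoneType, and an array that contains it is stored with dtype object. Because of this, None cannot take part in arithmetic: an expression such as None + 1 raises a TypeError.

# Check the type of None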

type(None)

NoneType
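As a small sketch of what this means in practice (the variable names are illustrative and not from the original), adding a number to None raises a TypeError, and a NumPy array that contains None falls back to dtype object:

```python
import numpy as np

# Arithmetic with None fails because NoneType defines no numeric operators.
try:
    None + 1
    none_add_failed = False
except TypeError:
    none_add_failed = True

# An array holding None cannot use a numeric dtype, so NumPy stores Python objects.
arr = np.array([1, 2, None])

print(none_add_failed)  # True
print(arr.dtype)        # object
```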

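2. np.nan (NaN)

np.nan is a float, so it can take part in arithmetic. However, NaN propagates: ordinary operations such as addition or multiplication with NaN return NaN.

# Check the type of np.nan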

type(np.nan)

float
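The propagation behaviour can be checked with a short sketch. The array `values` is a made-up example; `np.nansum` is shown only as one NaN-aware alternative that NumPy provides:

```python
import numpy as np

total = np.nan + 1                      # NaN propagates through ordinary arithmetic
values = np.array([1.0, np.nan, 3.0])   # illustrative data, not from the original

print(np.isnan(total))         # True
print(np.isnan(values.sum()))  # True: a plain sum is poisoned by the NaN
print(np.nansum(values))       # 4.0: nansum skips NaN entries
print(np.nan == np.nan)        # False: NaN never compares equal to itself
```

Because NaN does not compare equal to itself, test for it with `np.isnan` (or the pandas functions covered later) rather than with `==`.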

3. None and NaN in pandas

Create a DataFrame

df = DataFrame(data=np.random.randint(0,100,size=(10,8)))

df

	0	1	2	3	4	5	6	7
0	22	13	16	41	81	7	25	86
1	23	3	57	20	4	58	69	40
2	35	81	80	63	53	43	20	35
3	40	14	48	89	34	4	64	46
4	36	14	62	30	80	99	88	59
5	9	98	83	81	69	46	39	7
6	55	88	81	75	35	44	27	64
7	14	74	24	3	54	99	75	53
8	24	22	41	68	1	87	46	19
9	82	10	36	99	85	36	12	83

# Set some elements to missing values; pandas stores both None and np.nan as NaN here

df.iloc[1,4] = None

df.iloc[3,6] = None

df.iloc[7,7] = None

df.iloc[3,1] = None

df.iloc[5,5] = np.nan

df

	0	1	2	3	4	5	6	7
0	22	13.0	16	41	81.0	7.0	25.0	86.0
1	23	3.0	57	20	NaN	58.0	69.0	40.0
2	35	81.0	80	63	53.0	43.0	20.0	35.0
3	40	NaN	48	89	34.0	4.0	NaN	46.0
4	36	14.0	62	30	80.0	99.0	88.0	59.0
5	9	98.0	83	81	69.0	NaN	39.0	7.0
6	55	88.0	81	75	35.0	44.0	27.0	64.0
7	14	74.0	24	3	54.0	99.0	75.0	NaN
8	24	22.0	41	68	1.0	87.0	46.0	19.0
9	82	10.0	36	99	85.0	36.0	12.0	83.0
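The output above shows that every column that received a missing value was converted from integer to float, and that None was stored as NaN. A minimal sketch of the same conversion on a Series (the Series `s` is an illustrative example, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Both None and np.nan end up as NaN, and the integers are upcast to float64.
s = pd.Series([1, None, np.nan, 4])

print(s.dtype)           # float64
print(s.isnull().sum())  # 2
```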

Operations on missing values in pandas

Detection functions

isnull()
notnull()

df.isnull()   # True where the value is missing

df.notnull()  # True where the value is not missing

	0	1	2	3	4	5	6	7
0	True	True	True	True	True	True	True	True
1	True	True	True	True	False	True	True	True
2	True	True	True	True	True	True	True	True
3	True	False	True	True	True	True	False	True
4	True	True	True	True	True	True	True	True
5	True	True	True	True	True	False	True	True
6	True	True	True	True	True	True	True	True
7	True	True	True	True	True	True	True	False
8	True	True	True	True	True	True	True	True
9	True	True	True	True	True	True	True	True
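A common follow-up is to count the missing values instead of reading the boolean table by eye. The sketch below uses a small hypothetical frame named `demo`; summing the boolean mask works because True counts as 1:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

missing_per_column = demo.isnull().sum()        # number of NaN values in each column
total_missing = int(missing_per_column.sum())   # number of NaN values in the whole frame

print(missing_per_column["a"])  # 1
print(missing_per_column["b"])  # 2
print(total_missing)            # 3
```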

Combining df.notnull() / df.isnull() with any() / all()

df.isnull()

	0	1	2	3	4	5	6	7
0	False	False	False	False	False	False	False	False
1	False	False	False	False	True	False	False	False
2	False	False	False	False	False	False	False	False
3	False	True	False	False	False	False	True	False
4	False	False	False	False	False	False	False	False
5	False	False	False	False	False	True	False	False
6	False	False	False	False	False	False	False	False
7	False	False	False	False	False	False	False	True
8	False	False	False	False	False	False	False	False
9	False	False	False	False	False	False	False	False

df.isnull().any(axis=1)  # any acts as OR; axis=1 reduces across each row, so a row is True if it contains any True

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
9    False
dtype: bool

df.notnull().all(axis=1) # all acts as AND; axis=1 reduces across each row, so a row is True only if every value is True

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9     True
dtype: bool

df.loc[~df.isnull().any(axis=1)] # ~ negates the mask, keeping only rows with no missing values

These are typically paired as follows:

isnull()->any
notnull()->all
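The two pairings describe complementary masks, which the following sketch checks on a small hypothetical frame (`demo`, `has_missing`, and `is_complete` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

has_missing = demo.isnull().any(axis=1)   # True for rows with at least one NaN
is_complete = demo.notnull().all(axis=1)  # True for rows with no NaN at all

# The second mask is exactly the negation of the first.
masks_agree = bool((is_complete == ~has_missing).all())
complete_rows = demo.loc[is_complete]

print(masks_agree)                   # True
print(complete_rows.index.tolist())  # [0]
```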

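df.dropna() can drop either rows or columns (rows by default): axis=0 drops rows that contain missing values, and axis=1 drops columns that contain them.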

df.dropna(axis=0)

	0	1	2	3	4	5	6	7
0	22	13.0	16	41	81.0	7.0	25.0	86.0
2	35	81.0	80	63	53.0	43.0	20.0	35.0
4	36	14.0	62	30	80.0	99.0	88.0	59.0
6	55	88.0	81	75	35.0	44.0	27.0	64.0
8	24	22.0	41	68	1.0	87.0	46.0	19.0
9	82	10.0	36	99	85.0	36.0	12.0	83.0
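Besides axis, dropna() accepts thresh and subset, which the original does not cover. The sketch below shows them on a hypothetical frame, as one possible way to drop less aggressively than the default:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

rows_any = demo.dropna()                # drop rows that contain any NaN
cols_any = demo.dropna(axis=1)          # drop columns that contain any NaN
rows_thresh = demo.dropna(thresh=2)     # keep rows with at least 2 non-missing values
rows_subset = demo.dropna(subset=["b"]) # only look at column "b" when deciding

print(rows_any.index.tolist())     # [0]
print(cols_any.columns.tolist())   # ['c']
print(rows_thresh.index.tolist())  # [0, 1]
print(rows_subset.index.tolist())  # [0, 1]
```

None of these calls modifies demo; each returns a new DataFrame.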

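Fill functions for Series/DataFrame

fillna(): the value and method parameters (pandas 2.1 and later deprecate method in favor of the ffill() and bfill() methods)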

df

	0	1	2	3	4	5	6	7
0	22	13.0	16	41	81.0	7.0	25.0	86.0
1	23	3.0	57	20	NaN	58.0	69.0	40.0
2	35	81.0	80	63	53.0	43.0	20.0	35.0
3	40	NaN	48	89	34.0	4.0	NaN	46.0
4	36	14.0	62	30	80.0	99.0	88.0	59.0
5	9	98.0	83	81	69.0	NaN	39.0	7.0
6	55	88.0	81	75	35.0	44.0	27.0	64.0
7	14	74.0	24	3	54.0	99.0	75.0	NaN
8	24	22.0	41	68	1.0	87.0	46.0	19.0
9	82	10.0	36	99	85.0	36.0	12.0	83.0

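# bfill fills from the next valid value; ffill fills from the previous valid value

# axis sets the direction: 0 fills vertically (up/down), 1 fills horizontally (left/right)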

df_test = df.fillna(method='bfill',axis=1).fillna(method='ffill',axis=1)

df_test

	0	1	2	3	4	5	6	7
0	22.0	13.0	16.0	41.0	81.0	7.0	25.0	86.0
1	23.0	3.0	57.0	20.0	58.0	58.0	69.0	40.0
2	35.0	81.0	80.0	63.0	53.0	43.0	20.0	35.0
3	40.0	48.0	48.0	89.0	34.0	4.0	46.0	46.0
4	36.0	14.0	62.0	30.0	80.0	99.0	88.0	59.0
5	9.0	98.0	83.0	81.0	69.0	39.0	39.0	7.0
6	55.0	88.0	81.0	75.0	35.0	44.0	27.0	64.0
7	14.0	74.0	24.0	3.0	54.0	99.0	75.0	75.0
8	24.0	22.0	41.0	68.0	1.0	87.0	46.0	19.0
9	82.0	10.0	36.0	99.0	85.0	36.0	12.0	83.0
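The reason for chaining a backward fill with a forward fill is that a NaN in the last column has nothing to its right to copy from. The sketch below shows this on a hypothetical two-row frame, using the DataFrame.bfill() and DataFrame.ffill() methods as an equivalent spelling of the fillna(method=...) calls above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in the middle of a row, one at the end of a row.
demo = pd.DataFrame([[1.0, np.nan, 3.0],
                     [4.0, 5.0, np.nan]])

after_bfill = demo.bfill(axis=1)        # the trailing NaN in row 1 is still missing
filled = after_bfill.ffill(axis=1)      # the forward fill picks up what bfill could not

print(int(after_bfill.isnull().sum().sum()))  # 1
print(filled.values.tolist())                 # [[1.0, 3.0, 3.0], [4.0, 5.0, 5.0]]
```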

# Check which columns of df_test still contain missing values

df_test.isnull().any(axis=0)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
dtype: bool
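The value parameter of fillna() is not demonstrated above, so here is a minimal sketch on a hypothetical frame. Filling with the column means is just one common choice, not something the original prescribes:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

zero_filled = demo.fillna(0)            # one scalar for every missing value
mean_filled = demo.fillna(demo.mean())  # per-column value; mean() skips NaN by default

print(zero_filled["a"].tolist())  # [1.0, 0.0, 3.0]
print(mean_filled["a"].tolist())  # [1.0, 2.0, 3.0]
print(mean_filled["b"].tolist())  # [6.0, 4.0, 8.0]
```

Passing a Series such as demo.mean() works because fillna() matches its index labels against the column names.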