Pandas的高级操作
pandas数据处理
1. 删除重复元素
使用duplicated()函数检测重复的行,返回元素为布尔类型的Series对象,每个元素对应一行,如果该行不是第一次出现,则元素为True
keep参数:指定保留哪一重复的行数据
创建具有重复元素行的DataFrame
import numpy as np
import pandas as pd
from pandas import DataFrame
# 创建一个df
df = DataFrame(data=np.random.randint(0,100,size=(12,7)))
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 18 | 75 | 98 | 68 | 33 | 35 | 15 |
| 1 | 48 | 71 | 36 | 34 | 41 | 17 | 46 |
| 2 | 3 | 28 | 81 | 21 | 86 | 53 | 85 |
| 3 | 15 | 35 | 35 | 90 | 63 | 83 | 14 |
| 4 | 12 | 36 | 65 | 79 | 25 | 53 | 95 |
| 5 | 98 | 63 | 4 | 58 | 35 | 64 | 80 |
| 6 | 31 | 61 | 23 | 33 | 80 | 53 | 60 |
| 7 | 52 | 47 | 60 | 58 | 54 | 35 | 17 |
| 8 | 7 | 92 | 42 | 61 | 31 | 40 | 56 |
| 9 | 76 | 45 | 30 | 42 | 74 | 83 | 53 |
| 10 | 69 | 2 | 89 | 99 | 12 | 51 | 62 |
| 11 | 17 | 86 | 1 | 76 | 40 | 34 | 41 |
# 手动将df的某几行设置成相同的内容
df.iloc[1] = [6,6,6,6,6,6,6]
df.iloc[8] = [6,6,6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6,6,6]
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 18 | 75 | 98 | 68 | 33 | 35 | 15 |
| 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 2 | 3 | 28 | 81 | 21 | 86 | 53 | 85 |
| 3 | 15 | 35 | 35 | 90 | 63 | 83 | 14 |
| 4 | 12 | 36 | 65 | 79 | 25 | 53 | 95 |
| 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 6 | 31 | 61 | 23 | 33 | 80 | 53 | 60 |
| 7 | 52 | 47 | 60 | 58 | 54 | 35 | 17 |
| 8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 9 | 76 | 45 | 30 | 42 | 74 | 83 | 53 |
| 10 | 69 | 2 | 89 | 99 | 12 | 51 | 62 |
| 11 | 17 | 86 | 1 | 76 | 40 | 34 | 41 |
- 使用drop_duplicates()函数删除重复的行
- drop_duplicates(keep='first/last'/False)
df.drop_duplicates(keep='last') # 保留最后一个重复的行
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
| 2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
| 3 | 7 | 80 | 43 | 91 | 32 | 95 | 6 |
| 4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
| 6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
| 7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
| 8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
| 10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
| 11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
2. 映射
1) replace()函数:替换元素
DataFrame替换操作
单值替换
- 普通替换: 替换所有符合要求的元素:to_replace=15,value='e'
- 按列指定单值替换: to_replace={列标签:替换值} value='value'
多值替换
- 列表替换: to_replace=[] value=[]
- 字典替换(推荐) to_replace={to_replace:value,to_replace:value}
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
| 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
| 3 | 7 | 80 | 43 | 91 | 32 | 95 | 6 |
| 4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
| 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
| 7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
| 8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
| 10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
| 11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
注意:DataFrame中,无法使用method和limit参数
df.replace(to_replace=6,value='six') # 将数据中的所有6替换成six,默认不改变原表
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 18 | 75 | 98 | 68 | 33 | 35 | 15 |
| 1 | six | six | six | six | six | six | six |
| 2 | 3 | 28 | 81 | 21 | 86 | 53 | 85 |
| 3 | 15 | 35 | 35 | 90 | 63 | 83 | 14 |
| 4 | 12 | 36 | 65 | 79 | 25 | 53 | 95 |
| 5 | six | six | six | six | six | six | six |
| 6 | 31 | 61 | 23 | 33 | 80 | 53 | 60 |
| 7 | 52 | 47 | 60 | 58 | 54 | 35 | 17 |
| 8 | six | six | six | six | six | six | six |
| 9 | 76 | 45 | 30 | 42 | 74 | 83 | 53 |
| 10 | 69 | 2 | 89 | 99 | 12 | 51 | 62 |
| 11 | 17 | 86 | 1 | 76 | 40 | 34 | 41 |
df.replace(to_replace={6:'six'}) # 效果同上
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
| 1 | six | six | six | six | six | six | six |
| 2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
| 3 | 7 | 80 | 43 | 91 | 32 | 95 | six |
| 4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
| 5 | six | six | six | six | six | six | six |
| 6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
| 7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
| 8 | six | six | six | six | six | six | six |
| 9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
| 10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
| 11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
df.replace(to_replace={5:6},value='six') # 将第5列中的6,替换成six
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
| 1 | 6 | 6 | 6 | 6 | 6 | six | 6 |
| 2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
| 3 | 7 | 80 | 43 | 91 | 32 | 95 | 6 |
| 4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
| 5 | 6 | 6 | 6 | 6 | 6 | six | 6 |
| 6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
| 7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
| 8 | 6 | 6 | 6 | 6 | 6 | six | 6 |
| 9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
| 10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
| 11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
2) map()函数
新建一列 , map函数并不是df的方法,而是series的方法
map()可以映射新一列数据
map()中可以使用lambd表达式
map()中可以使用方法,可以是自定义的方法
eg:map({to_replace:value})
注意 map()中不能使用sum之类的函数,for循环
新增一列:给df中,添加一列,该列的值为中文名对应的英文名
dic = {
'name':['张三','周杰伦','张三'],
'salary':[20000,10000,20000]
}
df = DataFrame(data=dic)
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| name | salary | |
|---|---|---|
| 0 | 张三 | 20000 |
| 1 | 周杰伦 | 10000 |
| 2 | 张三 | 20000 |
# 映射关系表
dic = {
'张三':'tom',
'周杰伦':'jay'
}
df['e_name'] = df['name'].map(dic)
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| name | salary | e_name | |
|---|---|---|---|
| 0 | 张三 | 20000 | tom |
| 1 | 周杰伦 | 10000 | jay |
| 2 | 张三 | 20000 | tom |
map当做一种运算工具,至于执行何种运算,是由map函数的参数决定的(参数:lambda,函数)
- 使用自定义函数
def after_sal(s):
return s - (s-3000)*0.5
# 超过3000部分的钱缴纳50%的税
df['after_sal'] = df['salary'].map(after_sal)
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| name | salary | e_name | after_sal | |
|---|---|---|---|---|
| 0 | 张三 | 20000 | tom | 11500.0 |
| 1 | 周杰伦 | 10000 | jay | 6500.0 |
| 2 | 张三 | 20000 | tom | 11500.0 |
df['after_sal'] = df['salary'].apply(after_sal) # apply效率高于map
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| name | salary | e_name | after_sal | |
|---|---|---|---|---|
| 0 | 张三 | 20000 | tom | 11500.0 |
| 1 | 周杰伦 | 10000 | jay | 6500.0 |
| 2 | 张三 | 20000 | tom | 11500.0 |
- 使用lambda表达式
df['after_sal'] = df['salary'].apply(lambda x:x-(x-3000)*0.5) # 或map
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| name | salary | e_name | after_sal | |
|---|---|---|---|---|
| 0 | 张三 | 20000 | tom | 11500.0 |
| 1 | 周杰伦 | 10000 | jay | 6500.0 |
| 2 | 张三 | 20000 | tom | 11500.0 |
注意:并不是任何形式的函数都可以作为map的参数。只有当一个函数具有一个参数且有返回值,那么该函数才可以作为map的参数。
3. 使用聚合操作对数据异常值检测和过滤
使用df.std()函数可以求得DataFrame对象每一列的标准差
- 创建一个1000行3列的df 范围(0-1),求其每一列的标准差
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| A | B | C | |
|---|---|---|---|
| 0 | 0.056365 | 0.080972 | 0.378327 |
| 1 | 0.371930 | 0.007791 | 0.318345 |
| 2 | 0.140999 | 0.921772 | 0.752930 |
| 3 | 0.877110 | 0.447756 | 0.760049 |
| 4 | 0.212178 | 0.143772 | 0.621486 |
| 5 | 0.255404 | 0.195473 | 0.008677 |
| 6 | 0.011568 | 0.308934 | 0.882607 |
| 7 | 0.470868 | 0.080049 | 0.285998 |
| 8 | 0.659013 | 0.794802 | 0.270541 |
| 9 | 0.315826 | 0.814653 | 0.906056 |
| 10 | 0.892474 | 0.301340 | 0.687254 |
| 11 | 0.015484 | 0.567598 | 0.043682 |
| 12 | 0.957620 | 0.967676 | 0.063608 |
| 13 | 0.102506 | 0.490077 | 0.235902 |
| 14 | 0.099083 | 0.778190 | 0.451824 |
| 15 | 0.023148 | 0.074169 | 0.589411 |
| 16 | 0.425894 | 0.772662 | 0.797658 |
| 17 | 0.939475 | 0.773502 | 0.766101 |
| 18 | 0.330299 | 0.984615 | 0.346554 |
| 19 | 0.882735 | 0.237546 | 0.847036 |
| 20 | 0.578589 | 0.730879 | 0.751632 |
| 21 | 0.504627 | 0.716272 | 0.386102 |
| 22 | 0.424879 | 0.231262 | 0.590047 |
| 23 | 0.580738 | 0.675268 | 0.726104 |
| 24 | 0.507248 | 0.136465 | 0.463764 |
| 25 | 0.421517 | 0.814806 | 0.449040 |
| 26 | 0.275373 | 0.935430 | 0.525679 |
| 27 | 0.404031 | 0.221492 | 0.730966 |
| 28 | 0.779142 | 0.063435 | 0.120807 |
| 29 | 0.618392 | 0.535934 | 0.554632 |
| ... | ... | ... | ... |
| 970 | 0.378107 | 0.687434 | 0.567923 |
| 971 | 0.876770 | 0.443219 | 0.236627 |
| 972 | 0.486757 | 0.416836 | 0.524889 |
| 973 | 0.886021 | 0.203959 | 0.789022 |
| 974 | 0.838247 | 0.279468 | 0.333581 |
| 975 | 0.762230 | 0.352878 | 0.550439 |
| 976 | 0.044568 | 0.680916 | 0.350743 |
| 977 | 0.031232 | 0.029839 | 0.918445 |
| 978 | 0.323142 | 0.686965 | 0.978349 |
| 979 | 0.746471 | 0.081773 | 0.729567 |
| 980 | 0.810169 | 0.793025 | 0.993532 |
| 981 | 0.480849 | 0.321984 | 0.233431 |
| 982 | 0.491794 | 0.056681 | 0.429988 |
| 983 | 0.278019 | 0.105290 | 0.435492 |
| 984 | 0.480974 | 0.098199 | 0.958667 |
| 985 | 0.465396 | 0.806955 | 0.668972 |
| 986 | 0.602675 | 0.966963 | 0.338542 |
| 987 | 0.051971 | 0.105833 | 0.132917 |
| 988 | 0.416362 | 0.861777 | 0.832573 |
| 989 | 0.951651 | 0.002912 | 0.942564 |
| 990 | 0.274033 | 0.071102 | 0.941272 |
| 991 | 0.632913 | 0.807060 | 0.540686 |
| 992 | 0.035006 | 0.526970 | 0.058584 |
| 993 | 0.368957 | 0.395593 | 0.210440 |
| 994 | 0.692847 | 0.655492 | 0.877564 |
| 995 | 0.245593 | 0.003551 | 0.913750 |
| 996 | 0.374804 | 0.311604 | 0.680521 |
| 997 | 0.355928 | 0.924330 | 0.224949 |
| 998 | 0.923060 | 0.834740 | 0.275359 |
| 999 | 0.905336 | 0.482290 | 0.722851 |
1000 rows × 3 columns
对df应用筛选条件,去除标准差太大的数据:假设过滤条件为 C列数据大于两倍的C列标准差
twice_std = df['C'].std() * 2
twice_std
0.5714973528631762
~(df['C'] > twice_std)
0 True
1 True
2 False
3 False
4 False
5 True
6 False
7 True
8 True
9 False
10 False
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 True
19 False
20 False
21 True
22 False
23 False
24 True
25 True
26 True
27 False
28 True
29 True
...
970 True
971 True
972 True
973 False
974 True
975 True
976 True
977 False
978 False
979 False
980 False
981 True
982 True
983 True
984 False
985 False
986 True
987 True
988 False
989 False
990 False
991 True
992 True
993 True
994 False
995 False
996 False
997 True
998 True
999 False
Name: C, Length: 1000, dtype: bool
df.loc[~(df['C'] > twice_std)]
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| A | B | C | |
|---|---|---|---|
| 0 | 0.056365 | 0.080972 | 0.378327 |
| 1 | 0.371930 | 0.007791 | 0.318345 |
| 5 | 0.255404 | 0.195473 | 0.008677 |
| 7 | 0.470868 | 0.080049 | 0.285998 |
| 8 | 0.659013 | 0.794802 | 0.270541 |
| 11 | 0.015484 | 0.567598 | 0.043682 |
| 12 | 0.957620 | 0.967676 | 0.063608 |
| 13 | 0.102506 | 0.490077 | 0.235902 |
| 14 | 0.099083 | 0.778190 | 0.451824 |
| 18 | 0.330299 | 0.984615 | 0.346554 |
| 21 | 0.504627 | 0.716272 | 0.386102 |
| 24 | 0.507248 | 0.136465 | 0.463764 |
| 25 | 0.421517 | 0.814806 | 0.449040 |
| 26 | 0.275373 | 0.935430 | 0.525679 |
| 28 | 0.779142 | 0.063435 | 0.120807 |
| 29 | 0.618392 | 0.535934 | 0.554632 |
| 31 | 0.616251 | 0.034984 | 0.342615 |
| 33 | 0.009574 | 0.195987 | 0.221378 |
| 35 | 0.721609 | 0.518311 | 0.561978 |
| 36 | 0.316993 | 0.678054 | 0.163737 |
| 37 | 0.494355 | 0.499986 | 0.560351 |
| 39 | 0.584863 | 0.881738 | 0.481162 |
| 43 | 0.160369 | 0.402388 | 0.208208 |
| 45 | 0.002698 | 0.576528 | 0.070493 |
| 47 | 0.764883 | 0.778927 | 0.494559 |
| 48 | 0.868643 | 0.392903 | 0.109240 |
| 49 | 0.058928 | 0.350504 | 0.497170 |
| 50 | 0.373490 | 0.783554 | 0.335720 |
| 55 | 0.638066 | 0.442382 | 0.173654 |
| 56 | 0.837218 | 0.722685 | 0.454352 |
| ... | ... | ... | ... |
| 943 | 0.322268 | 0.957496 | 0.108147 |
| 944 | 0.384463 | 0.490386 | 0.245737 |
| 945 | 0.382611 | 0.726888 | 0.345724 |
| 947 | 0.713337 | 0.828064 | 0.364005 |
| 948 | 0.818703 | 0.445825 | 0.281585 |
| 951 | 0.968651 | 0.897188 | 0.368103 |
| 952 | 0.136136 | 0.431300 | 0.543917 |
| 954 | 0.846105 | 0.064527 | 0.200963 |
| 955 | 0.708107 | 0.857570 | 0.475146 |
| 957 | 0.595819 | 0.060763 | 0.294676 |
| 958 | 0.268046 | 0.790128 | 0.342255 |
| 959 | 0.116645 | 0.968789 | 0.493773 |
| 967 | 0.073665 | 0.204168 | 0.286095 |
| 968 | 0.205796 | 0.596242 | 0.468190 |
| 970 | 0.378107 | 0.687434 | 0.567923 |
| 971 | 0.876770 | 0.443219 | 0.236627 |
| 972 | 0.486757 | 0.416836 | 0.524889 |
| 974 | 0.838247 | 0.279468 | 0.333581 |
| 975 | 0.762230 | 0.352878 | 0.550439 |
| 976 | 0.044568 | 0.680916 | 0.350743 |
| 981 | 0.480849 | 0.321984 | 0.233431 |
| 982 | 0.491794 | 0.056681 | 0.429988 |
| 983 | 0.278019 | 0.105290 | 0.435492 |
| 986 | 0.602675 | 0.966963 | 0.338542 |
| 987 | 0.051971 | 0.105833 | 0.132917 |
| 991 | 0.632913 | 0.807060 | 0.540686 |
| 992 | 0.035006 | 0.526970 | 0.058584 |
| 993 | 0.368957 | 0.395593 | 0.210440 |
| 997 | 0.355928 | 0.924330 | 0.224949 |
| 998 | 0.923060 | 0.834740 | 0.275359 |
559 rows × 3 columns
- 检测过滤缺失值
- dropna
- fillna
- 检测过滤重复值
- drop_duplicated(keep)
- 检测过滤异常值
- 得到鉴定异常值的条件
- 将异常值对应的行删除
4. 排序
使用.take()函数排序
- take()函数接受一个索引列表,用数字表示,使得df根据列表中索引的顺序进行排序
- eg:df.take([1,3,4,2,5])
可以借助np.random.permutation()函数随机排序
np.random.permutation(1000) # 将0-999进行随机排列
array([956, 614, 993, 437, 371, 215, 579, 282, 301, 646, 893, 7, 441,
539, 953, 794, 155, 370, 154, 100, 753, 793, 412, 867, 941, 998,
672, 590, 708, 1, 634, 899, 417, 242, 557, 122, 397, 850, 543,
560, 389, 896, 903, 505, 685, 334, 665, 460, 768, 937, 522, 637,
121, 605, 107, 130, 286, 532, 982, 563, 995, 89, 217, 218, 82,
781, 951, 798, 200, 947, 790, 398, 538, 411, 15, 44, 784, 205,
281, 314, 439, 132, 192, 238, 795, 470, 65, 842, 259, 426, 528,
383, 682, 750, 119, 465, 503, 278, 715, 603, 544, 265, 239, 569,
204, 616, 343, 710, 653, 256, 6, 873, 338, 27, 570, 707, 70,
73, 233, 838, 799, 266, 859, 279, 136, 479, 724, 870, 611, 574,
564, 655, 177, 39, 253, 148, 471, 317, 661, 851, 69, 523, 513,
928, 650, 23, 582, 622, 814, 959, 723, 938, 612, 912, 865, 402,
638, 80, 962, 214, 983, 194, 680, 758, 29, 74, 86, 102, 583,
695, 580, 835, 931, 832, 454, 258, 493, 967, 670, 555, 494, 501,
581, 591, 179, 354, 118, 671, 380, 732, 229, 719, 623, 874, 495,
944, 900, 123, 250, 628, 244, 872, 731, 625, 586, 57, 752, 596,
827, 775, 841, 163, 394, 833, 153, 669, 295, 826, 384, 890, 711,
60, 141, 237, 198, 404, 463, 712, 960, 749, 510, 866, 609, 26,
169, 372, 459, 365, 949, 124, 733, 12, 257, 668, 878, 487, 138,
652, 300, 219, 413, 445, 193, 207, 337, 779, 77, 95, 693, 812,
409, 33, 490, 992, 9, 167, 358, 743, 369, 99, 817, 542, 706,
289, 589, 666, 927, 391, 761, 844, 452, 66, 830, 498, 968, 689,
329, 508, 526, 335, 884, 129, 972, 507, 480, 274, 110, 425, 500,
388, 418, 869, 769, 251, 863, 456, 112, 247, 304, 478, 481, 429,
741, 241, 347, 37, 673, 427, 285, 415, 59, 853, 144, 822, 125,
455, 64, 332, 71, 971, 763, 804, 19, 191, 918, 608, 61, 327,
137, 116, 746, 482, 828, 766, 691, 424, 727, 468, 633, 302, 861,
848, 134, 704, 491, 320, 280, 660, 375, 846, 359, 987, 511, 342,
307, 399, 76, 825, 11, 28, 961, 485, 451, 675, 457, 618, 554,
551, 885, 531, 880, 534, 160, 607, 367, 374, 797, 910, 970, 595,
575, 756, 90, 897, 801, 49, 140, 985, 512, 577, 922, 168, 225,
360, 315, 350, 919, 231, 911, 631, 31, 774, 103, 186, 892, 293,
483, 149, 860, 887, 93, 340, 744, 908, 52, 196, 222, 955, 3,
930, 571, 484, 156, 50, 843, 599, 506, 936, 703, 881, 273, 520,
41, 85, 328, 223, 48, 492, 97, 56, 36, 974, 924, 656, 58,
649, 92, 114, 62, 173, 984, 973, 346, 573, 996, 597, 990, 667,
206, 917, 213, 272, 462, 686, 469, 472, 236, 643, 787, 224, 120,
255, 24, 171, 94, 904, 771, 344, 556, 981, 593, 988, 271, 762,
363, 254, 535, 361, 979, 303, 692, 964, 504, 150, 894, 349, 796,
714, 525, 943, 785, 260, 145, 292, 718, 811, 234, 641, 403, 818,
999, 461, 778, 802, 901, 352, 40, 515, 32, 877, 664, 323, 966,
635, 905, 754, 940, 810, 182, 75, 442, 308, 262, 776, 592, 267,
203, 294, 657, 34, 414, 405, 232, 151, 373, 601, 14, 807, 467,
421, 43, 935, 430, 287, 313, 283, 152, 516, 530, 356, 559, 518,
644, 889, 977, 521, 548, 381, 674, 929, 0, 916, 246, 540, 297,
67, 980, 422, 117, 772, 53, 13, 91, 46, 423, 509, 21, 128,
598, 115, 610, 679, 783, 264, 78, 270, 824, 311, 648, 220, 636,
226, 658, 886, 227, 268, 773, 620, 529, 864, 502, 567, 713, 963,
366, 210, 333, 249, 600, 701, 2, 640, 407, 745, 942, 113, 87,
390, 159, 188, 948, 957, 488, 351, 288, 245, 431, 248, 164, 767,
839, 702, 803, 792, 594, 837, 489, 934, 684, 386, 629, 519, 876,
63, 448, 98, 858, 378, 298, 368, 453, 25, 868, 624, 79, 133,
902, 906, 428, 401, 162, 157, 728, 950, 662, 190, 496, 568, 975,
952, 627, 909, 994, 131, 780, 751, 883, 871, 319, 722, 199, 536,
209, 821, 318, 290, 393, 35, 325, 187, 786, 681, 284, 514, 331,
647, 855, 143, 989, 642, 96, 676, 986, 561, 602, 336, 20, 379,
847, 735, 954, 645, 547, 357, 447, 435, 739, 228, 566, 305, 353,
158, 755, 716, 730, 856, 127, 47, 392, 862, 809, 720, 760, 432,
243, 932, 208, 382, 585, 747, 111, 836, 736, 700, 705, 615, 355,
18, 330, 820, 8, 857, 184, 175, 221, 737, 524, 697, 436, 395,
764, 939, 104, 759, 819, 240, 659, 147, 269, 387, 420, 621, 364,
926, 201, 549, 165, 696, 742, 997, 181, 277, 726, 10, 683, 991,
291, 81, 126, 68, 920, 808, 572, 740, 533, 699, 72, 146, 230,
888, 5, 606, 466, 263, 458, 898, 604, 385, 805, 105, 211, 945,
958, 721, 823, 376, 497, 545, 576, 738, 626, 852, 449, 541, 444,
406, 976, 88, 815, 552, 166, 183, 178, 438, 553, 84, 83, 717,
651, 782, 678, 324, 584, 42, 687, 517, 195, 106, 101, 933, 434,
348, 440, 587, 310, 923, 663, 921, 499, 565, 296, 38, 891, 895,
316, 30, 978, 677, 170, 322, 613, 546, 527, 630, 476, 174, 51,
816, 845, 185, 108, 17, 321, 813, 806, 109, 882, 197, 550, 907,
339, 698, 965, 362, 729, 914, 791, 694, 475, 879, 486, 309, 748,
326, 688, 202, 410, 915, 690, 854, 377, 341, 788, 22, 777, 275,
473, 261, 400, 45, 54, 135, 770, 189, 946, 562, 925, 537, 789,
312, 829, 725, 252, 800, 578, 446, 55, 419, 396, 4, 558, 212,
831, 450, 299, 161, 617, 345, 306, 757, 709, 180, 235, 433, 840,
477, 913, 474, 734, 408, 443, 834, 654, 875, 172, 632, 416, 16,
216, 464, 139, 619, 588, 969, 176, 276, 142, 639, 765, 849])
# 行排序与列排序均随机
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| A | C | B | |
|---|---|---|---|
| 810 | 0.056462 | 0.836914 | 0.105296 |
| 2 | 0.140999 | 0.752930 | 0.921772 |
| 721 | 0.941986 | 0.206568 | 0.283233 |
| 803 | 0.302248 | 0.027969 | 0.946815 |
| 46 | 0.576391 | 0.604795 | 0.199215 |
| 224 | 0.091563 | 0.448896 | 0.460941 |
| 682 | 0.081894 | 0.360009 | 0.174743 |
| 894 | 0.758221 | 0.311932 | 0.054626 |
| 389 | 0.951142 | 0.174418 | 0.764700 |
| 441 | 0.283697 | 0.577370 | 0.698306 |
| 350 | 0.623445 | 0.681211 | 0.547610 |
| 53 | 0.186217 | 0.617344 | 0.339724 |
| 467 | 0.231915 | 0.540558 | 0.972880 |
| 962 | 0.543442 | 0.895628 | 0.444214 |
| 598 | 0.516110 | 0.047393 | 0.670478 |
| 337 | 0.022056 | 0.222698 | 0.010719 |
| 481 | 0.182805 | 0.301250 | 0.652167 |
| 277 | 0.127561 | 0.749532 | 0.170472 |
| 162 | 0.767004 | 0.261541 | 0.381312 |
| 250 | 0.847071 | 0.344957 | 0.539958 |
| 416 | 0.369274 | 0.495600 | 0.393579 |
| 425 | 0.228196 | 0.273655 | 0.114908 |
| 843 | 0.394974 | 0.904397 | 0.875514 |
| 893 | 0.451844 | 0.336345 | 0.787189 |
| 492 | 0.516625 | 0.499929 | 0.350670 |
| 453 | 0.218878 | 0.957251 | 0.308231 |
| 186 | 0.611224 | 0.981765 | 0.809362 |
| 243 | 0.092659 | 0.374212 | 0.658671 |
| 522 | 0.773774 | 0.436375 | 0.037527 |
| 961 | 0.172133 | 0.762221 | 0.800747 |
| ... | ... | ... | ... |
| 624 | 0.587435 | 0.183552 | 0.831386 |
| 675 | 0.636248 | 0.542904 | 0.918788 |
| 861 | 0.519202 | 0.322943 | 0.315798 |
| 989 | 0.951651 | 0.942564 | 0.002912 |
| 136 | 0.940608 | 0.069835 | 0.504026 |
| 950 | 0.294872 | 0.712361 | 0.821118 |
| 529 | 0.648302 | 0.860493 | 0.626701 |
| 833 | 0.783501 | 0.823326 | 0.357173 |
| 173 | 0.181090 | 0.697154 | 0.906783 |
| 615 | 0.177069 | 0.732558 | 0.275658 |
| 182 | 0.091686 | 0.262477 | 0.340532 |
| 913 | 0.069850 | 0.903723 | 0.102737 |
| 417 | 0.353772 | 0.345310 | 0.618327 |
| 487 | 0.697415 | 0.083422 | 0.921608 |
| 345 | 0.331507 | 0.295755 | 0.995060 |
| 978 | 0.323142 | 0.978349 | 0.686965 |
| 197 | 0.947977 | 0.235533 | 0.295503 |
| 133 | 0.428408 | 0.963203 | 0.485624 |
| 214 | 0.861541 | 0.840486 | 0.435903 |
| 640 | 0.453934 | 0.807253 | 0.940066 |
| 977 | 0.031232 | 0.918445 | 0.029839 |
| 698 | 0.780159 | 0.042282 | 0.127449 |
| 427 | 0.326411 | 0.101616 | 0.915007 |
| 898 | 0.768911 | 0.231629 | 0.451036 |
| 77 | 0.718200 | 0.682757 | 0.986735 |
| 865 | 0.553171 | 0.535761 | 0.088467 |
| 513 | 0.203601 | 0.908238 | 0.116113 |
| 711 | 0.655778 | 0.164941 | 0.472295 |
| 685 | 0.012172 | 0.035356 | 0.501114 |
| 801 | 0.891855 | 0.355426 | 0.682663 |
1000 rows × 3 columns
- np.random.permutation(x)可以生成x个从0-(x-1)的随机数列
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)[0:5]
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| B | A | C | |
|---|---|---|---|
| 839 | 0.817163 | 0.346661 | 0.113644 |
| 708 | 0.644456 | 0.327089 | 0.081710 |
| 244 | 0.852833 | 0.366820 | 0.028656 |
| 728 | 0.627186 | 0.850947 | 0.375577 |
| 238 | 0.784179 | 0.764240 | 0.579280 |
随机抽样
当DataFrame规模足够大时,直接使用np.random.permutation(x)函数,就配合take()函数实现随机抽样
5. 数据分类处理【重点】
数据聚合是数据处理的最后一步,通常是要使每一个数组生成一个单一的数值。
数据分类处理:
- 分组:先把数据分为几组
- 用函数处理:为不同组的数据应用不同的函数以转换数据
- 合并:把不同组得到的结果合并起来
数据分类处理的核心:
- groupby()函数
- groups属性查看分组情况
- eg: df.groupby(by='item').groups
分组
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
'price':[4,3,3,2.5,4,2],
'color':['red','yellow','yellow','green','green','green'],
'weight':[12,20,50,30,20,44]})
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| color | item | price | weight | |
|---|---|---|---|---|
| 0 | red | Apple | 4.0 | 12 |
| 1 | yellow | Banana | 3.0 | 20 |
| 2 | yellow | Orange | 3.0 | 50 |
| 3 | green | Banana | 2.5 | 30 |
| 4 | green | Orange | 4.0 | 20 |
| 5 | green | Apple | 2.0 | 44 |
- 使用groupby实现分组
df.groupby(by='item',axis=0)
<pandas.core.groupby.DataFrameGroupBy object at 0x00000235AA6F6C18>
- 使用groups查看分组情况
# 该函数可以进行数据的分组,但是不显示分组情况
df.groupby(by='item',axis=0).groups
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
- 分组后的聚合操作:分组后的成员中可以被进行运算的值会进行运算,不能被运算的值不进行运算
# 给df创建一个新列,内容为各个水果的平均价格
df.groupby(by='item',axis=0).mean()['price']
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
mean_price_series = df.groupby(by='item',axis=0)['price'].mean()
mean_price_series
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# 映射关系表
dic = mean_price_series.to_dict()
df['mean_price'] = df['item'].map(dic)
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| color | item | price | weight | mean_price | |
|---|---|---|---|---|---|
| 0 | red | Apple | 4.0 | 12 | 3.00 |
| 1 | yellow | Banana | 3.0 | 20 | 2.75 |
| 2 | yellow | Orange | 3.0 | 50 | 3.50 |
| 3 | green | Banana | 2.5 | 30 | 2.75 |
| 4 | green | Orange | 4.0 | 20 | 3.50 |
| 5 | green | Apple | 2.0 | 44 | 3.00 |
计算出苹果的平均价格
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| color | item | price | weight | |
|---|---|---|---|---|
| 0 | red | Apple | 4.0 | 12 |
| 1 | yellow | Banana | 3.0 | 20 |
| 2 | yellow | Orange | 3.0 | 50 |
| 3 | green | Banana | 2.5 | 30 |
| 4 | green | Orange | 4.0 | 20 |
| 5 | green | Apple | 2.0 | 44 |
按颜色查看各种颜色的水果的平均价格
汇总:将各种颜色水果的平均价格和df进行汇总
df['color_mean_price'] = df['color'].map(df.groupby(by='color')['price'].mean().to_dict())
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| color | item | price | weight | mean_price | color_mean_price | |
|---|---|---|---|---|---|---|
| 0 | red | Apple | 4.0 | 12 | 3.00 | 4.000000 |
| 1 | yellow | Banana | 3.0 | 20 | 2.75 | 3.000000 |
| 2 | yellow | Orange | 3.0 | 50 | 3.50 | 3.000000 |
| 3 | green | Banana | 2.5 | 30 | 2.75 | 2.833333 |
| 4 | green | Orange | 4.0 | 20 | 3.50 | 2.833333 |
| 5 | green | Apple | 2.0 | 44 | 3.00 | 2.833333 |
6.高级数据聚合
使用groupby分组后,也可以使用transform和apply提供自定义函数实现更多的运算
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- transform和apply都会进行运算,在transform或者apply中传入函数即可
- transform和apply也可以传入一个lambda表达式
df.groupby(by='item')['price'].mean()
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
def my_mean(s):
sum = 0
for i in s:
sum += i
return sum/len(s)
# 使用apply函数求出水果的平均价格
df.groupby(by='item')['price'].apply(my_mean)
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# 使用transform函数求出水果的平均价格
df.groupby(by='item')['price'].transform(my_mean)
0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
Pandas的高级操作的更多相关文章
- 数据分析05 /pandas的高级操作
数据分析05 /pandas的高级操作 目录 数据分析05 /pandas的高级操作 1. 替换操作 2. 映射操作 3. 运算工具 4. 映射索引 / 更改之前索引 5. 排序实现的随机抽样/打乱表 ...
- 数据分析06 /pandas高级操作相关案例:人口案例分析、2012美国大选献金项目数据分析
数据分析06 /pandas高级操作相关案例:人口案例分析.2012美国大选献金项目数据分析 目录 数据分析06 /pandas高级操作相关案例:人口案例分析.2012美国大选献金项目数据分析 1. ...
- pandas高级操作
pandas高级操作 import numpy as np import pandas as pd from pandas import DataFrame,Series 替换操作 替换操作可以同步作 ...
- 一句Python,一句R︱pandas模块——高级版data.frame
先学了R,最近刚刚上手python,所以想着将python和R结合起来互相对比来更好理解python.最好就是一句python,对应写一句R. pandas可谓如雷贯耳,数据处理神器. 以下符号: = ...
- pandas函数高级
一.处理丢失数据 有两种丢失数据: None np.nan(NaN) 1. None None是Python自带的,其类型为python object.因此,None不能参与到任何计算中. #查看No ...
- [Session] SessionHelper2---C#关于Session高级操作帮助类 (转载)
点击下载 SessionHelper2.rar 这个类是关于Session的一些高级操作1.添加时限制时间2.读取对象3.读取数据等等看下面代码吧 /// <summary> /// 联系 ...
- cassandra高级操作之索引、排序以及分页
本次就给大家讲讲cassandra的高级操作:索引.排序和分页:处于性能的考虑,cassandra对这些支持都比较简单,所以我们不能希望cassandra完全适用于我们的逻辑,而是应该将我们的逻辑设计 ...
- MySQL学习笔记_9_MySQL高级操作(上)
MySQL高级操作(上) 一.MySQL表复制 create table t2 like t1; #复制表结构,t2可以学习到t1所有的表结构 insert into t2 ...
- MySQL学习笔记_10_MySQL高级操作(下)
MySQL高级操作(下) 五.MySQL预处理语句 1.设置预处理stmt,传递一个数据作为where的判断条件 prepare stmt from "select * from table ...
随机推荐
- 解决javax.servlet.jsp.JspException cannot be resolved to a type
转自:https://blog.csdn.net/fengspg/article/details/41645159
- java 类加载及实例化的调用顺序
1.没有继承的情况 单独一个类的场景下,初始化顺序为依次为 静态变量和静态代码块(看两者的书写顺序),继承的基类的构造函数,成员变量,被调用的构造函数. 代码呈现: public class Test ...
- 如何在SVN服务器上创建项目
1,首先你的电脑上安装了SVN的服务器 VisualSVN-Server-3.7.1-x64.msi 2,打开SVN服务器后,可以看到分布的目录是 Repositories.Users.Groups. ...
- oracle 使用escape转义%与_匹配字符为本来含义
举例: 查找姓名为M%的员工. select * from employee where staff_name like 'M\%' escape '\';
- Codeforces 1221F Game With String 思维题
题意:有两个人玩游戏,游戏规则如下:有一个长度为n的字符串,这个字符串由 . 和 X 构成,Alice可以选择a个连续的 . 把它们变成X, Bob可以选择连续的b个 . 把它们变成X.题目中保证a ...
- CentOS7 利用systemctl添加自定义系统服务
一.命令systemctl介绍 CentOS 7.0中已经没有service命令,而是启用了systemctl服务器命令,它实际上将 service 和 chkconfig 这两个命令组合到一起. 命 ...
- LOJ2540「PKUWC2018」随机算法
又是一道被咕了很久的题 貌似从WC2019之前咕到了现在 我们用f[i][s]表示现在最大独立集的大小为i 不可选集合为s 然后转移O(n)枚举加进来的点就比较简单啦 这个的复杂度是O(2^n*n^2 ...
- 【leetcode】1038. Binary Search Tree to Greater Sum Tree
题目如下: Given the root of a binary search tree with distinct values, modify it so that every node has ...
- Dump文件的生成
一.Windows系统的任务管理器里抓dump 启动任务管理器,选中某个进程,右键,弹出菜单"创建转储文件" 注意事项: 当你在64位Windows系统上抓32位进程的dmup文件 ...
- igserver许可证过期 & 发布服务客户端浏览出现error2032
1.卸载igserver重装 卸载包IGServerForDotNet成功! 卸载包IGServerCore成功! 卸载包MiddleWare_ArcGIS成功! 卸载包SDE_DM成功! 卸载包SD ...