Pandas选择与索引
Series和DataFrame对象与Numpy数组和标准Python字典的数据索引和选择模式一样。
字典形式选择和索引
Series
In [1]: import pandas as pd
In [2]: data = pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
In [3]: data
Out[3]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
#用判断字典键的方法,判断某字段是否包含在Series的Index对象内
In [4]: 'a' in data
Out[4]: True
#用获取字典键的方法,获取Index对象
In [6]: data.keys()
Out[6]: Index(['a', 'b', 'c', 'd'], dtype='object')
#用获取字典键值对象的方法,获取Index/value映射对象
In [9]: list(data.items())
Out[9]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
#用添加字典键值对象的方法,添加Series数据
In [10]: data['e'] = 2
In [11]: data
Out[11]:
a 0.25
b 0.50
c 0.75
d 1.00
e 2.00
dtype: float64
DataFrame:相当于若干个Series对象构成的字典
In [13]: area = pd.Series({'California':43231,'Texas':53432,'New York':67632,
'Florida':89734,'Illinois':63321})
In [14]: pop = pd.Series({'California':453231,'Texas':533432,'New York':676732,
'Florida':897534,'Illinois':633241})
#以Series对象为字典对象进行创建DataFrame对象
In [15]: data = pd.DataFrame({'area':area,'pop':pop})
In [16]: data
Out[16]:
area pop
California 43231 453231
Texas 53432 533432
New York 67632 676732
Florida 89734 897534
Illinois 63321 633241
#获取某列(相当于获取字典某个键值)
In [17]: data['area']
Out[17]:
California 43231
Texas 53432
New York 67632
Florida 89734
Illinois 63321
Name: area, dtype: int64
#以属性形式获取某列
In [18]: data.area
Out[18]:
California 43231
Texas 53432
New York 67632
Florida 89734
Illinois 63321
Name: area, dtype: int64
#以字典形式赋值一个新列
In [19]: data['density'] = data['pop']/data['area']
In [20]: data
Out[20]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
数组形式选择和索引
Series相当于Numpy的一维数组
In [48]: data
Out[48]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
#以显式索引进行切片,包含最后一个索引对应的值
In [49]: data['a':'c']
Out[49]:
a 0.25
b 0.50
c 0.75
dtype: float64
#以隐式索引进行切片,不包含最后一个索引对应的值
In [50]: data[0:2]
Out[50]:
a 0.25
b 0.50
dtype: float64
#利用布尔索引进行取值
In [51]: data[(data > 0.3) & (data < 0.8)]
Out[51]:
b 0.50
c 0.75
dtype: float64
#利用花哨索引进行取值
In [52]: data[['a','d']]
Out[52]:
a 0.25
d 1.00
dtype: float64
DataFrame相当于Numpy的二维数组
In [22]: data
Out[22]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
#获取DataFrame的数组值,返回的是Numpy二维数组
In [23]: data.values
Out[23]:
array([[4.32310000e+04, 4.53231000e+05, 1.04839351e+01],
[5.34320000e+04, 5.33432000e+05, 9.98338075e+00],
[6.76320000e+04, 6.76732000e+05, 1.00060918e+01],
[8.97340000e+04, 8.97534000e+05, 1.00021619e+01],
[6.33210000e+04, 6.33241000e+05, 1.00004896e+01]])
#对DataFrame的数组进行装置,相当于Numpy二维数组的转置
In [24]: data.T
Out[24]:
California Texas New York Florida Illinois
area 43231.000000 53432.000000 67632.000000 89734.000000 63321.00000
pop 453231.000000 533432.000000 676732.000000 897534.000000 633241.00000
density 10.483935 9.983381 10.006092 10.002162 10.00049
#获得某一行的数组(方法同Numpy数组)
In [25]: data.values[0]
Out[25]: array([4.32310000e+04, 4.53231000e+05, 1.04839351e+01])
#布尔索引
In [27]: data[data > 10]
Out[27]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 NaN
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
In [28]: data[data['density']>10]
Out[28]:
area pop density
California 43231 453231 10.483935
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
索引选择器(iloc 和 loc)
为了避免在Index和Columns为整数时,造成切片和取值时的混乱,Panda提供了一些索引器(indexer)属性来进行取值。
iloc:隐私索引(顺序整数)
#Series
In [53]: data
Out[53]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In [54]: data.iloc[0:2]
Out[54]:
a 0.25
b 0.50
dtype: float64
In [55]: data.iloc[2]
Out[55]: 0.75
#DataFrame
In [22]: data
Out[22]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
In [31]: data.iloc[:3,:2]
Out[31]:
area pop
California 43231 453231
Texas 53432 533432
New York 67632 676732
In [32]: data.iloc[0,2] = 100
In [33]: data
Out[33]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
In [34]: data.iloc[3,2]
Out[34]: 10.002161945305012
In [35]: data.iloc[:3,2]
Out[35]:
California 100.000000
Texas 9.983381
New York 10.006092
Name: density, dtype: float64
In [36]: data.iloc[3,:2]
Out[36]:
area 89734.0
pop 897534.0
Name: Florida, dtype: float64
#切出的是DataFrame结构
In [37]: data.iloc[2:3,:2]
Out[37]:
area pop
New York 67632 676732
loc:显示索引(自定义)
#Series
In [53]: data
Out[53]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In [56]: data.loc['b']
Out[56]: 0.5
In [57]: data.loc['a':'b']
Out[57]:
a 0.25
b 0.50
dtype: float64
In [58]: data.loc[['a','c']]
Out[58]:
a 0.25
c 0.75
dtype: float64
#DataFrame
In [38]: data
Out[38]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
#显性索引切片时与整数切片不同,最后一个切片位置也要包含在内
In [39]: data['California':'Texas']
Out[39]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
In [40]: data.loc['California':'Texas']
Out[40]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
In [41]: data.loc[data['density']>10,['area','density']]
Out[41]:
area density
California 43231 100.000000
New York 67632 10.006092
Florida 89734 10.002162
Illinois 63321 10.000490
In [44]: data.loc[:,['pop']]
Out[44]:
pop
California 453231
Texas 533432
New York 676732
Florida 897534
Illinois 633241
多级索引应用(Hierarchical Indexing)
为解决高维数组表达的需求而引入多级索引(Hierarchical indexing)。Pandas的MultiIndex类型提供了丰富的操作手法。
多级索引创建
#直接创建,index写成多级模式,MultiIndex在后台自动创建
In [3]: data = pd.DataFrame(np.random.rand(4,2),
index = [['a','a','b','b'],[1,2,1,2]],
columns=['data1','data2'])
In [4]: data
Out[4]:
data1 data2
a 1 0.154166 0.773928
2 0.859236 0.777436
b 1 0.468589 0.010060
2 0.849230 0.929585
#使用元组字典进行创建,自动把元组元素分割转换成MultiIndex
In [5]: data = {('California',2000):3334323,('California',2010):33444332,
('Texas',2000):5555999,('Texas',2010):7778839}
In [6]: pd.Series(data)
Out[6]:
California 2000 3334323
2010 33444332
Texas 2000 5555999
2010 7778839
dtype: int64
#使用列表进行创建
In [59]: pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
Out[59]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],)
#使用元组进行创建
In [60]: pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
Out[60]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
#使用笛卡尔积进行创建
In [62]: pd.MultiIndex.from_product([['a','b'],[1,2]])
Out[62]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
#直接利用levels和labels(codes)进行创建
In [63]: pd.MultiIndex(levels=[['a','b'],[1,2]],labels=[[0,0,1,1],[0,1,0,1]])
D:\Anaconda\install\Scripts\ipython:1: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
Out[63]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
In [64]: pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
Out[64]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
多级索引取值与切片
Series
In [1]: import pandas as pd
In [2]: pop_index = pd.MultiIndex.from_product([['California','Texas','New York'],
['2000','2010']])
In [3]: data = [23343,33453,78893,34569,56493,89392]
In [4]: pop = pd.Series(data, index=pop_index, name='population')
In [5]: pop
Out[5]:
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
#添加索引名称
In [6]: pop.index.names = ['state','year']
In [7]: pop
Out[7]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
In [23]: pop
Out[23]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
#利用显性索引进行局部切片时,需要MultiIndex按顺序排列的(利用sort_index函数实现)
In [25]: pop
Out[25]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
In [26]: pop = pop.sort_index()
In [27]: pop
Out[27]:
state year
California 2000 23343
2010 33453
New York 2000 56493
2010 89392
Texas 2000 78893
2010 34569
Name: population, dtype: int64
#显性索引局部切片
In [28]: pop.loc['California':'New York']
Out[28]:
state year
California 2000 23343
2010 33453
New York 2000 56493
2010 89392
Name: population, dtype: int64
In [29]: pop.loc['California':'New York','2000']
Out[29]:
state year
California 2000 23343
New York 2000 56493
Name: population, dtype: int64
#显性+花哨索引取值
In [30]: pop.loc[['California','Texas']]
Out[30]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
Name: population, dtype: int64
#隐性索引局部切片
In [32]: pop.iloc[:2]
Out[32]:
state year
California 2000 23343
2010 33453
Name: population, dtype: int64
#布尔索引取值
In [33]: pop[pop > 30000]
Out[33]:
state year
California 2010 33453
New York 2000 56493
2010 89392
Texas 2000 78893
2010 34569
Name: population, dtype: int64
DataFrame
In [1]: import pandas as pd
In [4]: import numpy as np
#行标签
In [2]: index = pd.MultiIndex.from_product([[2019,2020],[1,2,3,4]],
names=['year','quarter'])
#列标签
In [3]: columns = pd.MultiIndex.from_product([['Bill','Andy','Coco'],
['Temp','HR']],names=['name','type'])
#获取模拟数据
In [12]: data = np.round(np.random.randn(8,6),1)
In [15]: data[:,::2] *10
Out[15]:
array([[ 8., -5., -10.],
[ 4., 22., 4.],
[ 14., -19., -18.],
[ 8., -1., -9.],
[ 2., -5., 4.],
[ 17., 7., 4.],
[ 0., 3., 2.],
[ 3., 11., -11.]])
In [17]: data += 37
In [18]: data
Out[18]:
array([[45. , 35.4, 32. , 38.1, 27. , 37.7],
[41. , 36.9, 59. , 35.4, 41. , 37.1],
[51. , 37.1, 18. , 37.5, 19. , 36.7],
[45. , 37.6, 36. , 37.6, 28. , 37.5],
[39. , 35.9, 32. , 37.2, 41. , 38. ],
[54. , 36.8, 44. , 38.7, 41. , 36.9],
[37. , 37. , 40. , 35.1, 39. , 37.9],
[40. , 37.5, 48. , 37.5, 26. , 35.7]])
#创建带行列标签的DataFrame
In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)
In [20]: health_data
Out[20]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year quarter
2019 1 45.0 35.4 32.0 38.1 27.0 37.7
2 41.0 36.9 59.0 35.4 41.0 37.1
3 51.0 37.1 18.0 37.5 19.0 36.7
4 45.0 37.6 36.0 37.6 28.0 37.5
2020 1 39.0 35.9 32.0 37.2 41.0 38.0
2 54.0 36.8 44.0 38.7 41.0 36.9
3 37.0 37.0 40.0 35.1 39.0 37.9
4 40.0 37.5 48.0 37.5 26.0 35.7
#直接根据标签名称索引获取值
In [21]: health_data['Bill']
Out[21]:
type Temp HR
year quarter
2019 1 45.0 35.4
2 41.0 36.9
3 51.0 37.1
4 45.0 37.6
2020 1 39.0 35.9
2 54.0 36.8
3 37.0 37.0
4 40.0 37.5
In [22]: health_data['Bill', 'HR']
Out[22]:
year quarter
2019 1 35.4
2 36.9
3 37.1
4 37.6
2020 1 35.9
2 36.8
3 37.0
4 37.5
Name: (Bill, HR), dtype: float64
#隐性索引获取值
In [24]: health_data.iloc[:2,:2]
Out[24]:
name Bill
type Temp HR
year quarter
2019 1 45.0 35.4
2 41.0 36.9
#显性索引获取值
In [23]: health_data.loc[:,('Bill','HR')]
Out[23]:
year quarter
2019 1 35.4
2 36.9
3 37.1
4 37.6
2020 1 35.9
2 36.8
3 37.0
4 37.5
Name: (Bill, HR), dtype: float64
#使用IndexSlice对象对显性索引进行分割
In [25]: idx = pd.IndexSlice
In [28]: health_data.loc[idx[:,1],idx[:,'HR']]
Out[28]:
name Bill Andy Coco
type HR HR HR
year quarter
2019 1 35.4 38.1 37.7
2020 1 35.9 37.2 38.0
In [29]: health_data.loc[idx[:,2],idx[:,'HR']]
Out[29]:
name Bill Andy Coco
type HR HR HR
year quarter
2019 2 36.9 35.4 37.1
2020 2 36.8 38.7 36.9
多级索引排序和转换
In [18]: frame = pd.DataFrame(np.arange(12).reshape((4,3)),
...: index=[list('aabb'),list('1212')],
...: columns = [['Ohio','Ohio','Colorado'],
...: ['Green','Red','Green']])
In [19]: frame
Out[19]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [20]: frame.index.names = ['key1','key2']
In [21]: frame.columns.names = ['state','color']
In [22]: frame
Out[22]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
#交换Index
In [26]: frame.swaplevel('key1','key2')
Out[26]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
In [27]: frame
Out[27]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [28]: frame.sort_index(level=1)
Out[28]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [29]: frame.swaplevel(0,1).sort_index(level=0)
Out[29]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
多级索引行列转换和重置(stack、unstack、set_index、reset_index)
Series
In [6]: pop
Out[6]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
#通过level参数设置转换索引的层级
In [7]: pop.unstack(level=0)
Out[7]:
state California New York Texas
year
2000 23343 56493 78893
2010 33453 89392 34569
In [8]: pop.unstack(level=1)
Out[8]:
year 2000 2010
state
California 23343 33453
New York 56493 89392
Texas 78893 34569
In [9]: pop.unstack(level='year')
Out[9]:
year 2000 2010
state
California 23343 33453
New York 56493 89392
Texas 78893 34569
In [10]: pop.unstack(level='year').stack(level='year')
Out[10]:
state year
California 2000 23343
2010 33453
New York 2000 56493
2010 89392
Texas 2000 78893
2010 34569
dtype: int64
#reset_index()把层级索引变成列标签
In [11]: pop_flat = pop.reset_index(name='population')
In [12]: pop_flat
Out[12]:
state year population
0 California 2000 23343
1 California 2010 33453
2 Texas 2000 78893
3 Texas 2010 34569
4 New York 2000 56493
5 New York 2010 89392
#set_index()把列标签设置成层级索引
In [13]: pop_flat.set_index(['state','year'])
Out[13]:
population
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
DataFrame
In [30]: health_data
Out[30]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year quarter
2019 1 45.0 35.4 32.0 38.1 27.0 37.7
2 41.0 36.9 59.0 35.4 41.0 37.1
3 51.0 37.1 18.0 37.5 19.0 36.7
4 45.0 37.6 36.0 37.6 28.0 37.5
2020 1 39.0 35.9 32.0 37.2 41.0 38.0
2 54.0 36.8 44.0 38.7 41.0 36.9
3 37.0 37.0 40.0 35.1 39.0 37.9
4 40.0 37.5 48.0 37.5 26.0 35.7
#从列堆积到行:stack(level=*),*表示索引的序号,用隐性索引则自上而下从0开始索引,用显性索引可直接赋值索引名称
#默认level=1
In [31]: health_data.stack()
Out[31]:
name Andy Bill Coco
year quarter type
2019 1 HR 38.1 35.4 37.7
Temp 32.0 45.0 27.0
2 HR 35.4 36.9 37.1
Temp 59.0 41.0 41.0
3 HR 37.5 37.1 36.7
Temp 18.0 51.0 19.0
4 HR 37.6 37.6 37.5
Temp 36.0 45.0 28.0
2020 1 HR 37.2 35.9 38.0
Temp 32.0 39.0 41.0
2 HR 38.7 36.8 36.9
Temp 44.0 54.0 41.0
3 HR 35.1 37.0 37.9
Temp 40.0 37.0 39.0
4 HR 37.5 37.5 35.7
Temp 48.0 40.0 26.0
In [33]: health_data.stack(level=1)
Out[33]:
name Andy Bill Coco
year quarter type
2019 1 HR 38.1 35.4 37.7
Temp 32.0 45.0 27.0
2 HR 35.4 36.9 37.1
Temp 59.0 41.0 41.0
3 HR 37.5 37.1 36.7
Temp 18.0 51.0 19.0
4 HR 37.6 37.6 37.5
Temp 36.0 45.0 28.0
2020 1 HR 37.2 35.9 38.0
Temp 32.0 39.0 41.0
2 HR 38.7 36.8 36.9
Temp 44.0 54.0 41.0
3 HR 35.1 37.0 37.9
Temp 40.0 37.0 39.0
4 HR 37.5 37.5 35.7
Temp 48.0 40.0 26.0
In [32]: health_data.stack(level=0)
Out[32]:
type HR Temp
year quarter name
2019 1 Andy 38.1 32.0
Bill 35.4 45.0
Coco 37.7 27.0
2 Andy 35.4 59.0
Bill 36.9 41.0
Coco 37.1 41.0
3 Andy 37.5 18.0
Bill 37.1 51.0
Coco 36.7 19.0
4 Andy 37.6 36.0
Bill 37.6 45.0
Coco 37.5 28.0
2020 1 Andy 37.2 32.0
Bill 35.9 39.0
Coco 38.0 41.0
2 Andy 38.7 44.0
Bill 36.8 54.0
Coco 36.9 41.0
3 Andy 35.1 40.0
Bill 37.0 37.0
Coco 37.9 39.0
4 Andy 37.5 48.0
Bill 37.5 40.0
Coco 35.7 26.0
In [38]: health_data.stack(level='name')
Out[38]:
type HR Temp
year quarter name
2019 1 Andy 38.1 32.0
Bill 35.4 45.0
Coco 37.7 27.0
2 Andy 35.4 59.0
Bill 36.9 41.0
Coco 37.1 41.0
3 Andy 37.5 18.0
Bill 37.1 51.0
Coco 36.7 19.0
4 Andy 37.6 36.0
Bill 37.6 45.0
Coco 37.5 28.0
2020 1 Andy 37.2 32.0
Bill 35.9 39.0
Coco 38.0 41.0
2 Andy 38.7 44.0
Bill 36.8 54.0
Coco 36.9 41.0
3 Andy 35.1 40.0
Bill 37.0 37.0
Coco 37.9 39.0
4 Andy 37.5 48.0
Bill 37.5 40.0
Coco 35.7 26.0
#从行散开到列:unstack(level=*),*表示索引的序号,用隐性索引则自左而右从0开始索引,用显性索引可直接赋值索引名称
#默认level=1
In [35]: health_data.unstack(level=0)
Out[35]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020
quarter
1 45.0 39.0 35.4 35.9 32.0 32.0 38.1 37.2 27.0 41.0 37.7 38.0
2 41.0 54.0 36.9 36.8 59.0 44.0 35.4 38.7 41.0 41.0 37.1 36.9
3 51.0 37.0 37.1 37.0 18.0 40.0 37.5 35.1 19.0 39.0 36.7 37.9
4 45.0 40.0 37.6 37.5 36.0 48.0 37.6 37.5 28.0 26.0 37.5 35.7
In [36]: health_data.unstack(level=1)
Out[36]:
name Bill Andy ... Coco
type Temp HR Temp ... HR Temp HR
quarter 1 2 3 4 1 2 3 4 1 ... 4 1 2 3 4 1 2 3 4
year ...
2019 45.0 41.0 51.0 45.0 35.4 36.9 37.1 37.6 32.0 ... 37.6 27.0 41.0 19.0 28.0 37.7 37.1 36.7 37.5
2020 39.0 54.0 37.0 40.0 35.9 36.8 37.0 37.5 32.0 ... 37.5 41.0 41.0 39.0 26.0 38.0 36.9 37.9 35.7
[2 rows x 24 columns]
In [39]: health_data.unstack(level='year')
Out[39]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020
quarter
1 45.0 39.0 35.4 35.9 32.0 32.0 38.1 37.2 27.0 41.0 37.7 38.0
2 41.0 54.0 36.9 36.8 59.0 44.0 35.4 38.7 41.0 41.0 37.1 36.9
3 51.0 37.0 37.1 37.0 18.0 40.0 37.5 35.1 19.0 39.0 36.7 37.9
4 45.0 40.0 37.6 37.5 36.0 48.0 37.6 37.5 28.0 26.0 37.5 35.7
多级索引数据统计方法
In [23]: health_data
Out[23]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year quarter
2019 1 37.1 35.3 37.2 37.5 36.2 37.6
2 36.9 37.0 37.8 35.9 37.5 36.6
3 36.8 34.7 35.4 37.5 37.0 38.2
4 36.8 37.6 36.2 37.4 36.4 36.7
2020 1 37.5 34.8 36.8 36.5 36.3 37.4
2 36.2 36.7 37.3 37.2 36.3 36.4
3 37.7 38.1 36.9 37.7 36.6 36.1
4 36.5 36.3 37.6 36.3 35.5 37.3
#level参数设置累计的索引层级
In [24]: data_mean = health_data.mean(level='year')
In [25]: data_mean
Out[25]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year
2019 36.900 36.150 36.65 37.075 36.775 37.275
2020 36.975 36.475 37.15 36.925 36.175 36.800
#axis参数设置累计的轴
In [26]: data_mean.mean(axis=1,level='type')
Out[26]:
type Temp HR
year
2019 36.775000 36.833333
2020 36.766667 36.733333
Pandas选择与索引的更多相关文章
- 如何选择普通索引和唯一索引《死磕MySQL系列 五》
系列文章 一.原来一条select语句在MySQL是这样执行的<死磕MySQL系列 一> 二.一生挚友redo log.binlog<死磕MySQL系列 二> 三.MySQL强 ...
- pandas选择数据-【老鱼学pandas】
选择列 根据列名来选择某列的数据 import pandas as pd import numpy as np dates = pd.date_range("2017-01-08" ...
- pandas选择单元格,选择行列
首先创建示例df: df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'), index=list('5678')) d ...
- pandas 之 多层索引
In many applications, data may be spread across a number of files or datasets or be arranged in a fo ...
- numpy和pandas的基础索引切片
Numpy的索引切片 索引 In [72]: arr = np.array([[[1,1,1],[2,2,2]],[[3,3,3],[4,4,4]]]) In [73]: arr Out[73]: a ...
- 利用Python进行数据分析(11) pandas基础: 层次化索引
层次化索引 层次化索引指你能在一个数组上拥有多个索引,例如: 有点像Excel里的合并单元格对么? 根据索引选择数据子集 以外层索引的方式选择数据子集: 以内层索引的方式选择数据: 多重索引S ...
- 深入学习jQuery选择器系列第三篇——过滤选择器之索引选择器
× 目录 [1]通用形式 [2]首尾索引 [3]奇偶索引[4]范围索引 前面的话 上一篇介绍了过滤选择器中的子元素选择器部分,本文开始介绍极易与之混淆的索引选择器 通用形式 $(':eq(index) ...
- 但从谈论性能点SQL Server选择聚集索引键
简单介绍 在SQL Server中,数据是按页进行存放的.而为表加上聚集索引后,SQL Server对于数据的查找就是依照聚集索引的列作为keyword进行了. 因此对于聚集索引的选择对性能的影响就变 ...
- pandas 选择列或者添加列生成新的DataFrame
选择某些列 import pandas as pd # 从Excel中读取数据,生成DataFrame数据 # 导入Excel路径和sheet name df = pd.read_excel(exce ...
- pandas基础用法——索引
# -*- coding: utf-8 -*- # Time : 2016/11/28 15:14 # Author : XiaoDeng # version : python3.5 # Softwa ...
随机推荐
- 【Azure Entra ID】如何在中国区获取用户 StrongAuthenticationUserDetails 和 StrongAuthenticationMethods 信息
问题描述 如何在中国区获取用户 StrongAuthenticationUserDetails 和 StrongAuthenticationMethods 信息 ? StrongAuthenticat ...
- C# 多线程(17):小总结
前言 本篇内容是小总结和过渡,看完这篇后,就要开始继续学习 C# 多线程中的知识点啦~. 前面,经过 16 篇的学习,我们学习了多线程.锁.线程池.任务.同步.异步等知识,还没有使用到 async.a ...
- C#多线程(7):手动线程通知
目录 区别与示例 ManualResetEvent 类 ManualResetEventSlim 区别与示例 AutoResetEvent 和 ManualResetEvent 十分相似.两者之间的区 ...
- Ubuntu中安装使用QEMU/KVM/virt-manager运行虚拟机
本文为原创,原文发布于个人博客网站:Ubuntu中安装使用QEMU/KVM/virt-manager运行虚拟机 有时候我们需要在同一台计算机中使用多种不同操作系统环境,基于已有的同一堆硬件资源来获得不 ...
- vue3使用路由keep-alive和监听路由实现transition
随着vue3.0的发布,vue-router发布了4.0版本,文档 很明了,提供了vue2路由到vue3的变化和写法指导. vue2: // transition <transition nam ...
- ACS 构词法 + 44个后缀 记忆方法
抖音号:九词君-不用语法学好英语
- RaspberryPi 树莓派 omxplayer 的音量设置方法实测有效!
正确的设置初始音量的方法,其中n/100=你期望的初始dB值, omxplayer --vol n target.media omxplayer介绍 omxplayer 在github中的描述是一个弃 ...
- day34-IO流01
IO流01 1.文件基础知识 什么是文件? 文件,我们并不陌生.文件是保存数据的地方.比如大家经常使用的word文档,txt文件,excel文件等,都是文件.它既可以保存一张图片,也可以保存声音.视频 ...
- display标签简介
下面是网友总结的display标签的优缺点: 1. 分页 如果想对代码分页,只需在display:table标签中添加一项pagesize="每页显示行数" 2. 对列排序 dis ...
- 访问Webapp目录下面的html文件变为代码
一.问题由来 一位朋友在学习使用Servlet做练习的时候,突然出现一个问题,他去访问自己创建的html文件时,发现返回的数据是html代码,而不是解析后的页面. 很是疑惑,自己尝试着解决这个问题,很 ...