Pandas Selection and Indexing

Series and DataFrame objects support the same data indexing and selection patterns as NumPy arrays and standard Python dictionaries.

Dictionary-style selection and indexing

Series

In [1]: import pandas as pd

In [2]: data = pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])

In [3]: data

Out[3]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

# As with dictionary keys, use `in` to test whether a label is in the Series Index

In [4]: 'a' in data

Out[4]: True

# As with dict.keys(), get the Index object

In [6]: data.keys()

Out[6]: Index(['a', 'b', 'c', 'd'], dtype='object')

# As with dict.items(), get the index/value pairs

In [9]: list(data.items())

Out[9]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

# As with adding a dictionary key/value pair, assign to a new label to extend the Series

In [10]: data['e'] = 2

In [11]: data

Out[11]:

a    0.25

b    0.50

c    0.75

d    1.00

e    2.00

dtype: float64
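The dictionary-style operations above can be collected into one short script. This is a minimal sketch that reuses the same Series; the extra check `0.25 in data` and the `to_dict()` call are additions for illustration and are not part of the original session.

```python
import pandas as pd

# Same Series as in the session above
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

has_label = 'a' in data        # membership tests the index labels...
has_value = 0.25 in data       # ...not the values, so this is False
labels = list(data.keys())     # keys() returns the Index
pairs = list(data.items())     # (label, value) pairs

data['e'] = 2                  # assigning to a new label appends an entry
as_dict = data.to_dict()       # convert back to a plain dictionary

print(has_label, has_value, labels)
print(as_dict)
```

Note that `in` checks labels only, exactly like a dictionary checks keys.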

DataFrame: comparable to a dictionary of Series objects

In [13]: area = pd.Series({'California':43231,'Texas':53432,'New York':67632,

                           'Florida':89734,'Illinois':63321})

In [14]: pop = pd.Series({'California':453231,'Texas':533432,'New York':676732,

                          'Florida':897534,'Illinois':633241})

# Build a DataFrame from a dictionary of Series objects

In [15]: data = pd.DataFrame({'area':area,'pop':pop})

In [16]: data

Out[16]:

             area     pop

California  43231  453231

Texas       53432  533432

New York    67632  676732

Florida     89734  897534

Illinois    63321  633241

# Get a column (like looking up a dictionary value by its key)

In [17]: data['area']

Out[17]:

California    43231

Texas         53432

New York      67632

Florida       89734

Illinois      63321

Name: area, dtype: int64

# Get a column by attribute access

In [18]: data.area

Out[18]:

California    43231

Texas         53432

New York      67632

Florida       89734

Illinois      63321

Name: area, dtype: int64

# Assign a new column using dictionary syntax

In [19]: data['density'] = data['pop']/data['area']

In [20]: data

Out[20]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432   9.983381

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490
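Attribute access is convenient, but it only works when the column name does not collide with an existing DataFrame attribute. The sketch below uses a two-state subset of the data above (an assumption made to keep it short) to show that `data.pop` is the `DataFrame.pop` method rather than the `pop` column.

```python
import pandas as pd

area = pd.Series({'California': 43231, 'Texas': 53432})
pop = pd.Series({'California': 453231, 'Texas': 533432})
data = pd.DataFrame({'area': area, 'pop': pop})

# 'area' does not clash with a DataFrame attribute, so both forms agree
same_column = data.area.equals(data['area'])

# 'pop' clashes with the DataFrame.pop() method: attribute access returns the method
pop_is_method = callable(data.pop)
pop_column = data['pop']       # dictionary syntax always returns the column

# Dictionary syntax is the reliable way to add a column
data['density'] = data['pop'] / data['area']

print(same_column, pop_is_method, list(data.columns))
```

For this reason dictionary syntax is the safer default, especially for assignment.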

Array-style selection and indexing

Series: comparable to a one-dimensional NumPy array

In [48]: data

Out[48]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

# Slicing by explicit index: the value at the final label is included

In [49]: data['a':'c']

Out[49]:

a    0.25

b    0.50

c    0.75

dtype: float64

# Slicing by implicit (positional) index: the value at the final position is excluded

In [50]: data[0:2]

Out[50]:

a    0.25

b    0.50

dtype: float64

# Select values with a boolean mask

In [51]: data[(data > 0.3) & (data < 0.8)]

Out[51]:

b    0.50

c    0.75

dtype: float64

# Select values with fancy indexing

In [52]: data[['a','d']]

Out[52]:

a    0.25

d    1.00

dtype: float64
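The four array-style selections can be compared side by side. This is a small sketch on the same Series; it uses `iloc` for the positional slice (a choice made here to keep the intent explicit, where the session above used plain `data[0:2]`).

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data['a':'c']                     # label slice: end label 'c' is included
by_position = data.iloc[0:2]                 # positional slice: position 2 is excluded
masked = data[(data > 0.3) & (data < 0.8)]   # boolean mask; each comparison needs parentheses
fancy = data[['a', 'd']]                     # list of labels

print(list(by_label.index), list(by_position.index))
print(list(masked.index), list(fancy.index))
```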

DataFrame: comparable to a two-dimensional NumPy array

In [22]: data

Out[22]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432   9.983381

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

# Get the values of the DataFrame; the result is a two-dimensional NumPy array

In [23]: data.values

Out[23]:

array([[4.32310000e+04, 4.53231000e+05, 1.04839351e+01],

       [5.34320000e+04, 5.33432000e+05, 9.98338075e+00],

       [6.76320000e+04, 6.76732000e+05, 1.00060918e+01],

       [8.97340000e+04, 8.97534000e+05, 1.00021619e+01],

       [6.33210000e+04, 6.33241000e+05, 1.00004896e+01]])

# Transpose the DataFrame, just like transposing a two-dimensional NumPy array

In [24]: data.T

Out[24]:

            California          Texas       New York        Florida      Illinois

area      43231.000000   53432.000000   67632.000000   89734.000000   63321.00000

pop      453231.000000  533432.000000  676732.000000  897534.000000  633241.00000

density      10.483935       9.983381      10.006092      10.002162      10.00049

# Get one row as an array (same as with a NumPy array)

In [25]: data.values[0]

Out[25]: array([4.32310000e+04, 4.53231000e+05, 1.04839351e+01])

# Boolean indexing

In [27]: data[data > 10]

Out[27]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432        NaN

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

In [28]: data[data['density']>10]

Out[28]:

             area     pop    density

California  43231  453231  10.483935

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490
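The two boolean selections above behave differently: a DataFrame-shaped mask works cell by cell, while a Series-shaped mask filters whole rows. A minimal sketch on a two-row subset of the data (the subset is an assumption for brevity):

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [43231, 53432], 'pop': [453231, 533432]},
    index=['California', 'Texas'],
)
data['density'] = data['pop'] / data['area']

# DataFrame-shaped mask: the shape is kept and failing cells become NaN
cellwise = data[data > 10]

# Series-shaped mask: rows where the condition is False are dropped
rowwise = data[data['density'] > 10]

print(cellwise)
print(rowwise)
```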

Indexers (iloc and loc)

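When the Index or the column labels are integers, slicing and selection can become ambiguous. To avoid this confusion, Pandas provides indexer attributes for selecting values.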
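The ambiguity is easiest to see with integer labels that do not match the positions. The Series `s` below is a hypothetical example, not taken from the sessions in this article.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]            # label 1 -> 'a'
by_position = s.iloc[1]        # position 1 -> 'b'
label_slice = s.loc[1:3]       # labels 1 through 3, end included -> 'a', 'b'
position_slice = s.iloc[1:3]   # positions 1 and 2, end excluded -> 'b', 'c'

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Spelling out `loc` or `iloc` makes it clear to the reader which of the two meanings is intended.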

iloc: implicit index (positional integers)

#Series

In [53]: data

Out[53]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

In [54]: data.iloc[0:2]

Out[54]:

a    0.25

b    0.50

dtype: float64

In [55]: data.iloc[2]

Out[55]: 0.75

#DataFrame

In [22]: data

Out[22]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432   9.983381

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

In [31]: data.iloc[:3,:2]

Out[31]:

             area     pop

California  43231  453231

Texas       53432  533432

New York    67632  676732

In [32]: data.iloc[0,2] = 100

In [33]: data

Out[33]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

New York    67632  676732   10.006092

Florida     89734  897534   10.002162

Illinois    63321  633241   10.000490

In [34]: data.iloc[3,2]

Out[34]: 10.002161945305012

In [35]: data.iloc[:3,2]

Out[35]:

California    100.000000

Texas           9.983381

New York       10.006092

Name: density, dtype: float64

In [36]: data.iloc[3,:2]

Out[36]:

area     89734.0

pop     897534.0

Name: Florida, dtype: float64

# A slice on both axes returns a DataFrame

In [37]: data.iloc[2:3,:2]

Out[37]:

           area     pop

New York  67632  676732

loc: explicit index (user-defined labels)

#Series

In [53]: data

Out[53]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

In [56]: data.loc['b']

Out[56]: 0.5

In [57]: data.loc['a':'b']

Out[57]:

a    0.25

b    0.50

dtype: float64

In [58]: data.loc[['a','c']]

Out[58]:

a    0.25

c    0.75

dtype: float64

#DataFrame

In [38]: data

Out[38]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

New York    67632  676732   10.006092

Florida     89734  897534   10.002162

Illinois    63321  633241   10.000490

# Unlike integer slicing, slicing by explicit label includes the end label

In [39]: data['California':'Texas']

Out[39]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

In [40]: data.loc['California':'Texas']

Out[40]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

In [41]: data.loc[data['density']>10,['area','density']]

Out[41]:

             area     density

California  43231  100.000000

New York    67632   10.006092

Florida     89734   10.002162

Illinois    63321   10.000490

In [44]: data.loc[:,['pop']]

Out[44]:

               pop

California  453231

Texas       533432

New York    676732

Florida     897534

Illinois    633241
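As a recap of both indexers on a DataFrame, the sketch below uses a three-row subset of the data (an assumption for brevity). The last two lines show one possible way to mix a positional row with a column label, by converting the label to a position with `Index.get_loc`.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [43231, 53432, 67632], 'pop': [453231, 533432, 676732]},
    index=['California', 'Texas', 'New York'],
)

first_two = data.iloc[:2, :1]                     # positions: rows 0-1, column 0
through_texas = data.loc[:'Texas', :'area']       # labels: both end labels are included
big = data.loc[data['pop'] > 500000, ['area']]    # boolean mask for rows, label list for columns

# Mixing positions and labels: convert the column label to a position first
col_pos = data.columns.get_loc('pop')
value = data.iloc[0, col_pos]

print(first_two.equals(through_texas), list(big.index), value)
```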

Hierarchical Indexing (MultiIndex)

Hierarchical indexing was introduced to represent higher-dimensional data. The Pandas MultiIndex type provides a rich set of operations for working with it.

Creating a MultiIndex

# Direct creation: pass the index as a list of arrays and a MultiIndex is built behind the scenes

In [2]: import numpy as np

In [3]: data = pd.DataFrame(np.random.rand(4,2),

                            index = [['a','a','b','b'],[1,2,1,2]],

                            columns=['data1','data2'])

In [4]: data

Out[4]:

        data1     data2

a 1  0.154166  0.773928

  2  0.859236  0.777436

b 1  0.468589  0.010060

  2  0.849230  0.929585

# Create from a dictionary keyed by tuples; the tuples are split and converted to a MultiIndex automatically

In [5]: data = {('California',2000):3334323,('California',2010):33444332,

                ('Texas',2000):5555999,('Texas',2010):7778839}

In [6]: pd.Series(data)

Out[6]:

California  2000     3334323

            2010    33444332

Texas       2000     5555999

            2010     7778839

dtype: int64

# Create from a list of arrays

In [59]: pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])

Out[59]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

# Create from a list of tuples

In [60]: pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])

Out[60]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

# Create from a Cartesian product

In [62]: pd.MultiIndex.from_product([['a','b'],[1,2]])

Out[62]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

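# Create directly from levels and codes (the older `labels` keyword is deprecated in favor of `codes`, as the warning below shows)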

In [63]: pd.MultiIndex(levels=[['a','b'],[1,2]],labels=[[0,0,1,1],[0,1,0,1]])

D:\Anaconda\install\Scripts\ipython:1: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead

Out[63]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

In [64]: pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])

Out[64]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )
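The constructors above all describe the same index. A quick way to convince yourself is to build it four ways and compare with `MultiIndex.equals`. The `names=` argument in the last line, and the level names `key` and `num`, are illustrative additions.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
from_codes = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# All four constructions yield the same entries in the same order
all_equal = all(from_arrays.equals(other)
                for other in (from_tuples, from_product, from_codes))

# Level names can be supplied at construction time
named = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['key', 'num'])

print(all_equal, list(named.names))
```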

MultiIndex selection and slicing

Series

In [1]: import pandas as pd

In [2]: pop_index = pd.MultiIndex.from_product([['California','Texas','New York'],

                                                ['2000','2010']])

In [3]: data = [23343,33453,78893,34569,56493,89392]

In [4]: pop = pd.Series(data, index=pop_index, name='population')

In [5]: pop

Out[5]:

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

# Name the index levels

In [6]: pop.index.names = ['state','year']

In [7]: pop

Out[7]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

# Partial slicing by explicit label requires a sorted MultiIndex (use sort_index() to sort it)

In [25]: pop

Out[25]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

In [26]: pop = pop.sort_index()

In [27]: pop

Out[27]:

state       year

California  2000    23343

            2010    33453

New York    2000    56493

            2010    89392

Texas       2000    78893

            2010    34569

Name: population, dtype: int64

# Partial slicing by explicit label

In [28]: pop.loc['California':'New York']

Out[28]:

state       year

California  2000    23343

            2010    33453

New York    2000    56493

            2010    89392

Name: population, dtype: int64

In [29]: pop.loc['California':'New York','2000']

Out[29]:

state       year

California  2000    23343

New York    2000    56493

Name: population, dtype: int64

# Explicit labels combined with fancy indexing

In [30]: pop.loc[['California','Texas']]

Out[30]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

Name: population, dtype: int64

# Partial slicing by implicit (positional) index

In [32]: pop.iloc[:2]

Out[32]:

state       year

California  2000    23343

            2010    33453

Name: population, dtype: int64

# Select values with a boolean mask

In [33]: pop[pop > 30000]

Out[33]:

state       year

California  2010    33453

New York    2000    56493

            2010    89392

Texas       2000    78893

            2010    34569

Name: population, dtype: int64
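The sorting requirement can be checked directly. The sketch below rebuilds the `pop` Series from the session, attempts a label slice on the unsorted index, and then repeats it after `sort_index()`. The `try`/`except` is an illustrative addition; on the pandas versions I am aware of, the unsorted slice raises `UnsortedIndexError`, which is a `KeyError` subclass.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'Texas', 'New York'], ['2000', '2010']],
    names=['state', 'year'],
)
pop = pd.Series([23343, 33453, 78893, 34569, 56493, 89392],
                index=index, name='population')

# 'Texas' comes before 'New York', so the index is not sorted
sorted_before = pop.index.is_monotonic_increasing

try:
    pop.loc['California':'New York']
    slice_error = None
except KeyError as err:        # expected: UnsortedIndexError, a KeyError subclass
    slice_error = type(err).__name__

pop_sorted = pop.sort_index()
states = pop_sorted.loc['California':'New York']   # works once the index is sorted
year_2000 = pop_sorted.loc[:, '2000']              # inner level, across all states

print(sorted_before, slice_error)
print(year_2000)
```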

DataFrame

In [1]: import pandas as pd

In [4]: import numpy as np

# Row labels

In [2]: index = pd.MultiIndex.from_product([[2019,2020],[1,2,3,4]],

                                           names=['year','quarter'])

# Column labels

In [3]: columns = pd.MultiIndex.from_product([['Bill','Andy','Coco'],

                                              ['Temp','HR']],names=['name','type'])

# Generate mock data

In [12]: data = np.round(np.random.randn(8,6),1)

In [15]: data[:,::2] *10

Out[15]:

array([[  8.,  -5., -10.],

       [  4.,  22.,   4.],

       [ 14., -19., -18.],

       [  8.,  -1.,  -9.],

       [  2.,  -5.,   4.],

       [ 17.,   7.,   4.],

       [  0.,   3.,   2.],

       [  3.,  11., -11.]])

In [16]: data[:,::2] *= 10

In [17]: data += 37

In [18]: data

Out[18]:

array([[45. , 35.4, 32. , 38.1, 27. , 37.7],

       [41. , 36.9, 59. , 35.4, 41. , 37.1],

       [51. , 37.1, 18. , 37.5, 19. , 36.7],

       [45. , 37.6, 36. , 37.6, 28. , 37.5],

       [39. , 35.9, 32. , 37.2, 41. , 38. ],

       [54. , 36.8, 44. , 38.7, 41. , 36.9],

       [37. , 37. , 40. , 35.1, 39. , 37.9],

       [40. , 37.5, 48. , 37.5, 26. , 35.7]])

# Create a DataFrame with hierarchical row and column labels

In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)

In [20]: health_data

Out[20]:

name          Bill        Andy        Coco

type          Temp    HR  Temp    HR  Temp    HR

year quarter

2019 1        45.0  35.4  32.0  38.1  27.0  37.7

     2        41.0  36.9  59.0  35.4  41.0  37.1

     3        51.0  37.1  18.0  37.5  19.0  36.7

     4        45.0  37.6  36.0  37.6  28.0  37.5

2020 1        39.0  35.9  32.0  37.2  41.0  38.0

     2        54.0  36.8  44.0  38.7  41.0  36.9

     3        37.0  37.0  40.0  35.1  39.0  37.9

     4        40.0  37.5  48.0  37.5  26.0  35.7

# Select values directly by column label

In [21]: health_data['Bill']

Out[21]:

type          Temp    HR

year quarter

2019 1        45.0  35.4

     2        41.0  36.9

     3        51.0  37.1

     4        45.0  37.6

2020 1        39.0  35.9

     2        54.0  36.8

     3        37.0  37.0

     4        40.0  37.5

In [22]: health_data['Bill', 'HR']

Out[22]:

year  quarter

2019  1          35.4

      2          36.9

      3          37.1

      4          37.6

2020  1          35.9

      2          36.8

      3          37.0

      4          37.5

Name: (Bill, HR), dtype: float64

# Select values by implicit (positional) index

In [24]: health_data.iloc[:2,:2]

Out[24]:

name          Bill

type          Temp    HR

year quarter

2019 1        45.0  35.4

     2        41.0  36.9

# Select values by explicit label

In [23]: health_data.loc[:,('Bill','HR')]

Out[23]:

year  quarter

2019  1          35.4

      2          36.9

      3          37.1

      4          37.6

2020  1          35.9

      2          36.8

      3          37.0

      4          37.5

Name: (Bill, HR), dtype: float64

# Use an IndexSlice object to slice explicit labels level by level

In [25]: idx = pd.IndexSlice

In [28]: health_data.loc[idx[:,1],idx[:,'HR']]

Out[28]:

name          Bill  Andy  Coco

type            HR    HR    HR

year quarter

2019 1        35.4  38.1  37.7

2020 1        35.9  37.2  38.0

In [29]: health_data.loc[idx[:,2],idx[:,'HR']]

Out[29]:

name          Bill  Andy  Coco

type            HR    HR    HR

year quarter

2019 2        36.9  35.4  37.1

2020 2        36.8  38.7  36.9
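Because the session data is random, the selections are hard to verify by eye. The sketch below uses a smaller, deterministic frame (`np.arange` values, two quarters, two names, all of which are assumptions for illustration) so that the effect of `pd.IndexSlice` can be checked against known numbers.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=['year', 'quarter'])
columns = pd.MultiIndex.from_product([['Andy', 'Bill'], ['HR', 'Temp']],
                                     names=['name', 'type'])
# Deterministic values 0..15 so each cell is easy to trace
health = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

idx = pd.IndexSlice
# Rows: every year, quarter 1. Columns: every name, type 'HR'.
q1_hr = health.loc[idx[:, 1], idx[:, 'HR']]

# A tuple of column labels selects a single column as a Series
bill_hr = health['Bill', 'HR']

print(q1_hr)
print(bill_hr.tolist())
```

Inside `idx[...]` each position corresponds to one index level, and `:` means "all values of that level".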

Sorting and reordering MultiIndex levels

In [18]: frame = pd.DataFrame(np.arange(12).reshape((4,3)),

    ...:                     index=[list('aabb'),list('1212')],

    ...:                     columns = [['Ohio','Ohio','Colorado'],

    ...:                                 ['Green','Red','Green']])

In [19]: frame

Out[19]:

     Ohio     Colorado

    Green Red    Green

a 1     0   1        2

  2     3   4        5

b 1     6   7        8

  2     9  10       11

In [20]: frame.index.names = ['key1','key2']

In [21]: frame.columns.names = ['state','color']

In [22]: frame

Out[22]:

state      Ohio     Colorado

color     Green Red    Green

key1 key2

a    1        0   1        2

     2        3   4        5

b    1        6   7        8

     2        9  10       11

# Swap the index levels (returns a new object; frame itself is unchanged)

In [26]: frame.swaplevel('key1','key2')

Out[26]:

state      Ohio     Colorado

color     Green Red    Green

key2 key1

1    a        0   1        2

2    a        3   4        5

1    b        6   7        8

2    b        9  10       11

In [27]: frame

Out[27]:

state      Ohio     Colorado

color     Green Red    Green

key1 key2

a    1        0   1        2

     2        3   4        5

b    1        6   7        8

     2        9  10       11

In [28]: frame.sort_index(level=1)

Out[28]:

state      Ohio     Colorado

color     Green Red    Green

key1 key2

a    1        0   1        2

b    1        6   7        8

a    2        3   4        5

b    2        9  10       11

In [29]: frame.swaplevel(0,1).sort_index(level=0)

Out[29]:

state      Ohio     Colorado

color     Green Red    Green

key2 key1

1    a        0   1        2

     b        6   7        8

2    a        3   4        5

     b        9  10       11
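The same frame can be rebuilt in a script to confirm the resulting row order after each call. This is a sketch of the session above; passing `names=` to `from_arrays` instead of assigning the names afterwards is a small variation.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([list('aabb'), list('1212')],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']),
)

swapped = frame.swaplevel('key1', 'key2')               # levels swapped, row order unchanged
sorted_inner = frame.sort_index(level='key2')           # rows sorted by the inner level
regrouped = frame.swaplevel(0, 1).sort_index(level=0)   # inner level becomes the sorted outer level

print(list(swapped.index))
print(list(sorted_inner.index))
print(list(regrouped.index))
```

`swaplevel` only relabels the rows, so it is usually followed by `sort_index` to group the new outer level.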

Reshaping and resetting a MultiIndex (stack, unstack, set_index, reset_index)

Series

In [6]: pop

Out[6]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

# The level parameter selects which index level is moved to the columns

In [7]: pop.unstack(level=0)

Out[7]:

state  California  New York  Texas

year

2000        23343     56493  78893

2010        33453     89392  34569

In [8]: pop.unstack(level=1)

Out[8]:

year         2000   2010

state

California  23343  33453

New York    56493  89392

Texas       78893  34569

In [9]: pop.unstack(level='year')

Out[9]:

year         2000   2010

state

California  23343  33453

New York    56493  89392

Texas       78893  34569

In [10]: pop.unstack(level='year').stack(level='year')

Out[10]:

state       year

California  2000    23343

            2010    33453

New York    2000    56493

            2010    89392

Texas       2000    78893

            2010    34569

dtype: int64

# reset_index() turns the index levels into ordinary columns

In [11]: pop_flat = pop.reset_index(name='population')

In [12]: pop_flat

Out[12]:

        state  year  population

0  California  2000       23343

1  California  2010       33453

2       Texas  2000       78893

3       Texas  2010       34569

4    New York  2000       56493

5    New York  2010       89392

# set_index() turns columns into a hierarchical index

In [13]: pop_flat.set_index(['state','year'])

Out[13]:

                 population

state      year

California 2000       23343

           2010       33453

Texas      2000       78893

           2010       34569

New York   2000       56493

           2010       89392
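Both pairs of operations are inverses of each other, which the following sketch checks as a round trip. It starts from an already sorted `pop` Series (an assumption, so that row order is preserved by `unstack`). As in the session output above, the Series returned by `stack()` no longer carries the name `population`.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], ['2000', '2010']],
    names=['state', 'year'],
)
pop = pd.Series([23343, 33453, 56493, 89392, 78893, 34569],
                index=index, name='population')

wide = pop.unstack(level='year')           # inner level becomes the columns
long_again = wide.stack()                  # back to a Series with a two-level index

flat = pop.reset_index(name='population')  # index levels become ordinary columns
restored = flat.set_index(['state', 'year'])['population']

print(wide)
print(list(flat.columns))
```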

DataFrame

In [30]: health_data

Out[30]:

name          Bill        Andy        Coco

type          Temp    HR  Temp    HR  Temp    HR

year quarter

2019 1        45.0  35.4  32.0  38.1  27.0  37.7

     2        41.0  36.9  59.0  35.4  41.0  37.1

     3        51.0  37.1  18.0  37.5  19.0  36.7

     4        45.0  37.6  36.0  37.6  28.0  37.5

2020 1        39.0  35.9  32.0  37.2  41.0  38.0

     2        54.0  36.8  44.0  38.7  41.0  36.9

     3        37.0  37.0  40.0  35.1  39.0  37.9

     4        40.0  37.5  48.0  37.5  26.0  35.7

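# Stack columns into rows: stack(level=*), where * identifies the column level. By position, levels are numbered from 0, top to bottom; by name, pass the level name directly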

# The default is level=-1, the innermost column level (level 1 here)

In [31]: health_data.stack()

Out[31]:

name               Andy  Bill  Coco

year quarter type

2019 1       HR    38.1  35.4  37.7

             Temp  32.0  45.0  27.0

     2       HR    35.4  36.9  37.1

             Temp  59.0  41.0  41.0

     3       HR    37.5  37.1  36.7

             Temp  18.0  51.0  19.0

     4       HR    37.6  37.6  37.5

             Temp  36.0  45.0  28.0

2020 1       HR    37.2  35.9  38.0

             Temp  32.0  39.0  41.0

     2       HR    38.7  36.8  36.9

             Temp  44.0  54.0  41.0

     3       HR    35.1  37.0  37.9

             Temp  40.0  37.0  39.0

     4       HR    37.5  37.5  35.7

             Temp  48.0  40.0  26.0

In [33]: health_data.stack(level=1)

Out[33]:

name               Andy  Bill  Coco

year quarter type

2019 1       HR    38.1  35.4  37.7

             Temp  32.0  45.0  27.0

     2       HR    35.4  36.9  37.1

             Temp  59.0  41.0  41.0

     3       HR    37.5  37.1  36.7

             Temp  18.0  51.0  19.0

     4       HR    37.6  37.6  37.5

             Temp  36.0  45.0  28.0

2020 1       HR    37.2  35.9  38.0

             Temp  32.0  39.0  41.0

     2       HR    38.7  36.8  36.9

             Temp  44.0  54.0  41.0

     3       HR    35.1  37.0  37.9

             Temp  40.0  37.0  39.0

     4       HR    37.5  37.5  35.7

             Temp  48.0  40.0  26.0

In [32]: health_data.stack(level=0)

Out[32]:

type                 HR  Temp

year quarter name

2019 1       Andy  38.1  32.0

             Bill  35.4  45.0

             Coco  37.7  27.0

     2       Andy  35.4  59.0

             Bill  36.9  41.0

             Coco  37.1  41.0

     3       Andy  37.5  18.0

             Bill  37.1  51.0

             Coco  36.7  19.0

     4       Andy  37.6  36.0

             Bill  37.6  45.0

             Coco  37.5  28.0

2020 1       Andy  37.2  32.0

             Bill  35.9  39.0

             Coco  38.0  41.0

     2       Andy  38.7  44.0

             Bill  36.8  54.0

             Coco  36.9  41.0

     3       Andy  35.1  40.0

             Bill  37.0  37.0

             Coco  37.9  39.0

     4       Andy  37.5  48.0

             Bill  37.5  40.0

             Coco  35.7  26.0

In [38]: health_data.stack(level='name')

Out[38]:

type                 HR  Temp

year quarter name

2019 1       Andy  38.1  32.0

             Bill  35.4  45.0

             Coco  37.7  27.0

     2       Andy  35.4  59.0

             Bill  36.9  41.0

             Coco  37.1  41.0

     3       Andy  37.5  18.0

             Bill  37.1  51.0

             Coco  36.7  19.0

     4       Andy  37.6  36.0

             Bill  37.6  45.0

             Coco  37.5  28.0

2020 1       Andy  37.2  32.0

             Bill  35.9  39.0

             Coco  38.0  41.0

     2       Andy  38.7  44.0

             Bill  36.8  54.0

             Coco  36.9  41.0

     3       Andy  35.1  40.0

             Bill  37.0  37.0

             Coco  37.9  39.0

     4       Andy  37.5  48.0

             Bill  37.5  40.0

             Coco  35.7  26.0

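# Unstack rows into columns: unstack(level=*), where * identifies the row index level. By position, levels are numbered from 0, left to right; by name, pass the level name directly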

# The default is level=-1, the innermost row level (level 1 here)

In [35]: health_data.unstack(level=0)

Out[35]:

name     Bill                    Andy                    Coco

type     Temp          HR        Temp          HR        Temp          HR

year     2019  2020  2019  2020  2019  2020  2019  2020  2019  2020  2019  2020

quarter

1        45.0  39.0  35.4  35.9  32.0  32.0  38.1  37.2  27.0  41.0  37.7  38.0

2        41.0  54.0  36.9  36.8  59.0  44.0  35.4  38.7  41.0  41.0  37.1  36.9

3        51.0  37.0  37.1  37.0  18.0  40.0  37.5  35.1  19.0  39.0  36.7  37.9

4        45.0  40.0  37.6  37.5  36.0  48.0  37.6  37.5  28.0  26.0  37.5  35.7

In [36]: health_data.unstack(level=1)

Out[36]:

name     Bill                                            Andy  ...        Coco

type     Temp                      HR                    Temp  ...    HR  Temp                      HR

quarter     1     2     3     4     1     2     3     4     1  ...     4     1     2     3     4     1     2     3     4

year                                                           ...

2019     45.0  41.0  51.0  45.0  35.4  36.9  37.1  37.6  32.0  ...  37.6  27.0  41.0  19.0  28.0  37.7  37.1  36.7  37.5

2020     39.0  54.0  37.0  40.0  35.9  36.8  37.0  37.5  32.0  ...  37.5  41.0  41.0  39.0  26.0  38.0  36.9  37.9  35.7

[2 rows x 24 columns]

In [39]: health_data.unstack(level='year')

Out[39]:

name     Bill                    Andy                    Coco

type     Temp          HR        Temp          HR        Temp          HR

year     2019  2020  2019  2020  2019  2020  2019  2020  2019  2020  2019  2020

quarter

1        45.0  39.0  35.4  35.9  32.0  32.0  38.1  37.2  27.0  41.0  37.7  38.0

2        41.0  54.0  36.9  36.8  59.0  44.0  35.4  38.7  41.0  41.0  37.1  36.9

3        51.0  37.0  37.1  37.0  18.0  40.0  37.5  35.1  19.0  39.0  36.7  37.9

4        45.0  40.0  37.6  37.5  36.0  48.0  37.6  37.5  28.0  26.0  37.5  35.7
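To trace where a single value ends up after reshaping, a small deterministic frame helps. The frame below (values 0..15, two quarters, two names) is an assumption for illustration and is not the `health_data` used above. Note that some pandas 2.x releases may emit a `FutureWarning` about the `stack` implementation; the lookups shown here should not depend on it.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=['year', 'quarter'])
columns = pd.MultiIndex.from_product([['Andy', 'Bill'], ['HR', 'Temp']],
                                     names=['name', 'type'])
health = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# unstack: the 'year' row level becomes the innermost column level
by_year = health.unstack(level='year')
moved = by_year.loc[1, ('Bill', 'HR', 2020)]       # was row (2020, 1), column (Bill, HR)

# stack: the 'type' column level becomes the innermost row level
stacked = health.stack(level='type')
same_cell = stacked.loc[(2019, 1, 'HR'), 'Bill']   # was row (2019, 1), column (Bill, HR)

print(by_year.shape, list(by_year.columns.names), moved)
print(stacked.shape, list(stacked.index.names), same_cell)
```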

Aggregation on a MultiIndex

In [23]: health_data

Out[23]:

name          Bill        Andy        Coco

type          Temp    HR  Temp    HR  Temp    HR

year quarter

2019 1        37.1  35.3  37.2  37.5  36.2  37.6

     2        36.9  37.0  37.8  35.9  37.5  36.6

     3        36.8  34.7  35.4  37.5  37.0  38.2

     4        36.8  37.6  36.2  37.4  36.4  36.7

2020 1        37.5  34.8  36.8  36.5  36.3  37.4

     2        36.2  36.7  37.3  37.2  36.3  36.4

     3        37.7  38.1  36.9  37.7  36.6  36.1

     4        36.5  36.3  37.6  36.3  35.5  37.3

# The level parameter sets the index level to aggregate over

In [24]: data_mean = health_data.mean(level='year')

In [25]: data_mean

Out[25]:

name    Bill           Andy            Coco

type    Temp      HR   Temp      HR    Temp      HR

year

2019  36.900  36.150  36.65  37.075  36.775  37.275

2020  36.975  36.475  37.15  36.925  36.175  36.800

# The axis parameter sets the axis to aggregate along

In [26]: data_mean.mean(axis=1,level='type')

Out[26]:

type       Temp         HR

year

2019  36.775000  36.833333

2020  36.766667  36.733333
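The `level` argument of `mean` used above comes from older pandas. As far as the pandas release notes indicate, it was deprecated in 1.3 and removed in 2.0, where `groupby(level=...)` is the replacement. The sketch below shows one possible equivalent on a small deterministic frame (an assumption for illustration); the column-wise case is handled by transposing, which is only one of several ways to do it.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=['year', 'quarter'])
columns = pd.MultiIndex.from_product([['Andy', 'Bill'], ['HR', 'Temp']],
                                     names=['name', 'type'])
health = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                      index=index, columns=columns)

# Equivalent of health.mean(level='year'): average the quarters within each year
by_year = health.groupby(level='year').mean()

# Equivalent of by_year.mean(axis=1, level='type'): average over names per type.
# Transposing turns the column levels into row levels so groupby can be reused.
by_type = by_year.T.groupby(level='type').mean().T

print(by_year)
print(by_type)
```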
