Pandas选择与索引

Series和DataFrame对象与Numpy数组和标准Python字典的数据索引和选择模式一样。

字典形式选择和索引

Series

In [1]: import pandas as pd

In [2]: data = pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])

In [3]: data

Out[3]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

#用判断字典键的方法，判断某字段是否包含在Series的Index对象内

In [4]: 'a' in data

Out[4]: True

#用获取字典键的方法，获取Index对象

In [6]: data.keys()

Out[6]: Index(['a', 'b', 'c', 'd'], dtype='object')

#用获取字典键值对象的方法，获取Index/value映射对象

In [9]: list(data.items())

Out[9]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

#用添加字典键值对象的方法，添加Series数据

In [10]: data['e'] = 2

In [11]: data

Out[11]:

a    0.25

b    0.50

c    0.75

d    1.00

e    2.00

dtype: float64

DataFrame：相当于若干个Series对象构成的字典

In [13]: area = pd.Series({'California':43231,'Texas':53432,'New York':67632,

                           'Florida':89734,'Illinois':63321})

In [14]: pop = pd.Series({'California':453231,'Texas':533432,'New York':676732,

                          'Florida':897534,'Illinois':633241})

#以Series对象为字典对象进行创建DataFrame对象

In [15]: data = pd.DataFrame({'area':area,'pop':pop})

In [16]: data

Out[16]:

             area     pop

California  43231  453231

Texas       53432  533432

New York    67632  676732

Florida     89734  897534

Illinois    63321  633241

#获取某列（相当于获取字典某个键值）

In [17]: data['area']

Out[17]:

California    43231

Texas         53432

New York      67632

Florida       89734

Illinois      63321

Name: area, dtype: int64

#以属性形式获取某列

In [18]: data.area

Out[18]:

California    43231

Texas         53432

New York      67632

Florida       89734

Illinois      63321

Name: area, dtype: int64

#以字典形式赋值一个新列

In [19]: data['density'] = data['pop']/data['area']

In [20]: data

Out[20]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432   9.983381

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

数组形式选择和索引

Series相当于Numpy的一维数组

In [48]: data

Out[48]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

#以显式索引进行切片，包含最后一个索引对应的值

In [49]: data['a':'c']

Out[49]:

a    0.25

b    0.50

c    0.75

dtype: float64

#以隐式索引进行切片，不包含最后一个索引对应的值

In [50]: data[0:2]

Out[50]:

a    0.25

b    0.50

dtype: float64

#利用布尔索引进行取值

In [51]: data[(data > 0.3) & (data < 0.8)]

Out[51]:

b    0.50

c    0.75

dtype: float64

#利用花哨索引进行取值

In [52]: data[['a','d']]

Out[52]:

a    0.25

d    1.00

dtype: float64

DataFrame相当于Numpy的二维数组

In [22]: data

Out[22]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432   9.983381

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

#获取DataFrame的数组值，返回的是Numpy二维数组

In [23]: data.values

Out[23]:

array([[4.32310000e+04, 4.53231000e+05, 1.04839351e+01],

       [5.34320000e+04, 5.33432000e+05, 9.98338075e+00],

       [6.76320000e+04, 6.76732000e+05, 1.00060918e+01],

       [8.97340000e+04, 8.97534000e+05, 1.00021619e+01],

       [6.33210000e+04, 6.33241000e+05, 1.00004896e+01]])

#对DataFrame的数组进行装置，相当于Numpy二维数组的转置

In [24]: data.T

Out[24]:

            California          Texas       New York        Florida      Illinois

area      43231.000000   53432.000000   67632.000000   89734.000000   63321.00000

pop      453231.000000  533432.000000  676732.000000  897534.000000  633241.00000

density      10.483935       9.983381      10.006092      10.002162      10.00049

#获得某一行的数组（方法同Numpy数组）

In [25]: data.values[0]

Out[25]: array([4.32310000e+04, 4.53231000e+05, 1.04839351e+01])

#布尔索引

In [27]: data[data > 10]

Out[27]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432        NaN

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

In [28]: data[data['density']>10]

Out[28]:

             area     pop    density

California  43231  453231  10.483935

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

索引选择器（iloc 和 loc）

为了避免在Index和Columns为整数时，造成切片和取值时的混乱，Panda提供了一些索引器（indexer)属性来进行取值。

iloc：隐私索引（顺序整数）

#Series

In [53]: data

Out[53]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

In [54]: data.iloc[0:2]

Out[54]:

a    0.25

b    0.50

dtype: float64

In [55]: data.iloc[2]

Out[55]: 0.75

#DataFrame

In [22]: data

Out[22]:

             area     pop    density

California  43231  453231  10.483935

Texas       53432  533432   9.983381

New York    67632  676732  10.006092

Florida     89734  897534  10.002162

Illinois    63321  633241  10.000490

In [31]: data.iloc[:3,:2]

Out[31]:

             area     pop

California  43231  453231

Texas       53432  533432

New York    67632  676732

In [32]: data.iloc[0,2] = 100

In [33]: data

Out[33]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

New York    67632  676732   10.006092

Florida     89734  897534   10.002162

Illinois    63321  633241   10.000490

In [34]: data.iloc[3,2]

Out[34]: 10.002161945305012

In [35]: data.iloc[:3,2]

Out[35]:

California    100.000000

Texas           9.983381

New York       10.006092

Name: density, dtype: float64

In [36]: data.iloc[3,:2]

Out[36]:

area     89734.0

pop     897534.0

Name: Florida, dtype: float64

#切出的是DataFrame结构

In [37]: data.iloc[2:3,:2]

Out[37]:

           area     pop

New York  67632  676732

loc：显示索引（自定义）

#Series

In [53]: data

Out[53]:

a    0.25

b    0.50

c    0.75

d    1.00

dtype: float64

In [56]: data.loc['b']

Out[56]: 0.5

In [57]: data.loc['a':'b']

Out[57]:

a    0.25

b    0.50

dtype: float64

In [58]: data.loc[['a','c']]

Out[58]:

a    0.25

c    0.75

dtype: float64

#DataFrame

In [38]: data

Out[38]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

New York    67632  676732   10.006092

Florida     89734  897534   10.002162

Illinois    63321  633241   10.000490

#显性索引切片时与整数切片不同，最后一个切片位置也要包含在内

In [39]: data['California':'Texas']

Out[39]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

In [40]: data.loc['California':'Texas']

Out[40]:

             area     pop     density

California  43231  453231  100.000000

Texas       53432  533432    9.983381

In [41]: data.loc[data['density']>10,['area','density']]

Out[41]:

             area     density

California  43231  100.000000

New York    67632   10.006092

Florida     89734   10.002162

Illinois    63321   10.000490

In [44]: data.loc[:,['pop']]

Out[44]:

               pop

California  453231

Texas       533432

New York    676732

Florida     897534

Illinois    633241

多级索引应用(Hierarchical Indexing)

为解决高维数组表达的需求而引入多级索引（Hierarchical indexing）。Pandas的MultiIndex类型提供了丰富的操作手法。

多级索引创建

#直接创建，index写成多级模式，MultiIndex在后台自动创建

In [3]: data = pd.DataFrame(np.random.rand(4,2),

                            index = [['a','a','b','b'],[1,2,1,2]],

                            columns=['data1','data2'])

In [4]: data

Out[4]:

        data1     data2

a 1  0.154166  0.773928

  2  0.859236  0.777436

b 1  0.468589  0.010060

  2  0.849230  0.929585

#使用元组字典进行创建，自动把元组元素分割转换成MultiIndex

In [5]: data = {('California',2000):3334323,('California',2010):33444332,

                ('Texas',2000):5555999,('Texas',2010):7778839}

In [6]: pd.Series(data)

Out[6]:

California  2000     3334323

            2010    33444332

Texas       2000     5555999

            2010     7778839

dtype: int64

#使用列表进行创建

In [59]: pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])

Out[59]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],)

#使用元组进行创建

In [60]: pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])

Out[60]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

#使用笛卡尔积进行创建

In [62]: pd.MultiIndex.from_product([['a','b'],[1,2]])

Out[62]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

#直接利用levels和labels(codes)进行创建

In [63]: pd.MultiIndex(levels=[['a','b'],[1,2]],labels=[[0,0,1,1],[0,1,0,1]])

D:\Anaconda\install\Scripts\ipython:1: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead

Out[63]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

In [64]: pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])

Out[64]:

MultiIndex([('a', 1),

            ('a', 2),

            ('b', 1),

            ('b', 2)],

           )

多级索引取值与切片

Series

In [1]: import pandas as pd

In [2]: pop_index = pd.MultiIndex.from_product([['California','Texas','New York'],

                                                ['2000','2010']])

In [3]: data = [23343,33453,78893,34569,56493,89392]

In [4]: pop = pd.Series(data, index=pop_index, name='population')

In [5]: pop

Out[5]:

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

#添加索引名称

In [6]: pop.index.names = ['state','year']

In [7]: pop

Out[7]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

In [23]: pop

Out[23]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

#利用显性索引进行局部切片时，需要MultiIndex按顺序排列的（利用sort_index函数实现）

In [25]: pop

Out[25]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

In [26]: pop = pop.sort_index()

In [27]: pop

Out[27]:

state       year

California  2000    23343

            2010    33453

New York    2000    56493

            2010    89392

Texas       2000    78893

            2010    34569

Name: population, dtype: int64

#显性索引局部切片

In [28]: pop.loc['California':'New York']

Out[28]:

state       year

California  2000    23343

            2010    33453

New York    2000    56493

            2010    89392

Name: population, dtype: int64

In [29]: pop.loc['California':'New York','2000']

Out[29]:

state       year

California  2000    23343

New York    2000    56493

Name: population, dtype: int64

#显性+花哨索引取值

In [30]: pop.loc[['California','Texas']]

Out[30]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

Name: population, dtype: int64

#隐性索引局部切片

In [32]: pop.iloc[:2]

Out[32]:

state       year

California  2000    23343

            2010    33453

Name: population, dtype: int64

#布尔索引取值

In [33]: pop[pop > 30000]

Out[33]:

state       year

California  2010    33453

New York    2000    56493

            2010    89392

Texas       2000    78893

            2010    34569

Name: population, dtype: int64

DataFrame

In [1]: import pandas as pd

In [4]: import numpy as np

#行标签

In [2]: index = pd.MultiIndex.from_product([[2019,2020],[1,2,3,4]],

                                           names=['year','quarter'])

#列标签

In [3]: columns = pd.MultiIndex.from_product([['Bill','Andy','Coco'],

                                              ['Temp','HR']],names=['name','type'])

#获取模拟数据

In [12]: data = np.round(np.random.randn(8,6),1)

In [15]: data[:,::2] *10

Out[15]:

array([[  8.,  -5., -10.],

       [  4.,  22.,   4.],

       [ 14., -19., -18.],

       [  8.,  -1.,  -9.],

       [  2.,  -5.,   4.],

       [ 17.,   7.,   4.],

       [  0.,   3.,   2.],

       [  3.,  11., -11.]])

In [17]: data += 37

In [18]: data

Out[18]:

array([[45. , 35.4, 32. , 38.1, 27. , 37.7],

       [41. , 36.9, 59. , 35.4, 41. , 37.1],

       [51. , 37.1, 18. , 37.5, 19. , 36.7],

       [45. , 37.6, 36. , 37.6, 28. , 37.5],

       [39. , 35.9, 32. , 37.2, 41. , 38. ],

       [54. , 36.8, 44. , 38.7, 41. , 36.9],

       [37. , 37. , 40. , 35.1, 39. , 37.9],

       [40. , 37.5, 48. , 37.5, 26. , 35.7]])

#创建带行列标签的DataFrame

In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)

In [20]: health_data

Out[20]:

name          Bill        Andy        Coco

type          Temp    HR  Temp    HR  Temp    HR

year quarter

2019 1        45.0  35.4  32.0  38.1  27.0  37.7

     2        41.0  36.9  59.0  35.4  41.0  37.1

     3        51.0  37.1  18.0  37.5  19.0  36.7

     4        45.0  37.6  36.0  37.6  28.0  37.5

2020 1        39.0  35.9  32.0  37.2  41.0  38.0

     2        54.0  36.8  44.0  38.7  41.0  36.9

     3        37.0  37.0  40.0  35.1  39.0  37.9

     4        40.0  37.5  48.0  37.5  26.0  35.7

#直接根据标签名称索引获取值

In [21]: health_data['Bill']

Out[21]:

type          Temp    HR

year quarter

2019 1        45.0  35.4

     2        41.0  36.9

     3        51.0  37.1

     4        45.0  37.6

2020 1        39.0  35.9

     2        54.0  36.8

     3        37.0  37.0

     4        40.0  37.5

In [22]: health_data['Bill', 'HR']

Out[22]:

year  quarter

2019  1          35.4

      2          36.9

      3          37.1

      4          37.6

2020  1          35.9

      2          36.8

      3          37.0

      4          37.5

Name: (Bill, HR), dtype: float64

#隐性索引获取值

In [24]: health_data.iloc[:2,:2]

Out[24]:

name          Bill

type          Temp    HR

year quarter

2019 1        45.0  35.4

     2        41.0  36.9

#显性索引获取值

In [23]: health_data.loc[:,('Bill','HR')]

Out[23]:

year  quarter

2019  1          35.4

      2          36.9

      3          37.1

      4          37.6

2020  1          35.9

      2          36.8

      3          37.0

      4          37.5

Name: (Bill, HR), dtype: float64

#使用IndexSlice对象对显性索引进行分割

In [25]: idx = pd.IndexSlice

In [28]: health_data.loc[idx[:,1],idx[:,'HR']]

Out[28]:

name          Bill  Andy  Coco

type            HR    HR    HR

year quarter

2019 1        35.4  38.1  37.7

2020 1        35.9  37.2  38.0

In [29]: health_data.loc[idx[:,2],idx[:,'HR']]

Out[29]:

name          Bill  Andy  Coco

type            HR    HR    HR

year quarter

2019 2        36.9  35.4  37.1

2020 2        36.8  38.7  36.9

多级索引排序和转换

In [18]: frame = pd.DataFrame(np.arange(12).reshape((4,3)),

    ...:                     index=[list('aabb'),list('1212')],

    ...:                     columns = [['Ohio','Ohio','Colorado'],

    ...:                                 ['Green','Red','Green']])

In [19]: frame

Out[19]:

     Ohio     Colorado

    Green Red    Green

a 1     0   1        2

  2     3   4        5

b 1     6   7        8

  2     9  10       11

In [20]: frame.index.names = ['key1','key2']

In [21]: frame.columns.names = ['state','color']

In [22]: frame

Out[22]:

state      Ohio     Colorado

color     Green Red    Green

key1 key2

a    1        0   1        2

     2        3   4        5

b    1        6   7        8

     2        9  10       11

#交换Index

In [26]: frame.swaplevel('key1','key2')

Out[26]:

state      Ohio     Colorado

color     Green Red    Green

key2 key1

1    a        0   1        2

2    a        3   4        5

1    b        6   7        8

2    b        9  10       11

In [27]: frame

Out[27]:

state      Ohio     Colorado

color     Green Red    Green

key1 key2

a    1        0   1        2

     2        3   4        5

b    1        6   7        8

     2        9  10       11

In [28]: frame.sort_index(level=1)

Out[28]:

state      Ohio     Colorado

color     Green Red    Green

key1 key2

a    1        0   1        2

b    1        6   7        8

a    2        3   4        5

b    2        9  10       11

In [29]: frame.swaplevel(0,1).sort_index(level=0)

Out[29]:

state      Ohio     Colorado

color     Green Red    Green

key2 key1

1    a        0   1        2

     b        6   7        8

2    a        3   4        5

     b        9  10       11

多级索引行列转换和重置(stack、unstack、set_index、reset_index)

Series

In [6]: pop

Out[6]:

state       year

California  2000    23343

            2010    33453

Texas       2000    78893

            2010    34569

New York    2000    56493

            2010    89392

Name: population, dtype: int64

#通过level参数设置转换索引的层级

In [7]: pop.unstack(level=0)

Out[7]:

state  California  New York  Texas

year

2000        23343     56493  78893

2010        33453     89392  34569

In [8]: pop.unstack(level=1)

Out[8]:

year         2000   2010

state

California  23343  33453

New York    56493  89392

Texas       78893  34569

In [9]: pop.unstack(level='year')

Out[9]:

year         2000   2010

state

California  23343  33453

New York    56493  89392

Texas       78893  34569

In [10]: pop.unstack(level='year').stack(level='year')

Out[10]:

state       year

California  2000    23343

            2010    33453

New York    2000    56493

            2010    89392

Texas       2000    78893

            2010    34569

dtype: int64

#reset_index()把层级索引变成列标签

In [11]: pop_flat = pop.reset_index(name='population')

In [12]: pop_flat

Out[12]:

        state  year  population

0  California  2000       23343

1  California  2010       33453

2       Texas  2000       78893

3       Texas  2010       34569

4    New York  2000       56493

5    New York  2010       89392

#set_index()把列标签设置成层级索引

In [13]: pop_flat.set_index(['state','year'])

Out[13]:

                 population

state      year

California 2000       23343

           2010       33453

Texas      2000       78893

           2010       34569

New York   2000       56493

           2010       89392

DataFrame

In [30]: health_data

Out[30]:

name          Bill        Andy        Coco

type          Temp    HR  Temp    HR  Temp    HR

year quarter

2019 1        45.0  35.4  32.0  38.1  27.0  37.7

     2        41.0  36.9  59.0  35.4  41.0  37.1

     3        51.0  37.1  18.0  37.5  19.0  36.7

     4        45.0  37.6  36.0  37.6  28.0  37.5

2020 1        39.0  35.9  32.0  37.2  41.0  38.0

     2        54.0  36.8  44.0  38.7  41.0  36.9

     3        37.0  37.0  40.0  35.1  39.0  37.9

     4        40.0  37.5  48.0  37.5  26.0  35.7

#从列堆积到行：stack(level=*)，*表示索引的序号，用隐性索引则自上而下从0开始索引，用显性索引可直接赋值索引名称

#默认level=1

In [31]: health_data.stack()

Out[31]:

name               Andy  Bill  Coco

year quarter type

2019 1       HR    38.1  35.4  37.7

             Temp  32.0  45.0  27.0

     2       HR    35.4  36.9  37.1

             Temp  59.0  41.0  41.0

     3       HR    37.5  37.1  36.7

             Temp  18.0  51.0  19.0

     4       HR    37.6  37.6  37.5

             Temp  36.0  45.0  28.0

2020 1       HR    37.2  35.9  38.0

             Temp  32.0  39.0  41.0

     2       HR    38.7  36.8  36.9

             Temp  44.0  54.0  41.0

     3       HR    35.1  37.0  37.9

             Temp  40.0  37.0  39.0

     4       HR    37.5  37.5  35.7

             Temp  48.0  40.0  26.0

In [33]: health_data.stack(level=1)

Out[33]:

name               Andy  Bill  Coco

year quarter type

2019 1       HR    38.1  35.4  37.7

             Temp  32.0  45.0  27.0

     2       HR    35.4  36.9  37.1

             Temp  59.0  41.0  41.0

     3       HR    37.5  37.1  36.7

             Temp  18.0  51.0  19.0

     4       HR    37.6  37.6  37.5

             Temp  36.0  45.0  28.0

2020 1       HR    37.2  35.9  38.0

             Temp  32.0  39.0  41.0

     2       HR    38.7  36.8  36.9

             Temp  44.0  54.0  41.0

     3       HR    35.1  37.0  37.9

             Temp  40.0  37.0  39.0

     4       HR    37.5  37.5  35.7

             Temp  48.0  40.0  26.0

In [32]: health_data.stack(level=0)

Out[32]:

type                 HR  Temp

year quarter name

2019 1       Andy  38.1  32.0

             Bill  35.4  45.0

             Coco  37.7  27.0

     2       Andy  35.4  59.0

             Bill  36.9  41.0

             Coco  37.1  41.0

     3       Andy  37.5  18.0

             Bill  37.1  51.0

             Coco  36.7  19.0

     4       Andy  37.6  36.0

             Bill  37.6  45.0

             Coco  37.5  28.0

2020 1       Andy  37.2  32.0

             Bill  35.9  39.0

             Coco  38.0  41.0

     2       Andy  38.7  44.0

             Bill  36.8  54.0

             Coco  36.9  41.0

     3       Andy  35.1  40.0

             Bill  37.0  37.0

             Coco  37.9  39.0

     4       Andy  37.5  48.0

             Bill  37.5  40.0

             Coco  35.7  26.0

In [38]: health_data.stack(level='name')

Out[38]:

type                 HR  Temp

year quarter name

2019 1       Andy  38.1  32.0

             Bill  35.4  45.0

             Coco  37.7  27.0

     2       Andy  35.4  59.0

             Bill  36.9  41.0

             Coco  37.1  41.0

     3       Andy  37.5  18.0

             Bill  37.1  51.0

             Coco  36.7  19.0

     4       Andy  37.6  36.0

             Bill  37.6  45.0

             Coco  37.5  28.0

2020 1       Andy  37.2  32.0

             Bill  35.9  39.0

             Coco  38.0  41.0

     2       Andy  38.7  44.0

             Bill  36.8  54.0

             Coco  36.9  41.0

     3       Andy  35.1  40.0

             Bill  37.0  37.0

             Coco  37.9  39.0

     4       Andy  37.5  48.0

             Bill  37.5  40.0

             Coco  35.7  26.0

#从行散开到列：unstack(level=*)，*表示索引的序号，用隐性索引则自左而右从0开始索引，用显性索引可直接赋值索引名称

#默认level=1

In [35]: health_data.unstack(level=0)

Out[35]:

name     Bill                    Andy                    Coco

type     Temp          HR        Temp          HR        Temp          HR

year     2019  2020  2019  2020  2019  2020  2019  2020  2019  2020  2019  2020

quarter

1        45.0  39.0  35.4  35.9  32.0  32.0  38.1  37.2  27.0  41.0  37.7  38.0

2        41.0  54.0  36.9  36.8  59.0  44.0  35.4  38.7  41.0  41.0  37.1  36.9

3        51.0  37.0  37.1  37.0  18.0  40.0  37.5  35.1  19.0  39.0  36.7  37.9

4        45.0  40.0  37.6  37.5  36.0  48.0  37.6  37.5  28.0  26.0  37.5  35.7

In [36]: health_data.unstack(level=1)

Out[36]:

name     Bill                                            Andy  ...        Coco

type     Temp                      HR                    Temp  ...    HR  Temp                      HR

quarter     1     2     3     4     1     2     3     4     1  ...     4     1     2     3     4     1     2     3     4

year                                                           ...

2019     45.0  41.0  51.0  45.0  35.4  36.9  37.1  37.6  32.0  ...  37.6  27.0  41.0  19.0  28.0  37.7  37.1  36.7  37.5

2020     39.0  54.0  37.0  40.0  35.9  36.8  37.0  37.5  32.0  ...  37.5  41.0  41.0  39.0  26.0  38.0  36.9  37.9  35.7

[2 rows x 24 columns]

In [39]: health_data.unstack(level='year')

Out[39]:

name     Bill                    Andy                    Coco

type     Temp          HR        Temp          HR        Temp          HR

year     2019  2020  2019  2020  2019  2020  2019  2020  2019  2020  2019  2020

quarter

1        45.0  39.0  35.4  35.9  32.0  32.0  38.1  37.2  27.0  41.0  37.7  38.0

2        41.0  54.0  36.9  36.8  59.0  44.0  35.4  38.7  41.0  41.0  37.1  36.9

3        51.0  37.0  37.1  37.0  18.0  40.0  37.5  35.1  19.0  39.0  36.7  37.9

4        45.0  40.0  37.6  37.5  36.0  48.0  37.6  37.5  28.0  26.0  37.5  35.7

多级索引数据统计方法

In [23]: health_data

Out[23]:

name          Bill        Andy        Coco

type          Temp    HR  Temp    HR  Temp    HR

year quarter

2019 1        37.1  35.3  37.2  37.5  36.2  37.6

     2        36.9  37.0  37.8  35.9  37.5  36.6

     3        36.8  34.7  35.4  37.5  37.0  38.2

     4        36.8  37.6  36.2  37.4  36.4  36.7

2020 1        37.5  34.8  36.8  36.5  36.3  37.4

     2        36.2  36.7  37.3  37.2  36.3  36.4

     3        37.7  38.1  36.9  37.7  36.6  36.1

     4        36.5  36.3  37.6  36.3  35.5  37.3

#level参数设置累计的索引层级

In [24]: data_mean = health_data.mean(level='year')

In [25]: data_mean

Out[25]:

name    Bill           Andy            Coco

type    Temp      HR   Temp      HR    Temp      HR

year

2019  36.900  36.150  36.65  37.075  36.775  37.275

2020  36.975  36.475  37.15  36.925  36.175  36.800

#axis参数设置累计的轴

In [26]: data_mean.mean(axis=1,level='type')

Out[26]:

type       Temp         HR

year

2019  36.775000  36.833333

2020  36.766667  36.733333

Pandas选择与索引的更多相关文章

如何选择普通索引和唯一索引《死磕MySQL系列五》
系列文章一.原来一条select语句在MySQL是这样执行的<死磕MySQL系列一> 二.一生挚友redo log.binlog<死磕MySQL系列二> 三.MySQL强 ...
pandas选择数据-【老鱼学pandas】
选择列根据列名来选择某列的数据 import pandas as pd import numpy as np dates = pd.date_range("2017-01-08" ...
pandas选择单元格，选择行列
首先创建示例df: df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'), index=list('5678')) d ...
pandas 之多层索引
In many applications, data may be spread across a number of files or datasets or be arranged in a fo ...
numpy和pandas的基础索引切片
Numpy的索引切片索引 In [72]: arr = np.array([[[1,1,1],[2,2,2]],[[3,3,3],[4,4,4]]]) In [73]: arr Out[73]: a ...
利用Python进行数据分析(11) pandas基础: 层次化索引
层次化索引层次化索引指你能在一个数组上拥有多个索引,例如: 有点像Excel里的合并单元格对么? 根据索引选择数据子集以外层索引的方式选择数据子集: 以内层索引的方式选择数据: 多重索引S ...
深入学习jQuery选择器系列第三篇——过滤选择器之索引选择器
× 目录 [1]通用形式 [2]首尾索引 [3]奇偶索引[4]范围索引前面的话上一篇介绍了过滤选择器中的子元素选择器部分,本文开始介绍极易与之混淆的索引选择器通用形式 $(':eq(index) ...
但从谈论性能点SQL Server选择聚集索引键
简单介绍在SQL Server中,数据是按页进行存放的.而为表加上聚集索引后,SQL Server对于数据的查找就是依照聚集索引的列作为keyword进行了. 因此对于聚集索引的选择对性能的影响就变 ...
pandas 选择列或者添加列生成新的DataFrame
选择某些列 import pandas as pd # 从Excel中读取数据,生成DataFrame数据 # 导入Excel路径和sheet name df = pd.read_excel(exce ...
pandas基础用法——索引
# -*- coding: utf-8 -*- # Time : 2016/11/28 15:14 # Author : XiaoDeng # version : python3.5 # Softwa ...

随机推荐

【Azure 应用服务】App Service 默认页面暴露Tomcat版本信息，存在安全风险
问题描述在创建Azure App Service时,服务端的配置使用Java 8 + Tomcat 8.5.默认的根目录页面显示出App Service Tomcat版本信息,存在一定的安全隐患. ...
【Azure 应用服务】Web.config中设置域名访问限制，IP地址限制访问特定的页面资源 (Rewrite)
问题描述问题一:web app已经绑定了域名,例如是www.a.com,现在只允许使用www.a.com 访问,如果使用默认的域名xxxx.chinacloundsites.cn访问的时候,需要显示 ...
Android drawable与mipmap的区别（android资源文件放置位置）
1.Drawable Android 把可绘制的对象抽象为Drawable,不同的图形图像代表着不同的darwable类型, 通常我们在代码中不会直接接触drawable实现类的,是由android ...
nginx判断是否手机访问
if ( $http_user_agent ~* "(Android|iPhone|Windows Phone|UC|Kindle|MicroMessenger |iPad)" ) ...
windows下如何结束Tomcat进程
问题描述: 使用IDEA启动java中的SSM项目之后,服务正常运行.操作过程中不小心把IDEA 开发工具给关闭啦,导致tomcat没有正常停止,使用的端口8080仍然被占用.再次打开IDEA,启动 ...
MP实现ID自增 &日志 &自动填充 & 逻辑删除 &乐观锁
默认ID自增 mybatis-plus.configuration.log-impl=org.apache.ibatis.logging.stdout.StdOutImpl 此时的依赖 <dep ...
Template String Converter - 字符串中加变量自动将单引号变换 - vscode插件
Template String Converter - 字符串中加变量自动将单引号变换 - vscode插件
koa 放置前台打包dist 目录
const Koa = require('koa'); const app = new Koa() //public是指的放置前端静态资源的文件夹名称 app.use(require('koa-sta ...
baidu 突然打不开了 20210621
换了个dns 123.125.81.6 114.114.114.114
Android WebView获取html源码
通过执行js语句来获取 val code = """ document.documentElement.outerHTML """.trim ...

Pandas选择与索引

字典形式选择和索引

数组形式选择和索引

索引选择器（iloc 和 loc）

iloc：隐私索引（顺序整数）

loc：显示索引（自定义）

多级索引应用(Hierarchical Indexing)

多级索引创建

多级索引取值与切片

Series

DataFrame

多级索引排序和转换

多级索引行列转换和重置(stack、unstack、set_index、reset_index)

Series

DataFrame

多级索引数据统计方法

Pandas选择与索引的更多相关文章

随机推荐

热门专题