Pandas Quick Start (Part 1): Basic Operations

This article walks through a set of basic Pandas operations so that you can become productive with the library quickly.

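Some familiarity with NumPy is recommended. If you are not yet comfortable with NumPy, it is worth reading the companion article on basic NumPy operations first.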

Pandas Data Structures

Pandas has two core data structures: Series and DataFrame.

Series

A Series is a one-dimensional array-like object that holds a sequence of values together with an associated index.

 import numpy as np

 import pandas as pd

 obj = pd.Series([6, 66, 666, 6666])

 obj

0       6

1      66

2     666

3    6666

dtype: int64

When no index is given, it defaults to the integers 0 through N - 1. The values and the index of a Series can be accessed separately:

 obj.values

 obj.index  # similar to range(4)

array([   6,   66,  666, 6666])

RangeIndex(start=0, stop=4, step=1)

The index can also be specified as labels:

 obj2 = pd.Series([6, 66, 666, 6666], index=['d', 'b', 'a', 'c'])

 obj2

 obj2.index

d       6

b      66

a     666

c    6666

dtype: int64

Index(['d', 'b', 'a', 'c'], dtype='object')

Values in a Series can be accessed by label, which makes a Series behave somewhat like a cross between a NumPy array and a dict.

 obj2['a']

 obj2['d'] = 66666

 obj2[['c', 'a', 'd']]

666

c     6666

a      666

d    66666

dtype: int64

A Series supports many NumPy-style operations:

 obj2[obj2 > 100]

 obj2 / 2

 np.sqrt(obj2)

d    66666

a      666

c     6666

dtype: int64

d    33333.0

b       33.0

a      333.0

c     3333.0

dtype: float64

d    258.197599

b      8.124038

a     25.806976

c     81.645576

dtype: float64

Check whether a label exists in the index:

 'b' in obj2

 'e' in obj2

True

False

A dict can be passed directly to Series to create a Series object:

 sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

 obj3 = pd.Series(sdata)

 obj3

 states = ['Texas', 'California', 'Ohio', 'Oregon']

 obj4 = pd.Series(sdata, index=states)  # explicit index labels and their order

 obj4

Ohio      35000

Texas     71000

Oregon    16000

Utah       5000

dtype: int64

Texas         71000.0

California        NaN

Ohio          35000.0

Oregon        16000.0

dtype: float64

isnull and notnull detect missing values. Both are available as top-level Pandas functions and as Series methods:

 pd.isnull(obj4)

 pd.notnull(obj4)

 obj4.isnull()

 obj4.notnull()

Texas         False

California     True

Ohio          False

Oregon        False

dtype: bool

Texas          True

California    False

Ohio           True

Oregon         True

dtype: bool

Texas         False

California     True

Ohio          False

Oregon        False

dtype: bool

Texas          True

California    False

Ohio           True

Oregon         True

dtype: bool

Arithmetic between Series aligns the data by index label, similar to a join in a database:

 obj3

 obj4

 obj3 + obj4

Ohio      35000

Texas     71000

Oregon    16000

Utah       5000

dtype: int64

Texas         71000.0

California        NaN

Ohio          35000.0

Oregon        16000.0

dtype: float64

California         NaN

Ohio           70000.0

Oregon         32000.0

Texas         142000.0

Utah               NaN

dtype: float64
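The alignment shown above behaves like an outer join on the index labels. As a minimal sketch, using two small made-up Series (s1 and s2 are illustrative names, not part of the article's data), labels that appear in only one operand end up as NaN in the result:

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2            # result index is the union of both label sets
matched = total.dropna()   # keep only labels present in both operands

print(total)
print(matched)
```

Dropping the NaN entries afterwards is one way to get the effect of an inner join.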

Both a Series and its index have a name attribute:

 obj4.name = 'population'

 obj4.index.name = 'state'

 obj4

state

Texas         71000.0

California        NaN

Ohio          35000.0

Oregon        16000.0

Name: population, dtype: float64

The index of a Series can be modified in place by assignment:

 obj

 obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

 obj

0       6

1      66

2     666

3    6666

dtype: int64

Bob         6

Steve      66

Jeff      666

Ryan     6666

dtype: int64

DataFrame

A DataFrame represents a rectangular table of data. It contains an ordered collection of columns, and each column can hold a different data type.

A DataFrame has both a row index and a column index. It can be thought of as a dict of Series that all share the same row index.

A common way to create one is from a dict of equal-length lists (or NumPy arrays):

 # every value is a list of length 6

 data = {

     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],

     'year': [2000, 2001, 2002, 2001, 2002, 2003],

     'pop': [2.5, 1.7, 3.6, 2.4, 2.9, 3.2],

 }

 frame = pd.DataFrame(data)

 frame

	state	year	pop

0	Ohio	2000	2.5

1	Ohio	2001	1.7

2	Ohio	2002	3.6

3	Nevada	2001	2.4

4	Nevada	2002	2.9

5	Nevada	2003	3.2

The head method selects the first few rows (five by default):

 frame.head()

 frame.head(2)

	state	year	pop

0	Ohio	2000	2.5

1	Ohio	2001	1.7

2	Ohio	2002	3.6

3	Nevada	2001	2.4

4	Nevada	2002	2.9
        state	year	pop

0	Ohio	2000	2.5

1	Ohio	2001	1.7

If a sequence of columns is specified, the columns are displayed in that order:

 pd.DataFrame(data, columns=['year', 'state', 'pop'])

	year	state	pop

0	2000	Ohio	2.5

1	2001	Ohio	1.7

2	2002	Ohio	3.6

3	2001	Nevada	2.4

4	2002	Nevada	2.9

5	2003	Nevada	3.2

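An index can be specified in the same way. A column that is not present in the data, such as 'debt' here, is filled with missing values: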

 frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],

                      index=['one', 'two', 'three', 'four', 'five', 'six'])

 frame2

 frame2.columns

 frame2.index

	year	state	pop	debt

one	2000	Ohio	2.5	NaN

two	2001	Ohio	1.7	NaN

three	2002	Ohio	3.6	NaN

four	2001	Nevada	2.4	NaN

five	2002	Nevada	2.9	NaN

six	2003	Nevada	3.2	NaN

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

A column of a DataFrame can be retrieved by key or by attribute, and the result is a Series:

 frame2['state']

 frame2.year

 type(frame2.year)

one        Ohio

two        Ohio

three      Ohio

four     Nevada

five     Nevada

six      Nevada

Name: state, dtype: object

one      2000

two      2001

three    2002

four     2001

five     2002

six      2003

Name: year, dtype: int64

pandas.core.series.Series

Rows can be retrieved by label through the loc attribute:

 frame2.loc['three']

 type(frame2.loc['three'])

year     2002

state    Ohio

pop       3.6

debt      NaN

Name: three, dtype: object

pandas.core.series.Series

Assign to the 'debt' column, first with a scalar and then with an array:

 frame2['debt'] = 16.5

 frame2

 frame2['debt'] = np.arange(6.)

 frame2

	year	state	pop	debt

one	2000	Ohio	2.5	16.5

two	2001	Ohio	1.7	16.5

three	2002	Ohio	3.6	16.5

four	2001	Nevada	2.4	16.5

five	2002	Nevada	2.9	16.5

six	2003	Nevada	3.2	16.5

	year	state	pop	debt

one	2000	Ohio	2.5	0.0

two	2001	Ohio	1.7	1.0

three	2002	Ohio	3.6	2.0

four	2001	Nevada	2.4	3.0

five	2002	Nevada	2.9	4.0

six	2003	Nevada	3.2	5.0

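A Series can also be assigned to a DataFrame column. Its labels are aligned with the DataFrame's index, and rows without a matching label get NaN: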

 val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

 frame2['debt'] = val

 frame2

	year	state	pop	debt

one	2000	Ohio	2.5	NaN

two	2001	Ohio	1.7	-1.2

three	2002	Ohio	3.6	NaN

four	2001	Nevada	2.4	-1.5

five	2002	Nevada	2.9	-1.7

six	2003	Nevada	3.2	NaN

Assigning to a column that does not exist creates a new column:

 frame2['eastern'] = frame2.state == 'Ohio'

 frame2

	year	state	pop	debt	eastern

one	2000	Ohio	2.5	NaN	True

two	2001	Ohio	1.7	-1.2	True

three	2002	Ohio	3.6	NaN	True

four	2001	Nevada	2.4	-1.5	False

five	2002	Nevada	2.9	-1.7	False

six	2003	Nevada	3.2	NaN	False

Delete a column:

 del frame2['eastern']

 frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

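A column selected from a DataFrame can be a view on the underlying data rather than a copy, so in-place modifications to it may show up in the DataFrame (the exact behavior depends on the pandas version and its copy-on-write setting). Use the copy method explicitly whenever an independent copy is needed.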
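Because of this, a defensive habit is to call copy whenever a selected column will be modified on its own. The sketch below uses a small hypothetical frame (df_demo is not from the article) and demonstrates only the explicit-copy side, which should behave the same regardless of how a given pandas version treats views:

```python
import pandas as pd

# Hypothetical frame used only for this illustration
df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

col = df_demo['x'].copy()   # explicit, independent copy of the column
col.iloc[0] = 100           # modify only the copy

print(col.tolist())            # the copy has changed
print(df_demo['x'].tolist())   # the original column is untouched
```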

Passing a nested dict: the outer keys become the columns and the inner keys become the row index:

 pop = {

     'Nevada': {

         2001: 2.4,

         2002: 2.9,

     },

     'Ohio': {

         2000: 1.5,

         2001: 1.7,

         2002: 3.6,

     }

 }

 frame3 =  pd.DataFrame(pop)

 frame3

	Nevada	Ohio

2000	NaN	1.5

2001	2.4	1.7

2002	2.9	3.6

Transposing a DataFrame:

 frame3.T

	2000	2001	2002

Nevada	NaN	2.4	2.9

Ohio	1.5	1.7	3.6

The values attribute returns the data in a DataFrame as an ndarray:

 frame3.values

array([[nan, 1.5],

       [2.4, 1.7],

       [2.9, 3.6]])

Index Objects

Pandas Index objects hold the axis labels and other metadata (such as the axis name).

 obj = pd.Series(range(3), index=['a', 'b', 'c'])

 index = obj.index

 index

 index.name = 'alpha'

 index[1:]

Index(['a', 'b', 'c'], dtype='object')

Index(['b', 'c'], dtype='object', name='alpha')

Index objects are immutable, so they cannot be modified:

 index[1] = 'd'  # raises TypeError
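To see the failure without interrupting a session, the assignment can be wrapped in try/except. The following is a small sketch on a freshly built Index. The second half, which builds a replacement Index out of slices, is just one possible workaround and is not taken from the article:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

raised = False
try:
    index[1] = 'd'              # item assignment is not supported
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# To "change" a label, build a new Index from pieces of the old one
new_index = index[:1].append(pd.Index(['d'])).append(index[2:])
print(new_index)
```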

An Index behaves like a fixed-size set:

 frame3

 frame3.columns

 frame3.index

 # set-like membership tests

 'Nevada' in frame3.columns

 2000 in frame3.index

Nevada	Ohio

2000	NaN	1.5

2001	2.4	1.7

2002	2.9	3.6

Index(['Nevada', 'Ohio'], dtype='object')

Int64Index([2000, 2001, 2002], dtype='int64')

True

True

Unlike a set, however, an Index can contain duplicate labels:

 dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

 dup_labels

 set_labels = set(['foo', 'foo', 'bar', 'bar'])

 set_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

{'bar', 'foo'}

Some commonly used Index methods and properties:

append
difference
intersection
union
isin
delete
drop
insert
is_monotonic
is_unique
unique
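A few of the methods listed above can be tried on two small, made-up Index objects (left and right are illustrative names). Each call returns a new object and leaves the original Index unchanged:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.union(right)            # labels found in either index
common = left.intersection(right)   # labels found in both
only_left = left.difference(right)  # labels found only in left
mask = left.isin(['a', 'c'])        # boolean array, one entry per label
inserted = left.insert(1, 'x')      # new Index with 'x' at position 1
dropped = left.drop('b')            # new Index without 'b'

print(both, common, only_left, mask, inserted, dropped, left.is_unique)
```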

Essential Pandas Functionality

Reindexing

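reindex creates a new object whose data is rearranged to match the new index. Labels that were not present before get missing values.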

 obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

 obj

 obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

 obj2

d    4.5

b    7.2

a   -5.3

c    3.6

dtype: float64

a   -5.3

b    7.2

c    3.6

d    4.5

e    NaN

dtype: float64

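The method option fills in values while reindexing. With ffill, each new label takes the last valid value before it (forward fill):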

 obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

 obj3

 obj3.reindex(range(6), method='ffill')

0      blue

2    purple

4    yellow

dtype: object

0      blue

1      blue

2    purple

3    purple

4    yellow

5    yellow

dtype: object
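For comparison, the sketch below reindexes the same kind of Series in three ways. The bfill method and the fill_value argument are not used in the article; they are included here as related options of reindex:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = colors.reindex(range(6), method='ffill')      # carry the previous value forward
backward = colors.reindex(range(6), method='bfill')     # pull the next value backward
constant = colors.reindex(range(6), fill_value='n/a')   # use a fixed placeholder

print(forward)
print(backward)   # label 5 has no later value, so it stays missing
print(constant)
```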

Both the row index and the columns can be reindexed:

 frame = pd.DataFrame(np.arange(9).reshape((3, 3)),

                      index=['a', 'c', 'd'],

                      columns=['Ohio', 'Texas', 'California'])

 frame

 frame.reindex(['a', 'b', 'c', 'd'])

 frame.reindex(columns=['Texas', 'Utah', 'California'])

	Ohio	Texas	California

a	0	1	2

c	3	4	5

d	6	7	8

	Ohio	Texas	California

a	0.0	1.0	2.0

b	NaN	NaN	NaN

c	3.0	4.0	5.0

d	6.0	7.0	8.0

	Texas	Utah	California

a	1	NaN	2

c	4	NaN	5

d	7	NaN	8

Dropping Entries from an Axis

The drop method removes rows or columns and returns a new object:

 obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

 obj

 new_obj = obj.drop('c')

 new_obj

 obj.drop(['d', 'c'])

a    0.0

b    1.0

c    2.0

d    3.0

e    4.0

dtype: float64

a    0.0

b    1.0

d    3.0

e    4.0

dtype: float64

a    0.0

b    1.0

e    4.0

dtype: float64

 data = pd.DataFrame(np.arange(16).reshape((4, 4)),

                     index=['Ohio', 'Colorado', 'Utah', 'New York'],

                     columns=['one', 'two', 'three', 'four'])

 data

 data.drop(['Colorado', 'Ohio'])

 data.drop('two', axis=1)

 data.drop(['two', 'four'], axis='columns')

	one	two	three	four

Ohio	0	1	2	3

Colorado	4	5	6	7

Utah	8	9	10	11

New York	12	13	14	15

	one	two	three	four

Utah	8	9	10	11

New York	12	13	14	15

	one	three	four

Ohio	0	2	3

Colorado	4	6	7

Utah	8	10	11

New York	12	14	15

	one	three

Ohio	0	2

Colorado	4	6

Utah	8	10

New York	12	14

Dropping in place:

 obj

 obj.drop('c', inplace=True)

 obj

a    0.0

b    1.0

c    2.0

d    3.0

e    4.0

dtype: float64

a    0.0

b    1.0

d    3.0

e    4.0

dtype: float64

Indexing, Selection, and Filtering

Series indexing works like NumPy array indexing, except that index labels can be used as well as integer positions.

 obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

 obj

 obj['b']

 obj[1]  # integer position; recent pandas versions prefer obj.iloc[1]

 obj[2:4]

 obj[['b', 'a', 'd']]

 obj[[1, 3]]  # integer positions; recent pandas versions prefer obj.iloc[[1, 3]]

 obj[obj < 2]

a    0.0

b    1.0

c    2.0

d    3.0

dtype: float64

1.0

1.0

c    2.0

d    3.0

dtype: float64

b    1.0

a    0.0

d    3.0

dtype: float64

b    1.0

d    3.0

dtype: float64

a    0.0

b    1.0

dtype: float64

Slicing a Series with labels differs from ordinary Python slicing: the end point is included (Python slices are half-open and exclude the end).

 obj['b':'c']

b    1.0

c    2.0

dtype: float64
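The difference is easiest to see by putting a label slice next to a positional slice. This is a minimal sketch on a fresh Series (named s here to avoid clashing with the article's obj):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = s.loc['b':'c']    # label slice: both end points are included
by_position = s.iloc[1:2]    # positional slice: the end is excluded, as in Python

print(by_label)
print(by_position)
```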

Assigning with a slice:

 obj['b':'c'] = 5

 obj

a    0.0

b    5.0

c    5.0

d    3.0

dtype: float64

Selecting one or more columns of a DataFrame:

 data = pd.DataFrame(np.arange(16).reshape((4, 4)),

                     index=['Ohio', 'Colorado', 'Utah', 'New York'],

                     columns=['one', 'two', 'three', 'four'])

 data

 data['two']

 type(data['two'])

 data[['three', 'one']]

 type(data[['three', 'one']])

	one	two	three	four

Ohio	0	1	2	3

Colorado	4	5	6	7

Utah	8	9	10	11

New York	12	13	14	15

Ohio         1

Colorado     5

Utah         9

New York    13

Name: two, dtype: int64

pandas.core.series.Series

	three	one

Ohio	2	0

Colorado	6	4

Utah	10	8

New York	14	12

pandas.core.frame.DataFrame

Selecting rows of a DataFrame with a slice or a boolean array:

 data[:2]

 data[data['three'] > 5]

	one	two	three	four

Ohio	0	1	2	3

Colorado	4	5	6	7

        one	two	three	four

Colorado	4	5	6	7

Utah	8	9	10	11

New York	12	13	14	15

Syntactically, a DataFrame resembles a two-dimensional NumPy array, so it can also be indexed with a boolean DataFrame:

 data < 5

 data[data < 5] = 0

 data

	one	two	three	four

Ohio	True	True	True	True

Colorado	True	False	False	False

Utah	False	False	False	False

New York	False	False	False	False

        one	two	three	four

Ohio	0	0	0	0

Colorado	0	5	6	7

Utah	8	9	10	11

New York	12	13	14	15

Selection with loc and iloc

loc and iloc select rows and columns of a DataFrame, by label and by integer position respectively.

 data.loc['Colorado', ['two', 'three']]

two      5

three    6

Name: Colorado, dtype: int64

 data.iloc[2, [3, 0, 1]]

 data.iloc[2]

 data.iloc[[1, 2], [3, 0, 1]]

four    11

one      8

two      9

Name: Utah, dtype: int64

one       8

two       9

three    10

four     11

Name: Utah, dtype: int64

        four	one	two

Colorado	7	0	5

Utah	11	8	9

Slices work as well:

 data.loc[:'Utah', 'two']

 data.iloc[:, :3][data.three > 5]

Ohio        0

Colorado    5

Utah        9

Name: two, dtype: int64

        one	two	three

Colorado	0	5	6

Utah	8	9	10

New York	12	13	14

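Common DataFrame indexing options:

df[val]: select a single column or a sequence of columns
df.loc[val]: select rows by label
df.loc[:, val]: select columns by label
df.loc[val1, val2]: select rows and columns by label
df.iloc[where]: select rows by integer position
df.iloc[:, where]: select columns by integer position
df.iloc[where_i, where_j]: select rows and columns by integer position
df.at[label_i, label_j]: select a single scalar by row and column label
df.iat[i, j]: select a single scalar by row and column position
reindex: select rows or columns by label
get_value, set_value: select or set a single scalar by label (removed in later pandas releases; use at and iat instead)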
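The options in the list can be compared on one small frame. The sketch below rebuilds the 4x4 example table under a new name (df) and pairs each label-based selection with its position-based equivalent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

col = df['two']                                      # one column -> Series
row = df.loc['Utah']                                 # one row by label -> Series
block = df.loc[['Ohio', 'Utah'], ['one', 'four']]    # rows and columns by label
row_pos = df.iloc[2]                                 # the same row by position
block_pos = df.iloc[[0, 2], [0, 3]]                  # the same block by position
scalar_label = df.at['Utah', 'three']                # single value by label
scalar_pos = df.iat[2, 2]                            # single value by position

print(block)
print(scalar_label, scalar_pos)
```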

Arithmetic and Data Alignment

Alignment in Series arithmetic was covered earlier. For a DataFrame, alignment happens on both the rows and the columns:

 df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),

                    index=['Ohio', 'Texas', 'Colorado'])

 df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),

                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

 df1

 df2

 df1 + df2

	b	c	d

Ohio	0.0	1.0	2.0

Texas	3.0	4.0	5.0

Colorado	6.0	7.0	8.0

        b	d	e

Utah	0.0	1.0	2.0

Ohio	3.0	4.0	5.0

Texas	6.0	7.0	8.0

Oregon	9.0	10.0	11.0

        b	c	d	e

Colorado	NaN	NaN	NaN	NaN

Ohio	3.0	NaN	6.0	NaN

Oregon	NaN	NaN	NaN	NaN

Texas	9.0	NaN	12.0	NaN

Utah	NaN	NaN	NaN	NaN

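When the operands have no column labels in common (the row labels do match here), every value in the result is NaN: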

 df1 = pd.DataFrame({'A': [1, 2]})

 df2 = pd.DataFrame({'B': [3, 4]})

 df1

 df2

 df1 - df2

A

0	1

1	2

B

0	3

1	4

A	B

0	NaN	NaN

1	NaN	NaN

Arithmetic Methods with Fill Values

Arithmetic between differently indexed objects produces missing values wherever the labels do not overlap:

 df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),

                    columns=list('abcd'))

 df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),

                    columns=list('abcde'))

 df2.loc[1, 'b'] = np.nan

 df1

 df2

 df1 + df2

	a	b	c	d

0	0.0	1.0	2.0	3.0

1	4.0	5.0	6.0	7.0

2	8.0	9.0	10.0	11.0

        a	b	c	d	e

0	0.0	1.0	2.0	3.0	4.0

1	5.0	NaN	7.0	8.0	9.0

2	10.0	11.0	12.0	13.0	14.0

3	15.0	16.0	17.0	18.0	19.0

        a	b	c	d	e

0	0.0	2.0	4.0	6.0	NaN

1	9.0	NaN	13.0	15.0	NaN

2	18.0	20.0	22.0	24.0	NaN

3	NaN	NaN	NaN	NaN	NaN

The fill_value argument of the arithmetic methods substitutes a value for entries that are missing on one side:

 df1.add(df2, fill_value=0)

        a	b	c	d	e

0	0.0	2.0	4.0	6.0	4.0

1	9.0	5.0	13.0	15.0	9.0

2	18.0	20.0	22.0	24.0	14.0

3	15.0	16.0	17.0	18.0	19.0

Commonly used arithmetic methods (each accepts fill_value):

add, radd
sub, rsub
div, rdiv
floordiv, rfloordiv
mul, rmul
pow, rpow
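Each method in the list has an r-prefixed counterpart with the operands swapped. As a brief sketch on a made-up frame (df here is illustrative), df.rdiv(1) computes 1 / df and df.rsub(10) computes 10 - df:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 4.0], 'b': [5.0, 8.0, 10.0]})

inverse = df.rdiv(1)      # same as 1 / df
remainder = df.rsub(10)   # same as 10 - df

print(inverse)
print(remainder)
```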

Operations between DataFrame and Series

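First, consider subtracting a one-dimensional array from a two-dimensional array. NumPy broadcasts the subtraction across every row: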

 arr = np.arange(12.).reshape((3, 4))

 arr

 arr[0]

 arr - arr[0]

array([[ 0.,  1.,  2.,  3.],

       [ 4.,  5.,  6.,  7.],

       [ 8.,  9., 10., 11.]])

array([0., 1., 2., 3.])

array([[0., 0., 0., 0.],

       [4., 4., 4., 4.],

       [8., 8., 8., 8.]])

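By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.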

 frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),

                      columns=list('bde'),

                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

 series = frame.iloc[0]

 frame

 series

 frame - series

        b	d	e

Utah	0.0	1.0	2.0

Ohio	3.0	4.0	5.0

Texas	6.0	7.0	8.0

Oregon	9.0	10.0	11.0

b    0.0

d    1.0

e    2.0

Name: Utah, dtype: float64

        b	d	e

Utah	0.0	0.0	0.0

Ohio	3.0	3.0	3.0

Texas	6.0	6.0	6.0

Oregon	9.0	9.0	9.0

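If a label is not found in both the DataFrame's columns and the Series' index, the objects are reindexed to the union of the labels, as in an outer join: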

 series2 = pd.Series(range(3), index=['b', 'e', 'f'])

 frame + series2

        b	d	e	f

Utah	0.0	NaN	3.0	NaN

Ohio	3.0	NaN	6.0	NaN

Texas	6.0	NaN	9.0	NaN

Oregon	9.0	NaN	12.0	NaN

To broadcast over the columns instead, matching on the rows, use one of the arithmetic methods introduced earlier and pass axis=0:

 series3 = frame['d']

 frame

 series3

 frame.sub(series3, axis=0)

	b	d	e

Utah	0.0	1.0	2.0

Ohio	3.0	4.0	5.0

Texas	6.0	7.0	8.0

Oregon	9.0	10.0	11.0

Utah       1.0

Ohio       4.0

Texas      7.0

Oregon    10.0

Name: d, dtype: float64

        b	d	e

Utah	-1.0	0.0	1.0

Ohio	-1.0	0.0	1.0

Texas	-1.0	0.0	1.0

Oregon	-1.0	0.0	1.0
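A typical use of the two matching directions is centering data. The sketch below, which assumes the same 4x3 frame as above, subtracts column means with the default behavior and row means with sub and axis=0 (centering is used purely as an illustration and is not part of the article):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched against the columns,
# so each column has its own mean subtracted.
col_centered = frame - frame.mean()

# axis=0: the Series index is matched against the rows,
# so each row has its own mean subtracted.
row_centered = frame.sub(frame.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```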

apply 和 map

NumPy universal functions (element-wise array functions) also work on Pandas objects:

 frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),

                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

 frame

 np.abs(frame)

	b	d	e

Utah	-0.590458	-0.352861	0.820549

Ohio	-1.708280	0.174739	-1.081811

Texas	0.857712	0.246972	0.532208

Oregon	0.812756	1.260538	0.818304

        b	d	e

Utah	0.590458	0.352861	0.820549

Ohio	1.708280	0.174739	1.081811

Texas	0.857712	0.246972	0.532208

Oregon	0.812756	1.260538	0.818304

Another frequent operation is applying a function to each column or each row. The apply method does this:

 f = lambda x: x.max() - x.min()

 frame.apply(f)

b    2.565993

d    1.613398

e    1.902360

dtype: float64

 frame.apply(f, axis=1)

Utah      1.411007

Ohio      1.883019

Texas     0.610741

Oregon    0.447782

dtype: float64

The function passed to apply does not have to return a scalar. It can also return a Series:

 def f(x):

     return pd.Series([x.min(), x.max()], index=['min', 'max'])

 frame.apply(f)

                b	        d	        e

min	-1.708280	-0.352861	-1.081811

max	0.857712	1.260538	0.820549

Element-wise functions can be used as well:

 frame

 format = lambda x: '%.2f' % x

 frame.applymap(format)

        b	        d	        e

Utah	-0.590458	-0.352861	0.820549

Ohio	-1.708280	0.174739	-1.081811

Texas	0.857712	0.246972	0.532208

Oregon	0.812756	1.260538	0.818304

        b	d	e

Utah	-0.59	-0.35	0.82

Ohio	-1.71	0.17	-1.08

Texas	0.86	0.25	0.53

Oregon	0.81	1.26	0.82

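To apply a function element by element to a DataFrame, use applymap. The name comes from the Series map method, which applies a function to each element. In recent pandas releases applymap is deprecated in favor of DataFrame.map.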

 frame['e'].map(format)

Utah       0.82

Ohio      -1.08

Texas      0.53

Oregon     0.82

Name: e, dtype: object
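Series.map accepts more than formatting lambdas. As a small sketch on made-up state codes (the codes Series and the lookup dict are hypothetical), it can take any one-argument function, and it can also take a dict, in which case each value is looked up and unmatched values become NaN:

```python
import pandas as pd

codes = pd.Series(['OH', 'NV', 'OH', 'TX'])

# map with a function: called once per element
lower = codes.map(str.lower)

# map with a dict: each value is looked up; values without a key become NaN
names = codes.map({'OH': 'Ohio', 'NV': 'Nevada'})

print(lower)
print(names)
```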

Sorting and Ranking

Sorting by the row index:

 obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

 obj

 obj.sort_index()

d    0

a    1

b    2

c    3

dtype: int64

a    1

b    2

c    3

d    0

dtype: int64

Sorting by the column index (pass axis=1; the default sorts by the row index):

 frame = pd.DataFrame(np.arange(8).reshape((2, 4)),

                      index=['three', 'one'],

                      columns=['d', 'a', 'b', 'c'])

 frame

 frame.sort_index()

 frame.sort_index(axis=1)

	d	a	b	c

three	0	1	2	3

one	4	5	6	7

        d	a	b	c

one	4	5	6	7

three	0	1	2	3

        a	b	c	d

three	1	2	3	0

one	5	6	7	4

Sorting in descending order:

 frame.sort_index(axis=1, ascending=False)

        d	c	b	a

three	0	3	2	1

one	4	7	6	5

To sort a Series or DataFrame by its values, use the sort_values method. Missing values are placed at the end by default:

 obj = pd.Series([4, 7, -3, 2])

 obj.sort_values()

 obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

 obj.sort_values()

2   -3

3    2

0    4

1    7

dtype: int64

4   -3.0

5    2.0

0    4.0

2    7.0

1    NaN

3    NaN

dtype: float64

 frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

 frame

 frame.sort_values(by='b')

 frame.sort_values(by=['a', 'b'])

Ranking assigns each value its rank in sorted order. By default, tied values receive the mean of the ranks they span.

 obj = pd.Series([7, -5, 7, 4, 2, 4, 0, 4, 4])

 obj.rank()

 obj.rank(method='first')  # break ties by order of appearance instead of averaging

0    8.5

1    1.0

2    8.5

3    5.5

4    3.0

5    5.5

6    2.0

7    5.5

8    5.5

dtype: float64

0    8.0

1    1.0

2    9.0

3    4.0

4    3.0

5    5.0

6    2.0

7    6.0

8    7.0

dtype: float64

Rank in descending order and give tied values the highest rank in their group:

 obj.rank(ascending=False, method='max')

0    2.0

1    9.0

2    2.0

3    6.0

4    7.0

5    6.0

6    8.0

7    6.0

8    6.0

dtype: float64
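The tie-breaking methods are easier to compare side by side. The following sketch ranks one short, made-up Series (scores) with each method. The 'min' and 'dense' methods do not appear in the article; they are included as further options of rank:

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4, 4])

ranks = pd.DataFrame({
    'average': scores.rank(),               # ties share the mean of their ranks
    'min': scores.rank(method='min'),       # ties share the lowest rank
    'max': scores.rank(method='max'),       # ties share the highest rank
    'first': scores.rank(method='first'),   # ties broken by order of appearance
    'dense': scores.rank(method='dense'),   # like 'min', but ranks have no gaps
})

print(ranks)
```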

Rank across the columns within each row (axis=1):

 frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],'c': [-2, 5, 8, -2.5]})

 frame

 frame.rank(axis=1)

        b	a	c

0	4.3	0	-2.0

1	7.0	1	5.0

2	-3.0	0	8.0

3	2.0	1	-2.5

        b	a	c

0	3.0	2.0	1.0

1	3.0	1.0	2.0

2	1.0	2.0	3.0

3	3.0	2.0	1.0

Axis Indexes with Duplicate Labels

A Series with duplicate index labels:

 obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

 obj

 obj.index.is_unique

 obj['a']

 obj['c']

a    0

a    1

b    2

b    3

c    4

dtype: int64

False

a    0

a    1

dtype: int64

4

A DataFrame with duplicate index labels:

 df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

 df

 df.loc['b']

        0	        1	        2

a	-0.975030	2.041130	1.022168

a	0.321428	2.124496	0.037530

b	0.343309	-0.386692	-0.577290

b	0.002090	-0.890841	1.759072

        0	        1	        2

b	0.343309	-0.386692	-0.577290

b	0.002090	-0.890841	1.759072
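With duplicate labels, the type of a selection depends on how often the label occurs, which can surprise downstream code. The sketch below shows this on a small Series and then removes the duplicates with Index.duplicated. Keeping the first entry per label is only one possible policy and is an assumption of this example:

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup = s.loc['a']      # label occurs twice -> Series
single = s.loc['c']   # label occurs once -> scalar

print(type(dup).__name__, type(single).__name__)

# One way to make the labels unique: keep the first entry for each label
first_only = s[~s.index.duplicated(keep='first')]
print(first_only)
```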

Summarizing and Computing Descriptive Statistics

sum computes sums and mean computes averages. Missing values are skipped unless skipna=False is passed:

 df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],

                    [np.nan, np.nan], [0.75, -1.3]],

                   index=['a', 'b', 'c', 'd'],

                   columns=['one', 'two'])

 df

 df.sum()

 df.sum(axis=1)

 df.mean(axis=1, skipna=False)  # do not skip NaN

	one	two

a	1.40	NaN

b	7.10	-4.5

c	NaN	NaN

d	0.75	-1.3

one    9.25

two   -5.80

dtype: float64

a    1.40

b    2.60

c    0.00

d   -0.55

dtype: float64

a      NaN

b    1.300

c      NaN

d   -0.275

dtype: float64
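Note that the all-NaN row 'c' sums to 0.0 above. If that is not wanted, sum offers two stricter variants, sketched here on the same data (min_count is not mentioned in the article; it is an additional parameter of sum):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

default_sum = df.sum(axis=1)                # NaN skipped; an all-NaN row sums to 0
strict_sum = df.sum(axis=1, skipna=False)   # any NaN in a row makes the result NaN
guarded_sum = df.sum(axis=1, min_count=1)   # NaN only when a row has no valid value

print(pd.DataFrame({'default': default_sum,
                    'strict': strict_sum,
                    'guarded': guarded_sum}))
```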

Return the index label of the maximum or minimum value:

 df.idxmax()

 df.idxmin(axis=1)

one    b

two    d

dtype: object

a    one

b    two

c    NaN

d    two

dtype: object

Cumulative sum:

 df.cumsum()

	one	two

a	1.40	NaN

b	8.50	-4.5

c	NaN	NaN

d	9.25	-5.8

The describe method produces several summary statistics at once:

 df.describe()

        one	        two

count	3.000000	2.000000

mean	3.083333	-2.900000

std	3.493685	2.262742

min	0.750000	-4.500000

25%	1.075000	-3.700000

50%	1.400000	-2.900000

75%	4.250000	-2.100000

max	7.100000	-1.300000

 # non-numeric data

 obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

 obj

 obj.describe()

0     a

1     a

2     b

3     c

4     a

5     a

6     b

7     c

8     a

9     a

10    b

11    c

12    a

13    a

14    b

15    c

dtype: object

count     16

unique     3

top        a

freq       8

dtype: object

Some commonly used statistical methods:

count
describe
min, max
argmin, argmax
idxmin, idxmax
quantile
sum
mean
median
mad
prod
var
std
skew
kurt
cumsum
cummin, cummax
cumprod
diff
pct_change
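Several of the listed methods work on neighboring or accumulated elements rather than reducing to one number. A brief sketch on a made-up price series (prices is illustrative data, not from the article):

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 99.0])

changes = prices.diff()           # difference from the previous element
relative = prices.pct_change()    # relative change from the previous element
running_max = prices.cummax()     # running maximum
median = prices.quantile(0.5)     # 50% quantile, i.e. the median

print(changes)
print(relative)
print(running_max)
print(median)
```

The first element of diff and pct_change is NaN because it has no predecessor.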

Correlation and Covariance

First, install a package for loading remote datasets:

pip install pandas-datareader -i https://pypi.douban.com/simple

Download a dataset of stock prices and volumes:

 import pandas_datareader.data as web

 all_data = {ticker: web.get_data_yahoo(ticker)

             for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

 price = pd.DataFrame({ticker: data['Adj Close']

                       for ticker, data in all_data.items()})

 volume = pd.DataFrame({ticker: data['Volume']

                        for ticker, data in all_data.items()})

Compute the percent change in prices:

 returns = price.pct_change()

 returns.tail()

                AAPL	        IBM	        MSFT	        GOOG

Date

2019-08-06	0.018930	-0.000213	0.018758	0.015300

2019-08-07	0.010355	-0.011511	0.004380	0.003453

2019-08-08	0.022056	0.018983	0.026685	0.026244

2019-08-09	-0.008240	-0.028337	-0.008496	-0.013936

2019-08-12	-0.002537	-0.014765	-0.013942	-0.011195

The corr method of Series computes the correlation between two Series, and cov computes the covariance.

 returns['MSFT'].corr(returns['IBM'])

 returns.MSFT.cov(returns['IBM'])

0.48863990166304594

8.714318020797283e-05

The corr and cov methods of DataFrame return the full correlation and covariance matrices:

 returns.corr()

 returns.cov()

	AAPL	        IBM	        MSFT	        GOOG

AAPL	1.000000	0.381659	0.453727	0.459663

IBM	0.381659	1.000000	0.488640	0.402751

MSFT	0.453727	0.488640	1.000000	0.535898

GOOG	0.459663	0.402751	0.535898	1.000000

        AAPL	        IBM	        MSFT	        GOOG

AAPL	0.000266	0.000077	0.000107	0.000117

IBM	0.000077	0.000152	0.000087	0.000077

MSFT	0.000107	0.000087	0.000209	0.000120

GOOG	0.000117	0.000077	0.000120	0.000242

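The corrwith method of DataFrame computes correlations with another Series or DataFrame. Passing a Series correlates it with every column, while passing a DataFrame correlates the columns with matching names: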

 returns.corrwith(returns.IBM)

 returns.corrwith(volume)

AAPL    0.381659

IBM     1.000000

MSFT    0.488640

GOOG    0.402751

dtype: float64

AAPL   -0.061924

IBM    -0.151708

MSFT   -0.089946

GOOG   -0.018591

dtype: float64
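The download step depends on an external data service, so the same calls can also be tried offline. The sketch below uses synthetic random data (returns_demo and its columns X, Y, Z are invented for this example, not real returns) and checks the relationship between the two statistics: the Pearson correlation is the covariance divided by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)     # fixed seed so the run is repeatable
base = rng.normal(size=250)

# Synthetic "returns": Y is built to be correlated with X, Z is independent noise
returns_demo = pd.DataFrame({
    'X': base,
    'Y': 0.5 * base + rng.normal(scale=0.5, size=250),
    'Z': rng.normal(size=250),
})

corr_xy = returns_demo['X'].corr(returns_demo['Y'])
cov_xy = returns_demo['X'].cov(returns_demo['Y'])

# Correlation recomputed by hand from the covariance
manual_corr = cov_xy / (returns_demo['X'].std() * returns_demo['Y'].std())

print(corr_xy, manual_corr)
print(returns_demo.corrwith(returns_demo['X']))
```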

Unique Values, Value Counts, and Membership

The unique method returns an array of the unique values:

 obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

 obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

value_counts returns the frequency of each value:

 obj.value_counts()

 pd.value_counts(obj.values, sort=False)  # top-level function form, unsorted here

a    3

c    3

b    2

d    1

dtype: int64

c    3

b    2

d    1

a    3

dtype: int64

isin checks whether each value in a Series belongs to a given collection:

 obj

 mask = obj.isin(['b', 'c'])

 mask

 obj[mask]

0    c

1    a

2    d

3    a

4    a

5    b

6    b

7    c

8    c

dtype: object

0     True

1    False

2    False

3    False

4    False

5     True

6     True

7     True

8     True

dtype: bool

0    c

5    b

6    b

7    c

8    c

dtype: object

get_indexer converts values into their integer positions in an array of unique values (handy for turning labels into numeric codes):

 to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

 unique_vals = pd.Series(['c', 'b', 'a'])

 pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2])
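The codes produced this way can be mapped back to labels, and values that are not among the unique values are reported as -1. A short sketch (the names labels and categories are illustrative; take is used here as one way to reverse the lookup):

```python
import pandas as pd

labels = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
categories = pd.Index(['c', 'b', 'a'])

codes = categories.get_indexer(labels)         # position of each label in categories
restored = categories.take(codes)              # map the codes back to labels
missing = categories.get_indexer(['a', 'z'])   # 'z' is not present, so it maps to -1

print(codes)
print(list(restored))
print(missing)
```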

Computing a histogram over several columns:

 data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],

                      'Qu2': [2, 3, 1, 2, 3],

                      'Qu3': [1, 5, 2, 4, 4]})

 data

 data.apply(pd.value_counts).fillna(0)

        Qu1	Qu2	Qu3

0	1	2	1

1	3	3	5

2	4	1	2

3	3	2	4

4	4	3	4

        Qu1	Qu2	Qu3

1	1.0	1.0	1.0

2	0.0	2.0	1.0

3	2.0	2.0	0.0

4	2.0	0.0	2.0

5	0.0	0.0	1.0

References

"Python for Data Analysis, 2nd Edition" by Wes McKinney