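Revisiting Python (3): Using the Series Object in Pandas

Pandas is one of the most widely used Python libraries for data analysis and exploration. It is built on top of NumPy, supports SQL-like insert, delete, query, and update operations on structured data, and provides a rich set of data-processing functions. Pandas has two core data structures, Series and DataFrame; this article summarizes the common usage of Series.

Convention: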

import pandas as pd

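1. What is a Series object?

A Series object is essentially a one-dimensional array: a sequence of elements, each consisting of a value and its associated index label.

2. Creating a Series object

A Series object is created mainly by calling the pd.Series constructor. There are two common ways to do so:

(1) Creating from a list

Pass a list to pd.Series. When no index is specified, the default index runs from 0 to N-1.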

ser1=pd.Series([11,22,33,44])

ser1

Out[60]:

0    11

1    22

2    33

3    44

dtype: int64

The index parameter can also be used to specify the index explicitly:

ser2=pd.Series([11,22,33,44],index=['a','b','c','d'])

ser2

Out[61]:

a    11

b    22

c    33

d    44

dtype: int64

(2) Creating from a dict

Pass a dict to pd.Series. The dict keys become the index labels and the dict values become the values.

ser3=pd.Series({'a':11,'d':22,'c':33})

ser3

Out[62]:

a    11

d    22

c    33

dtype: int64
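A dict and an explicit index can also be combined, in which case each index label is looked up in the dict. The following is a minimal sketch of that behavior; the variable names data and picked are illustrative and do not come from the original session.

```python
import pandas as pd

# Illustrative data, mirroring the dict used above.
data = {'a': 11, 'd': 22, 'c': 33}

# Only the labels listed in index are kept, in the order given;
# a label that is missing from the dict ('x') receives NaN.
picked = pd.Series(data, index=['a', 'c', 'x'])
print(picked)
```

Because NaN is a floating-point value, the resulting Series has dtype float64 even though the dict holds integers.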

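## 3. The four main attributes of a Series object
A Series object has four main attributes: index, values, name, and data type.
### (1) Index
**a. Viewing the index**
The **index attribute** of a Series object returns its index as an Index object.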

ser1.index

Out[63]: RangeIndex(start=0, stop=4, step=1)

ser2.index

Out[64]: Index([u'a', u'b', u'c', u'd'], dtype='object')

An index may contain duplicate labels; the is_unique attribute of the Index object reports whether all labels are unique.

ser1.index.is_unique

Out[65]: True
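To see what a non-unique index looks like in practice, here is a small sketch. The Series dup and its labels are made up for illustration.

```python
import pandas as pd

# A Series whose index contains the label 'a' twice.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# is_unique is False because of the repeated label.
print(dup.index.is_unique)

# Selecting a duplicated label returns a Series with every match,
# not a single scalar.
both = dup['a']
print(both)
```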

b. Modifying the index

An Index object is an immutable array; its individual values cannot be modified in place.

ser1.index[0]=5

Traceback (most recent call last):

  File "<ipython-input-68-2029117c9570>", line 1, in <module>

    ser1.index[0]=5

  File "/usr/local/share/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.py", line 1404, in __setitem__

    raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

To change the index of a Series object, assign a whole new index to its index attribute.

ser1.index=[5,6,7,8]

ser1.index

Out[70]: Int64Index([5, 6, 7, 8], dtype='int64')
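The two behaviors above, plus a non-mutating alternative, can be sketched as follows. The use of rename with a mapping is an addition for illustration and is not part of the original session; the names s, mutated, and relabeled are hypothetical.

```python
import pandas as pd

s = pd.Series([11, 22, 33, 44])

# Item assignment on an Index raises TypeError.
try:
    s.index[0] = 5
    mutated = True
except TypeError:
    mutated = False

# rename with a mapping returns a new Series with the mapped labels
# changed; s itself is left untouched.
relabeled = s.rename({0: 5})

# Assigning to the index attribute replaces the whole index in place.
s.index = [5, 6, 7, 8]
print(mutated, list(relabeled.index), list(s.index))
```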

c. Reindexing

Use the reindex method to rearrange the index.

ser2.reindex(['b','a','c','d'])

Out[73]:

b    22

a    11

c    33

d    44

dtype: int64

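Reindexing produces a new Series object; the original object is unchanged.

Reindexing serves three purposes:

① Specifying an order for the existing labels, i.e. rearranging the existing elements;

② Dropping an existing label, i.e. removing the corresponding element;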

ser2.reindex(['b','a','d'])

Out[74]:

b    22

a    11

d    44

dtype: int64

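③ Adding a new label, i.e. adding a new element whose value is NaN. Because NaN is a floating-point value, the dtype of the result changes from int64 to float64.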

ser2.reindex(['b','a','e','c','d'])

Out[75]:

b    22.0

a    11.0

e     NaN

c    33.0

d    44.0

dtype: float64
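If NaN is not wanted for new labels, reindex accepts a fill_value argument. A minimal sketch, with an illustrative Series s that mirrors ser2:

```python
import pandas as pd

s = pd.Series([11, 22, 33, 44], index=['a', 'b', 'c', 'd'])

# New label 'e' is filled with 0 instead of NaN, so the integer
# dtype is preserved. s itself is not modified.
filled = s.reindex(['b', 'a', 'e'], fill_value=0)
print(filled)
```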

d. Sorting by index

Use the sort_index method to sort by the existing index in ascending or descending order.

ser3.sort_index()

Out[80]:

a    11

c    33

d    22

dtype: int64

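By default the result is sorted by index label in ascending order. Sorting produces a new Series object; the original object is unchanged.

e. Checking whether a label exists

The in operator, applied to a Series, tests whether a label exists in the index, not whether a value exists.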

'a' in ser3

Out[110]: True

11 in ser3

Out[111]: False

(2) Values

a. Viewing the values

The values attribute of a Series object returns the values as an array object (a NumPy ndarray).

ser1.values

Out[81]: array([11, 22, 33, 44])

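b. Modifying the values

The values of a Series object can be changed by modifying the array returned by the values attribute; in the pandas version used here, this modifies the original object directly. In recent pandas versions with Copy-on-Write enabled, that array may be read-only, in which case assigning through the Series itself (for example ser1.iloc[1]=23) is the safer route.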

ser1.values[1]=23

ser1

Out[83]:

5    11

6    23

7    33

8    44

dtype: int64

c. Sorting by value

Use the sort_values method to sort by value in ascending or descending order.

ser3.sort_values()

Out[84]:

a    11

d    22

c    33

dtype: int64

By default the result is sorted by value in ascending order. Sorting produces a new Series object; the original object is unchanged.
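Descending order is requested with ascending=False, for both sort_values and sort_index. A short sketch, using an illustrative Series s with the same labels and values as ser3:

```python
import pandas as pd

s = pd.Series([11, 22, 33], index=['a', 'd', 'c'])

# Largest value first.
desc = s.sort_values(ascending=False)

# Labels in reverse alphabetical order.
by_label_desc = s.sort_index(ascending=False)

print(desc)
print(by_label_desc)
```

Both calls return new objects, so s keeps its original order.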

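d. Ranking values

Use the rank method to obtain the rank of each element's value. (The output below reflects a later state of ser2, after the reindexing steps in section 4, which is why its labels are a, c, f.)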

ser2.rank()

Out[145]:

a    1.0

c    2.0

f    3.0

dtype: float64

By default ranks are assigned in ascending order; tied values receive the average of their ranks.
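The tie-handling rule is easier to see with data that actually contains a tie. The Series scores below is made up for illustration.

```python
import pandas as pd

scores = pd.Series([10, 20, 20, 30], index=['a', 'b', 'c', 'd'])

# Default: ties share the average of the ranks they occupy (2 and 3).
avg_rank = scores.rank()

# method='min' gives every tied value the lowest rank of the group.
min_rank = scores.rank(method='min')

# ascending=False ranks the largest value first.
desc_rank = scores.rank(ascending=False)

print(avg_rank.tolist())   # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())   # [1.0, 2.0, 2.0, 4.0]
print(desc_rank.tolist())  # [4.0, 2.5, 2.5, 1.0]
```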

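e. Checking whether a value exists

Use the isin method, which takes a list of candidate values and returns a boolean Series. (ser6 is defined in section 4 (6) below; its values are numbers, so testing for the string 'a' yields False for every element.)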

ser6.isin(['a'])

Out[164]:

a    False

b    False

d    False

e    False

dtype: bool

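(3) Name

A Series object has a name, available through its name attribute.

The Index object of a Series also has a name, available through the name attribute of the Index object.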
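Since no session output accompanies the two name attributes, here is a brief sketch. The Series prices and the names 'price' and 'item' are hypothetical.

```python
import pandas as pd

# name can be set at construction time.
prices = pd.Series([11, 22], index=['a', 'b'], name='price')

# The index name is set on the Index object.
prices.index.name = 'item'

print(prices.name)        # price
print(prices.index.name)  # item
print(prices)
```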

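(4) Data type

The data type is available through the dtype attribute of a Series object. (The float64 shown below reflects ser2 after it was reindexed in section 4.)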

ser2.dtype

Out[146]: dtype('float64')

## 4. Element operations
### (1) Selecting elements
**Selecting a single element:**
**a. By index label**

ser2['b']

Out[90]: 22

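b. By integer position (in recent pandas versions, plain integer keys on a non-integer index are deprecated, and ser2.iloc[1] is the recommended spelling)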

ser2[1]

Out[91]: 22

Selecting multiple elements:

a. By a list of index labels

ser2[['a','c']]

Out[93]:

a    11

c    33

dtype: int64

b. By a slice of index labels

ser2['a':'d']

Out[94]:

a    11

b    22

c    33

d    44

dtype: int64

c. By a slice of integer positions

ser2[0:3]

Out[92]:

a    11

b    22

c    33

dtype: int64

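Note: b and c differ in that the label slice in b includes the element at the right endpoint, whereas the positional slice in c excludes it.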
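The label-versus-position distinction can be made explicit with the loc and iloc accessors, which are not used in the original session. A minimal sketch with an illustrative Series s that mirrors ser2:

```python
import pandas as pd

s = pd.Series([11, 22, 33, 44], index=['a', 'b', 'c', 'd'])

# loc selects by label; a label slice includes the right endpoint.
by_label = s.loc['a':'c']

# iloc selects by integer position; the right endpoint is excluded.
by_pos = s.iloc[0:2]

# A single element by position.
one = s.iloc[1]

print(by_label.tolist())  # [11, 22, 33]
print(by_pos.tolist())    # [11, 22]
print(one)                # 22
```

Using loc and iloc removes any ambiguity about whether an integer key means a label or a position.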

(2) Filtering elements

Elements can be filtered directly with a comparison on the values.

ser2[ser2>30]

Out[95]:

c    33

d    44

dtype: int64
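Several conditions can be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses. A short sketch, again with an illustrative Series s:

```python
import pandas as pd

s = pd.Series([11, 22, 33, 44], index=['a', 'b', 'c', 'd'])

# Values strictly between 20 and 40.
middle = s[(s > 20) & (s < 40)]

# Values below 20 or above 40.
outer = s[(s < 20) | (s > 40)]

# Negation: values that are not greater than 30.
not_big = s[~(s > 30)]

print(middle.tolist())   # [22, 33]
print(outer.tolist())    # [11, 44]
print(not_big.tolist())  # [11, 22]
```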

(3) Adding elements

a. By assignment

ser2['e']=55

ser2

Out[97]:

a    11

b    22

c    33

d    44

e    55

dtype: int64

b. By reindexing (note that reindex produces a new object and does not modify the original)

ser2=ser2.reindex(['a','c','f'])

ser2

Out[100]:

a    11.0

c    33.0

f     NaN

dtype: float64

(4) Deleting elements

Use the drop method to delete elements. drop produces a new object and does not modify the original.

ser2=ser2.drop('f')

ser2

Out[106]:

a    11.0

c    33.0

dtype: float64

(5) Arithmetic operations

Series objects support arithmetic operations directly; the operation is applied to every element.

ser2+2

Out[107]:

a    13.0

c    35.0

dtype: float64

ser2*2

Out[108]:

a    22.0

c    66.0

dtype: float64

(6) Getting unique values

Use the unique method to get the unique values of the elements.

ser6=pd.Series([11,22,44,22],index=['a','b','d','e'])

ser6

Out[159]:

a    11

b    22

d    44

e    22

dtype: int64

ser6.unique()

Out[160]: array([11, 22, 44])

Use the value_counts method to get the frequency of each unique value.

ser6.value_counts()

Out[161]:

22    2

11    1

44    1

dtype: int64
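Two closely related calls are worth knowing: value_counts(normalize=True) for relative frequencies and nunique for the number of distinct values. They are not part of the original session; the sketch below uses an illustrative Series s with the same data as ser6.

```python
import pandas as pd

s = pd.Series([11, 22, 44, 22], index=['a', 'b', 'd', 'e'])

# Relative frequency of each unique value (the shares sum to 1).
share = s.value_counts(normalize=True)

# Number of distinct values.
n_unique = s.nunique()

print(share.loc[22])  # 0.5
print(n_unique)       # 3
```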

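(7) Checking whether an element exists

a. Using in

The in operator, applied to a Series, tests whether a label exists in the index, not whether a value exists.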

'a' in ser3

Out[110]: True

11 in ser3

Out[111]: False

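b. Using the isin method

The isin method takes a list of candidate values and returns a boolean Series. Here the values of ser6 are numbers, so testing for the string 'a' yields False for every element.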

ser6.isin(['a'])

Out[164]:

a    False

b    False

d    False

e    False

dtype: bool
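For contrast, the sketch below passes values that actually occur in the data and shows one way to ask "does this value exist anywhere?". The Series s mirrors ser6; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([11, 22, 44, 22], index=['a', 'b', 'd', 'e'])

# Element-wise membership test against a list of values.
mask = s.isin([22, 44])

# in looks at the labels only.
label_found = 'a' in s
value_as_label = 22 in s

# To test for a value, reduce the isin mask with any().
value_found = bool(s.isin([22]).any())

print(mask.tolist())    # [False, True, True, True]
print(label_found)      # True
print(value_as_label)   # False
print(value_found)      # True
```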

(8) Checking for missing values

Use the isnull or notnull method to test each element for a missing value.

ser3.isnull()

Out[114]:

a    False

c    False

d    False

dtype: bool

ser3.notnull()

Out[115]:

a    True

c    True

d    True

dtype: bool
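Because ser3 has no missing values, the masks above are uniform. The sketch below uses made-up data that does contain NaN and shows how the masks are typically reduced or used for selection.

```python
import numpy as np
import pandas as pd

s = pd.Series([11, np.nan, 33, np.nan], index=['a', 'b', 'c', 'd'])

# Count the missing values: True counts as 1 when summed.
n_missing = int(s.isnull().sum())

# Is there at least one missing value?
has_missing = bool(s.isnull().any())

# Keep only the elements that are present.
present = s[s.notnull()]

print(n_missing)             # 2
print(has_missing)           # True
print(list(present.index))   # ['a', 'c']
```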

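(9) Handling missing values

There are two main ways to handle missing values: filling and filtering.

a. Filling

Use the fillna method to fill missing values. It produces a new object and does not modify the original.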

ser2=ser2.reindex(['a','c','h'])

ser2=ser2.fillna(99)

ser2

Out[125]:

a    11.0

c    33.0

h    99.0

dtype: float64

b. Filtering

Use the dropna method to filter out missing values. It produces a new object and does not modify the original.

ser6=ser6.reindex(['a','b','d','f'])

ser6

Out[168]:

a    11.0

b    22.0

d    44.0

f     NaN

dtype: float64

ser6.dropna()

Out[169]:

a    11.0

b    22.0

d    44.0

dtype: float64
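Besides a constant, the fill value is often derived from the data itself. The following sketch shows two such choices, filling with the mean and carrying the previous value forward with ffill; both are additions for illustration, and the Series s is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0], index=['a', 'b', 'c'])

# mean() skips NaN by default, so the mean here is 20.0.
mean_filled = s.fillna(s.mean())

# ffill propagates the last valid value forward.
forward_filled = s.ffill()

print(mean_filled.tolist())     # [10.0, 20.0, 30.0]
print(forward_filled.tolist())  # [10.0, 10.0, 30.0]
```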

(10) Filtering duplicate values

The duplicated method returns a boolean Series marking which elements are duplicates of an earlier element.

ser7=pd.Series([11,22,44,22,11],index=['a','b','d','e','h'])

ser7

Out[173]:

a    11

b    22

d    44

e    22

h    11

dtype: int64

ser7.duplicated()

Out[174]:

a    False

b    False

d    False

e     True

h     True

dtype: bool

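Use the drop_duplicates method to filter out the duplicate values. It does not modify the original object; it produces a new Series object without duplicates.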

ser7.drop_duplicates()

Out[175]:

a    11

b    22

d    44

dtype: int64
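Both methods keep the first occurrence by default. The keep parameter changes that, as this sketch shows; the Series s mirrors ser7.

```python
import pandas as pd

s = pd.Series([11, 22, 44, 22, 11], index=['a', 'b', 'd', 'e', 'h'])

# keep='last' retains the last occurrence of each value.
last_kept = s.drop_duplicates(keep='last')

# keep=False marks every member of a duplicated group.
all_dupes = s.duplicated(keep=False)

print(list(last_kept.index))  # ['d', 'e', 'h']
print(all_dupes.tolist())     # [True, True, False, True, True]
```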

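(11) Replacing specified values

Use the replace method to replace specified values. The first argument is the old value and the second is the new value. The original object is not modified; a new object is returned.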

ser7

Out[177]:

a    11

b    22

d    44

e    22

h    11

dtype: int64

ser7.replace(44,55)

Out[178]:

a    11

b    22

d    55

e    22

h    11

dtype: int64

To replace several values with the same new value in one call, pass the old values as a list.

ser7.replace([44,11],55)

Out[180]:

a    55

b    22

d    55

e    22

h    55

dtype: int64

To replace several values with different new values in one call, pass a dict that maps each old value to its new value.

ser7.replace({44:55,11:66})

Out[182]:

a    66

b    22

d    55

e    22

h    66

dtype: int64

(12) Summary statistics

Common statistical methods: sum, mean, and cumsum (cumulative sum).

ser7.sum()

Out[183]: 110

ser7.mean()

Out[184]: 22.0

ser7.cumsum()

Out[185]:

a     11

b     33

d     77

e     99

h    110

dtype: int64

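The describe method can also be used to generate descriptive statistics directly.

a. When the elements are numeric, the result includes the mean, maximum, minimum, standard deviation, element count, and percentiles.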

ser7.describe()

Out[186]:

count     5.000000

mean     22.000000

std      13.472194

min      11.000000

25%      11.000000

50%      22.000000

75%      22.000000

max      44.000000

dtype: float64

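b. When the elements are non-numeric (object dtype), the result includes the element count, the number of unique values, the most frequent value, and the frequency of that value.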

ser8=pd.Series({'a':'v1','b':'v2','c':'v3'})

ser8

Out[189]:

a    v1

b    v2

c    v3

dtype: object

ser8.describe()

Out[190]:

count      3

unique     3

top       v3

freq       1

dtype: object
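Individual statistics from describe can also be computed one at a time, or several at once with agg. The sketch below is an illustration rather than part of the original session; s mirrors ser7.

```python
import pandas as pd

s = pd.Series([11, 22, 44, 22, 11], index=['a', 'b', 'd', 'e', 'h'])

# Median, i.e. the 50% row of describe().
median = s.median()

# Sample standard deviation (ddof=1), the same figure describe() reports.
spread = s.std()

# Several statistics in one call; the result is a Series
# indexed by the function names.
summary = s.agg(['min', 'max'])

print(median)           # 22.0
print(round(spread, 6)) # 13.472194
print(summary)
```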

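## 5. Operations between Series objects
### (1) Arithmetic between Series
The two Series are aligned automatically by index label and the operation is applied to matching elements. A label that appears in only one of them produces NaN in the result.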

ser4=pd.Series([11,22,44],index=['a','b','d'])

ser5=pd.Series([11,33,44],index=['a','c','d'])

ser4+ser5

Out[126]:

a    22.0

b     NaN

c     NaN

d    88.0

dtype: float64
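When NaN for unmatched labels is not the desired outcome, the add method with fill_value treats the missing side as a given number. A minimal sketch, where s4 and s5 mirror ser4 and ser5:

```python
import pandas as pd

s4 = pd.Series([11, 22, 44], index=['a', 'b', 'd'])
s5 = pd.Series([11, 33, 44], index=['a', 'c', 'd'])

# Labels present on only one side are paired with 0 instead of NaN.
total = s4.add(s5, fill_value=0)

print(total)
# a    22.0
# b    22.0
# c    33.0
# d    88.0
# dtype: float64
```

The sub, mul, and div methods accept fill_value in the same way.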

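(2) Concatenating Series

a. Using the append method

The append method concatenates two Series objects. It places no requirement on their data types, and index labels may repeat. The result is a new object; the originals are not modified. Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the concat function described in b is the one to use.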

ser4.append(ser5)

Out[127]:

a    11

b    22

d    44

a    11

c    33

d    44

dtype: int64
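The same row-wise join can be written with pd.concat, which works across pandas versions. The sketch also shows ignore_index, an option not covered in the original text, for discarding the repeated labels; s4 and s5 mirror ser4 and ser5.

```python
import pandas as pd

s4 = pd.Series([11, 22, 44], index=['a', 'b', 'd'])
s5 = pd.Series([11, 33, 44], index=['a', 'c', 'd'])

# Equivalent of s4.append(s5): labels are kept, duplicates included.
joined = pd.concat([s4, s5])

# ignore_index=True replaces the labels with 0..N-1.
renumbered = pd.concat([s4, s5], ignore_index=True)

print(list(joined.index))      # ['a', 'b', 'd', 'a', 'c', 'd']
print(list(renumbered.index))  # [0, 1, 2, 3, 4, 5]
```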

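b. Using the concat function

The pd.concat function concatenates two or more Series objects. It places no requirement on their data types, and index labels may repeat. The result is a new object; the originals are not modified.

① With the default axis=0, the Series objects are stacked row-wise.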

ser1

Out[191]:

5    11

6    23

7    33

8    44

dtype: int64

ser2

Out[192]:

a    11.0

c    33.0

f    99.0

dtype: float64

ser3

Out[193]:

a    11

c    33

d    22

dtype: int64

pd.concat([ser1,ser2,ser3])

Out[194]:

5    11.0

6    23.0

7    33.0

8    44.0

a    11.0

c    33.0

f    99.0

a    11.0

c    33.0

d    22.0

dtype: float64

② With axis=1, the Series objects are combined column-wise into a DataFrame object: each Series becomes its own column, and the rows are aligned by index label.

pd.concat([ser1,ser2,ser3],axis=1)

Out[195]:

      0     1     2

5  11.0   NaN   NaN

6  23.0   NaN   NaN

7  33.0   NaN   NaN

8  44.0   NaN   NaN

a   NaN  11.0  11.0

c   NaN  33.0  33.0

d   NaN   NaN  22.0

f   NaN  99.0   NaN
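The column labels 0, 1, 2 above are assigned automatically. As a sketch of two options not shown in the original session, keys names the columns and join='inner' keeps only the labels shared by every Series. The column names 'left' and 'right' are hypothetical; s4 and s5 mirror ser4 and ser5.

```python
import pandas as pd

s4 = pd.Series([11, 22, 44], index=['a', 'b', 'd'])
s5 = pd.Series([11, 33, 44], index=['a', 'c', 'd'])

# keys supplies the column names of the resulting DataFrame.
wide = pd.concat([s4, s5], axis=1, keys=['left', 'right'])

# join='inner' keeps only the rows whose label occurs in both Series.
common = pd.concat([s4, s5], axis=1, keys=['left', 'right'], join='inner')

print(wide)
print(common)
```

With the default outer join, labels found in only one Series get NaN in the other column, as in the output above.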

## 6. References and acknowledgments
\[1] [Python for Data Analysis (利用Python进行数据分析)](https://book.douban.com/subject/25779298/)