The pandas module (data analysis): Series

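pandas is a powerful Python toolkit for data analysis.

pandas is built on top of NumPy.

Main features of pandas:

Data structures with built-in alignment: DataFrame and Series

Integrated time-series functionality

A rich set of mathematical operations

Flexible handling of missing data

Installation: pip install pandas
Import: import pandas as pd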

------> All examples below were run in IPython <------

Series

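A Series is an object similar to a one-dimensional array; it consists of a set of values and a set of labels (the index) associated with them

A Series behaves like a combination of a list (array) and a dict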

 import pandas as pd

 a=pd.Series([12,34,56,87,9800])

 a

 0      12

 1      34

 2      56

 3      87

 4    9800

 dtype: int64

 # the left column is the index (position) of each element

 # custom labels can be supplied as well

 b=pd.Series([445,234,12,688,33],index=list("abcds"))

 b

 a    445

 b    234

 c     12

 d    688

 s     33

 dtype: int64

 # look up a value by label, like a dict

 b["a"]

 445

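 # positional access with [] on a labelled Series works in older pandas; newer releases deprecate it in favor of b.iloc[0]

 b[0]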

 445

 # get the labels on the left; the result is an Index object

 b.index

 Index(['a', 'b', 'c', 'd', 's'], dtype='object')

 # get the values on the right, as a NumPy array

 b.values

 array([445, 234,  12, 688,  33], dtype=int64)

Series supports NumPy features

Creating a Series from an ndarray: Series(arr)

 import numpy as np

 c=pd.Series(np.arange(10,15))

 c

 0    10

 1    11

 2    12

 3    13

 4    14

 dtype: int32

Arithmetic with scalars and with other Series

 c*2

 0    20

 1    22

 2    24

 3    26

 4    28

 dtype: int32

 c+c+c

 0    30

 1    33

 2    36

 3    39

 4    42

 dtype: int32

 a+c

 0      22

 1      45

 2      68

 3     100

 4    9814

 dtype: int64

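Indexing and slicing (as with NumPy, a slice in older pandas versions is a view rather than a separate copy, so modifying the slice also changes the original data; call copy() on the slice to avoid this. Newer pandas releases with copy-on-write enabled no longer propagate such changes.)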

 c

 0    10

 1    11

 2    12

 3    13

 4    14

 dtype: int32

 # fancy indexing

 c[[0,2,4]]

 0    10

 2    12

 4    14

 dtype: int32

 c[1:3]

 1    11

 2    12

 dtype: int32

 f=c[2:5]

 f

 2    12

 3    13

 4    14

 dtype: int32

 # modifying f also modifies c, because f is a view of c here

 f[3]=345

 c

 0     10

 1     11

 2     12

 3    345

 4     14

 dtype: int32

 e=c[1:4].copy()

 # e has no label 0, so assigning to e[0] adds a new entry with label 0, just as a dict would

 e[0]=23

 e[1]=255

 e

 1    255

 2     12

 3    345

 0     23

 dtype: int64

 # this time c is not changed

 c

 0     10

 1     11

 2     12

 3    345

 4     14

 dtype: int32
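
The effect of copy() can also be checked in isolation. The following is a minimal sketch, not part of the session above; the names original and independent are made up for illustration, and iloc is used so that positions are explicit.

```python
import numpy as np
import pandas as pd

original = pd.Series(np.arange(10, 15))

# Take positions 1..3 and make an explicit, independent copy.
independent = original.iloc[1:4].copy()
independent.iloc[0] = 255

# The copy changed; the original did not.
print(independent.tolist())
print(original.tolist())
```

Because the copy owns its own data, this behaves the same way whether or not the installed pandas version treats plain slices as views.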

Functions and Boolean filtering

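 # re-create c so that the results below use the original values 10..14

 c=pd.Series(np.arange(10,15))

 c.max()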

 14

 c[c>12]

 3    13

 4    14

 dtype: int32

 # select the elements at positions 0 and 4 with a Boolean list

 c[[True,False,False,False,True]]

 0    10

 4    14

 dtype: int32
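
Boolean filters can be combined. The sketch below is an illustration that goes slightly beyond the session above; the names values and in_range are made up for the example.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 15))

# Each comparison yields a Boolean Series; & means "and", | means "or".
# The parentheses are required because & binds tighter than > and <.
in_range = (values > 10) & (values < 14)
selected = values[in_range]
print(selected.tolist())
```

The selected elements keep their original labels, so the result here is labelled 1, 2 and 3 rather than being renumbered from 0.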

Series supports dict features

Creating a Series from a dict: Series(dic)

 import pandas as pd

 a=pd.Series({"a":12,"b":23,"c":22,"e":2,"f":9})

 a

 a    12

 b    23

 c    22

 e     2

 f     9

 dtype: int64

The in operator: 'a' in sr

 #  for a Python dict, in tests the keys; for a pandas Series, it tests the labels

 "a"  in a

 True

 "d" in a

 False
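
Since in only looks at labels, testing for a value needs a different expression. A small sketch, assuming a Series named sr built just for this example:

```python
import pandas as pd

sr = pd.Series({"a": 12, "b": 23, "c": 22})

label_found = "a" in sr          # membership test against the labels
value_as_label = 12 in sr        # 12 is a value, not a label
value_found = 12 in sr.values    # membership test against the values
print(label_found, value_as_label, value_found)
```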

Label indexing: sr['a'], sr[['a', 'b', 'd']]

 a["c"]

 22

 # fancy indexing

 a[["a","c"]]

 a    12

 c    22

 dtype: int64

 # positional indexing works too (in older pandas; newer releases deprecate this form in favor of a.iloc[1])

 a[1]

 23

 # a for loop, however, iterates over the values, not the labels

 for i in a:

     print(i)

 12

 23

 22

 2

 9
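
When the labels are needed during iteration, items() yields label and value together. A minimal sketch; the Series sr and the list names are made up for the example:

```python
import pandas as pd

sr = pd.Series({"a": 12, "b": 23, "c": 22})

# items() yields (label, value) pairs, much like dict.items().
pairs = [(label, int(value)) for label, value in sr.items()]

# The labels alone are available through the index.
labels = list(sr.index)
print(pairs)
print(labels)
```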

Looking up values with get

 a.get("c")

 22

 # if there is no match, get returns the given default; without a default it returns None and raises no error

 a.get("werwe",default=0)

 0
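
The difference between get and square-bracket lookup for a missing label can be sketched as follows; the label "missing" and the variable names are made up for the example:

```python
import pandas as pd

sr = pd.Series({"a": 12, "b": 23})

without_default = sr.get("missing")            # no error, the result is None
with_default = sr.get("missing", default=0)    # the fallback value is returned

# Square brackets raise KeyError for an unknown label instead.
try:
    sr["missing"]
    raised = False
except KeyError:
    raised = True
print(without_default, with_default, raised)
```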

Label slicing

 # unlike list slicing, a label slice includes both endpoints, so the f label is returned as well

 a["c":"f"]

 c    22

 e     2

 f     9

 dtype: int64

Integer indexes and label-based indexing

 import numpy as np

 a=pd.Series(np.arange(10,25))

 a

 0     10

 1     11

 2     12

 3     13

 4     14

 5     15

 6     16

 7     17

 8     18

 9     19

 10    20

 11    21

 12    22

 13    23

 14    24

 dtype: int32

 b=a[8:].copy()

 b

 8     18

 9     19

 10    20

 11    21

 12    22

 13    23

 14    24

 dtype: int32

 #   suppose we want the last value of b, using the same indexing as before

 b[-1]

 ---------------------------------------------------------------------------

 KeyError                                  Traceback (most recent call last)

 <ipython-input-40-122a9a7eadfa> in <module>()

       1 #   suppose we want the last value of b, using the same indexing as before

       2

 ----> 3 b[-1]

 d:\program files (x86)\python35\lib\site-packages\pandas\core\series.py in __getitem__(self, key)

     599         key = com._apply_if_callable(key, self)

     600         try:

 --> 601             result = self.index.get_value(self, key)

     602

     603             if not is_scalar(result):

 d:\program files (x86)\python35\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)

    2475         try:

    2476             return self._engine.get_value(s, k,

 -> 2477                                           tz=getattr(series.dtype, 'tz', None))

    2478         except KeyError as e1:

    2479             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

 pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4404)()

 pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4087)()

 pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()

 pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:14031)()

 pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:13975)()

 KeyError: -1

 b[7]

 ---------------------------------------------------------------------------

 KeyError                                  Traceback (most recent call last)

 <ipython-input-41-b01957796a20> in <module>()

 ----> 1 b[7]

 d:\program files (x86)\python35\lib\site-packages\pandas\core\series.py in __getitem__(self, key)

     599         key = com._apply_if_callable(key, self)

     600         try:

 --> 601             result = self.index.get_value(self, key)

     602

     603             if not is_scalar(result):

 d:\program files (x86)\python35\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)

    2475         try:

    2476             return self._engine.get_value(s, k,

 -> 2477                                           tz=getattr(series.dtype, 'tz', None))

    2478         except KeyError as e1:

    2479             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

 pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4404)()

 pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4087)()

 pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()

 pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:14031)()

 pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:13975)()

 KeyError: 7

 b[8]

 18

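Both errors above have the same cause: when the index itself consists of integers, an integer key could be read either as a position or as a label, and pandas always reads it as a label. The labels of b are 8 through 14, so b[-1] looks for the label -1 and b[7] looks for the label 7; neither exists, hence the KeyError. b[8] works because 8 is a label.

iloc: look up by position (positions start at 0, so the first element is at position 0)

 #  what if I know nothing else about the Series and just want the last value?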

 b.iloc[-1]

 24

 # positions in b start at 0, not at 8

 b.iloc[4]

 22

 # a list of positions works as well

 b.iloc[[1,3,5]]

 9     19

 11    21

 13    23

 dtype: int32

 # so does a slice of positions (the end position is excluded)

 b.iloc[2:7]

 10    20

 11    21

 12    22

 13    23

 14    24

 dtype: int32

loc: look up by label

 b

 8     18

 9     19

 10    20

 11    21

 12    22

 13    23

 14    24

 dtype: int32

 b.loc[10]

 20

 b.loc[10:13]

 10    20

 11    21

 12    22

 13    23

 dtype: int32

 b.loc[[10,13]]

 10    20

 13    23

 dtype: int32

 # keys are read as labels here, so the slice includes both endpoints

 b.loc[8:9]

 8    18

 9    19

 dtype: int32

loc and iloc only determine how the key is interpreted; single keys, lists of keys, and slices can all still be passed to them
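
The two accessors can be compared side by side. A minimal sketch, using a small Series s built just for this example:

```python
import pandas as pd

s = pd.Series([18, 19, 20, 21], index=[8, 9, 10, 11])

first_by_label = s.loc[8]       # label 8
first_by_position = s.iloc[0]   # position 0, the same element
last = s.iloc[-1]               # negative positions count from the end

label_slice = s.loc[8:10]       # labels 8, 9 and 10: the end label is included
position_slice = s.iloc[0:2]    # positions 0 and 1: the end position is excluded
print(len(label_slice), len(position_slice))
```

Spelling out loc or iloc removes the ambiguity that caused the KeyError earlier, which is why it is usually the safer choice for a Series with an integer index.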

Data alignment and missing data

 a= pd.Series([12,23,34], index=['c','a','d'])

 a

 c    12

 a    23

 d    34

 dtype: int64

 b = pd.Series([11,20,10], index=['d','c','a',])

 b

 d    11

 c    20

 a    10

 dtype: int64

 a.values+b.values

 array([23, 43, 44], dtype=int64)

 a+b

 a    33

 c    32

 d    45

 dtype: int64

pandas uses NaN (Not a Number) to represent missing data. Its value is np.nan. In numeric data, the built-in None is treated as NaN as well
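
The treatment of None can be checked directly. A minimal sketch; the Series s is made up for the example, and isna() is used as an equivalent of the isnull() method that appears later in these notes:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, None, 3])

# None was converted to NaN, which forces a floating-point dtype.
print(s.dtype)
missing = s.isna()

# NaN never compares equal to itself, so test with isna()/isnull(), not ==.
nan_equals_itself = (np.nan == np.nan)
print(missing.tolist(), nan_equals_itself)
```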

 a= pd.Series([11,20,10], index=['d','c','a',])

 a

 d    11

 c    20

 a    10

 dtype: int64

 b= pd.Series([11,20,10], index=['d','c','f',])

 b

 d    11

 c    20

 f    10

 dtype: int64

 c=a.add(b)

 c

 a     NaN

 c    40.0

 d    22.0

 f     NaN

 dtype: float64

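pandas aligns the two indexes first; if they differ, the index of the result is the union of the two

 # Suppose I don't want NaN and would rather treat a missing entry as 0 (salaries, for example: if I joined this month, my total in the year-end tally should not be NaN)

 # the naive approach is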
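
The alignment rule can be verified on two small Series. This is a sketch with the same values as the example above; the names left, right, and shared are made up for illustration:

```python
import pandas as pd

left = pd.Series([11, 20, 10], index=["d", "c", "a"])
right = pd.Series([11, 20, 10], index=["d", "c", "f"])

total = left + right

# Labels present on both sides are added; the others become NaN.
shared = left.index.intersection(right.index)
print(sorted(total.index))            # every label from either side
print(sorted(total.dropna().index))   # only the shared labels have a value
```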

 c["a"]=0

 c

 a     0.0

 c    40.0

 d    22.0

 f     NaN

 dtype: float64

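 # with a lot of data, fixing entries one by one is tedious

 # when a label is missing on one side, substitute 0 for it so that the addition can always be carried out (one operand is 0); note that this does not replace NaN entries in an already computed result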

 c=a.add(b,fill_value=0)

 c

 a    10.0

 c    40.0

 d    22.0

 f    10.0

 dtype: float64

 c=a.add(b,fill_value=100)

 c

 a    110.0

 c     40.0

 d     22.0

 f    110.0

 dtype: float64
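
The difference between filling before the operation and filling afterwards can be made concrete. A minimal sketch with the same values as above; the variable names are made up for illustration:

```python
import pandas as pd

left = pd.Series([11, 20, 10], index=["d", "c", "a"])
right = pd.Series([11, 20, 10], index=["d", "c", "f"])

# fill_value substitutes 0 for the missing operand, then adds.
filled_before = left.add(right, fill_value=0)

# fillna replaces NaN in the finished result, so the lone values are lost.
filled_after = (left + right).fillna(0)

print(filled_before.to_dict())
print(filled_after.to_dict())
```

With fill_value the unmatched labels a and f keep their own value of 10, whereas fillna(0) on the result turns them into 0.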

Addition, subtraction, multiplication, and division

 a

 d    11

 c    20

 a    10

 dtype: int64

 b

 d    11

 c    20

 f    10

 dtype: int64

 # subtraction

 c=a.sub(b,fill_value=1)

 c

 a    9.0

 c    0.0

 d    0.0

 f   -9.0

 dtype: float64

 # division

 c=a.div(b,fill_value=1)

 c

 a    10.0

 c     1.0

 d     1.0

 f     0.1

 dtype: float64

 # multiplication

 c=a.mul(b,fill_value=1)

 c

 a     10.0

 c    400.0

 d    121.0

 f     10.0

 dtype: float64

Handling missing values in a result

 c=a+b

 c

 a     NaN

 c    40.0

 d    22.0

 f     NaN

 dtype: float64

 # drop the entries that are NaN

 c.dropna()

 c    40.0

 d    22.0

 dtype: float64

 #  fill in the missing data

 c.fillna(0)

 a     0.0

 c    40.0

 d    22.0

 f     0.0

 dtype: float64

 # isnull() returns a Boolean Series that is True for missing values; ~ inverts it to keep the non-missing entries

 c[~c.isnull()]

 c    40.0

 d    22.0

 dtype: float64

 # notnull() returns a Boolean Series that is False for missing values

 c[c.notnull()]

 c    40.0

 d    22.0

 dtype: float64

Custom functions

map(function)

 import pandas as pd

 a=pd.Series([3,4,5,2,21,3])

 a

 0     3

 1     4

 2     5

 3     2

 4    21

 5     3

 dtype: int64

 a.map(lambda x:x+100)

 0    103

 1    104

 2    105

 3    102

 4    121

 5    103

 dtype: int64
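
map also accepts a named function, which helps once the logic no longer fits in a lambda. A minimal sketch; add_hundred is a hypothetical helper written for this example:

```python
import pandas as pd


def add_hundred(x):
    """Return x plus 100; map calls this once per element."""
    return x + 100


s = pd.Series([3, 4, 5, 2, 21, 3])
mapped = s.map(add_hundred)

# For simple arithmetic like this, the vectorized form gives the same values.
vectorized = s + 100
print(mapped.tolist())
print(vectorized.tolist())
```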