Getting Started with Data Analysis: Pandas Library Basics

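When doing data analysis in Python, we often use the Pandas library to process data and convert it into the format we need. Pandas has two data structures for working with data: Series and DataFrame.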

Series

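A Series is an object similar to a one-dimensional array. It has two main attributes, values and index. As with an array, you can access a value through its index. A Series resembles both an array and a Python dict: an array can only be indexed by integer position, whereas the labels in a Series index can be numbers or strings.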

Creating a Series object

The simplest way to create a Series object is from a list:

from pandas import Series, DataFrame  # the two classes used throughout

s1 = Series(['a','b','c','d'])

s1

Out[16]:

0    a

1    b

2    c

3    d

When no index is specified, a default integer index running from 0 to N-1 is generated.
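The default index can be replaced at construction time. The following is a minimal sketch, assuming pandas is installed and imported as pd; the values and labels are made up for illustration.

```python
import pandas as pd

# The index argument replaces the default 0..N-1 labels.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

default_labels = list(pd.Series([10, 20, 30]).index)  # labels when no index is given
custom_labels = list(s.index)                          # labels supplied above
value_at_y = s['y']                                    # look a value up by its label

print(default_labels, custom_labels, value_at_y)
```

The list passed as index must have the same length as the data.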

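Series infers its data type from the elements of the list passed in. If all the elements are integers, the Series has an integer dtype; if any of the numbers is a float, it has a float dtype; if any element is a string, it has the object dtype.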

s1 = Series([1,2,3,4])

s1

Out[23]:

0    1

1    2

2    3

3    4

dtype: int64

s2 = Series([1,2,3,4.0])

s2

Out[25]:

0    1.0

1    2.0

2    3.0

3    4.0

dtype: float64

s3 = Series([1,2,3,'4'])

s3

Out[27]:

0    1

1    2

2    3

3    4

dtype: object
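The inference rule can be checked through the dtype attribute. This is a small sketch, assuming pandas is imported as pd; the variable names are illustrative only.

```python
import pandas as pd

# The three lists differ only in their last element.
ints = pd.Series([1, 2, 3, 4])
floats = pd.Series([1, 2, 3, 4.0])
mixed = pd.Series([1, 2, 3, '4'])

# dtype reports the element type pandas inferred for each Series.
for name, s in [('ints', ints), ('floats', floats), ('mixed', mixed)]:
    print(name, s.dtype)

# In an object Series every element keeps its own Python type.
first_type = type(mixed[0])
last_type = type(mixed[3])
print(first_type, last_type)
```

Because an object Series stores arbitrary Python objects, numeric operations on it are generally slower than on an integer or float Series.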

Besides a list, a Series object can also be created from a dict.

s1 = Series({'a':1,'b':2,'c':3,'d':4})

s1

Out[37]:

a    1

b    2

c    3

d    4

dtype: int64

When a Series is created from a dict, the dict's keys become the Series index and the dict's values become the Series values.

Series also accepts a dict together with a list of index labels:

dict1 = {'a':1,'b':2,'c':3,'d':4}

index1 = ['a','b','e']

s1 = Series(dict1,index=index1)

s1

Out[51]:

a    1.0

b    2.0

e    NaN

dtype: float64

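In the code above, the index of s1 is set to index1, that is, 'a', 'b' and 'e'. The values of s1 are the values in dict1 whose keys match the labels in index1; a label with no matching key gets NaN. For example, 'e' matches no key in dict1, so its value is NaN, while 'a' and 'b' both match and get the values 1 and 2. Because NaN is a floating-point value, the dtype of the Series becomes float64, which is why 1 and 2 are displayed as 1.0 and 2.0.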
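Unmatched labels can be detected or dropped afterwards. A minimal sketch, assuming pandas is imported as pd and reusing the same dict and labels as above:

```python
import pandas as pd

dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
s = pd.Series(dict1, index=['a', 'b', 'e'])

# isna() marks the labels that had no matching key in the dict.
missing = s.isna()
print(missing)

# dropna() returns a new Series containing only the matched labels.
matched = s.dropna()
print(matched)
```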

Accessing Series values by index:

s1 = Series({'a':1,'b':2,'c':3,'d':4})

s1

Out[39]:

a    1

b    2

c    3

d    4

dtype: int64

s1['b']

Out[40]: 2

In the code above, s1['b'] returns the value stored under the index label 'b'.
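A few other common ways to look values up are sketched below, assuming pandas is imported as pd. The .loc and .iloc accessors are standard pandas, but they are not used elsewhere in this article.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

# A list of labels returns a new Series with just those entries.
subset = s[['a', 'c']]

# The `in` operator tests index labels, much as it tests keys for a dict.
has_b = 'b' in s
has_z = 'z' in s

# .loc selects by label, .iloc by integer position.
by_label = s.loc['d']
by_position = s.iloc[0]

print(subset.tolist(), has_b, has_z, by_label, by_position)
```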

Series supports logical and arithmetic operations:

s1 = Series([2,5,-10,200])

s1 * 2

Out[53]:

0      4

1     10

2    -20

3    400

dtype: int64

s1[s1>0]

Out[54]:

0      2

1      5

3    200

dtype: int64

An arithmetic operation on a Series is applied to every element of the Series.
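The element-wise behaviour can be sketched as follows, assuming pandas is imported as pd and numpy as np. The NumPy call is included only to illustrate that array functions behave the same way.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 5, -10, 200])

doubled = s * 2          # multiply every element by 2
shifted = s + 1          # add 1 to every element
magnitudes = np.abs(s)   # NumPy functions are applied element by element too

# Each operation returns a new Series; the original is left unchanged.
print(s.tolist(), doubled.tolist(), shifted.tolist(), magnitudes.tolist())
```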

A logical (comparison) operation on a Series returns a new Series of bool values, one per element; the original Series is not modified.

s1 = Series([2,5,-10,200])

s1

Out[10]:

0      2

1      5

2    -10

3    200

dtype: int64

s1 > 0

Out[11]:

0     True

1     True

2    False

3     True

dtype: bool

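Logical operations on a Series can be used to filter out data that does not meet a condition, for example to drop the negative elements of the example above: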

s1 = Series([2,5,-10,200])

s1[s1 >0]

Out[23]:

0      2

1      5

3    200

dtype: int64
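Several conditions can be combined into one filter. This is a minimal sketch, assuming pandas is imported as pd; the thresholds are arbitrary.

```python
import pandas as pd

s = pd.Series([2, 5, -10, 200])

# Combine boolean Series with & (and), | (or) and ~ (not). The parentheses
# are required because these operators bind more tightly than comparisons.
between = s[(s > 0) & (s < 100)]
outside = s[(s < 0) | (s > 100)]
not_positive = s[~(s > 0)]

print(between.tolist(), outside.tolist(), not_positive.tolist())
```

Python's own and, or and not keywords do not work here, because they expect a single truth value rather than one per element.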

Both a Series object and its index have a name attribute. They can be set as follows:

fruit = {0:'apple',1:'orange',2:'banana'}

fruitSeries = Series(fruit)

fruitSeries.name='Fruit'

fruitSeries

Out[27]:

0     apple

1    orange

2    banana

Name: Fruit, dtype: object

fruitSeries.index.name='Fruit Index'

fruitSeries

Out[29]:

Fruit Index

0     apple

1    orange

2    banana

Name: Fruit, dtype: object

The index of a Series can be changed in place by assigning to its index attribute:

fruitSeries.index=['a','b','c']

fruitSeries

Out[31]:

a     apple

b    orange

c    banana

Name: Fruit, dtype: object
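Two details of relabelling are sketched below, assuming pandas is imported as pd. The rename method is standard pandas but is not used elsewhere in this article, and the variable names are illustrative.

```python
import pandas as pd

fruit_series = pd.Series(['apple', 'orange', 'banana'], name='Fruit')

# Assigning a list of the wrong length to .index raises ValueError.
try:
    fruit_series.index = ['a', 'b']
    length_error = False
except ValueError:
    length_error = True

# rename() returns a relabelled copy and leaves the original untouched.
renamed = fruit_series.rename({0: 'a', 1: 'b', 2: 'c'})

print(length_error, list(fruit_series.index), list(renamed.index))
```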

DataFrame

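A DataFrame is a tabular data structure, much like a table in a relational database: it is made up of rows and columns and has attributes such as column names and an index.

We can think of each column of a DataFrame as a Series like those described above: there are as many Series objects as there are columns, and they all share the same index.

Creating a DataFrame object from a dict: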

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],

'year':[2010,2011,2012,2011,2012],

'sale':[15000,17000,36000,24000,29000]}

frame = DataFrame(data)

frame

Out[12]:

    fruit  year   sale

0   Apple  2010  15000

1   Apple  2011  17000

2  Orange  2012  36000

3  Orange  2011  24000

4  Banana  2012  29000

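When a DataFrame is created this way, the value for each key of the dict must be a list, and all the lists must have the same length; otherwise an error is raised. Here, the lists stored under the keys fruit, year and sale must all have the same length.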

As with a Series, an index is added automatically when a DataFrame is created.
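The length requirement can be seen with a deliberately malformed dict. This is a minimal sketch, assuming pandas is imported as pd; bad_data is a made-up example, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

bad_data = {'fruit': ['Apple', 'Orange', 'Banana'],
            'year': [2010, 2011]}          # one element short

try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length.
    error_message = str(exc)

print(error_message)
```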

The columns argument specifies the order of the columns:

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],

'year':[2010,2011,2012,2011,2012],

'sale':[15000,17000,36000,24000,29000]}

frame = DataFrame(data,columns=['sale','fruit','year','price'])

frame

Out[25]:

    sale   fruit  year price

0  15000   Apple  2010   NaN

1  17000   Apple  2011   NaN

2  36000  Orange  2012   NaN

3  24000  Orange  2011   NaN

4  29000  Banana  2012   NaN

If a requested column is not found in the data, it is filled with NaN values.
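Both effects of the columns argument can be verified programmatically. A small sketch, assuming pandas is imported as pd and using the same data as above:

```python
import pandas as pd

data = {'fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Banana'],
        'year': [2010, 2011, 2012, 2011, 2012],
        'sale': [15000, 17000, 36000, 24000, 29000]}

frame = pd.DataFrame(data, columns=['sale', 'fruit', 'year', 'price'])

# Columns follow the requested order, not the order of the dict.
column_order = list(frame.columns)

# 'price' has no source data, so every entry in it is missing.
price_all_missing = frame['price'].isna().all()

print(column_order, price_all_missing)
```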

The index of a DataFrame can also be set by passing in a list:

frame = DataFrame(data,columns=['sale','fruit','year'],index=[4,3,2,1,0])

frame

Out[22]:

    sale   fruit  year

4  15000   Apple  2010

3  17000   Apple  2011

2  36000  Orange  2012

1  24000  Orange  2011

0  29000  Banana  2012

Passing [4,3,2,1,0] changes the index from the default 0,1,2,3,4 to 4,3,2,1,0. The rows stay in their original order; only their labels change.
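Labelling rows is different from reordering them. The sketch below contrasts the two, assuming pandas is imported as pd; reindex is standard pandas but is not otherwise covered in this article, and the two-row data set is made up.

```python
import pandas as pd

data = {'fruit': ['Apple', 'Banana'], 'sale': [15000, 29000]}

# index= in the constructor attaches labels to the rows in their given order.
labelled = pd.DataFrame(data, index=[1, 0])

# reindex() instead rearranges the existing rows to match the requested labels.
reordered = pd.DataFrame(data).reindex([1, 0])

print(labelled)
print(reordered)
```

In labelled the first row is still Apple, now carrying the label 1; in reordered the first row is Banana, the row that already had the label 1.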

Getting a Series object from a DataFrame:

frame['year']

Out[26]:

0    2010

1    2011

2    2012

3    2011

4    2012

Name: year, dtype: int64

frame['fruit']

Out[27]:

0     Apple

1     Apple

2    Orange

3    Orange

4    Banana

Name: fruit, dtype: object

Both frame['fruit'] and frame.fruit retrieve a column, and each returns a Series object.
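The two spellings are interchangeable only for simple column names. A minimal sketch, assuming pandas is imported as pd; the column names 'unit price' and 'count' are invented to show the edge cases.

```python
import pandas as pd

frame = pd.DataFrame({'fruit': ['Apple', 'Orange'],
                      'unit price': [3.4, 2.4],
                      'count': [10, 20]})

# Both spellings return the same Series for a simple column name.
same = frame['fruit'].equals(frame.fruit)

# Bracket access is needed when the name is not a valid Python identifier...
unit_price = frame['unit price']

# ...or when it collides with an existing DataFrame method such as count().
count_column = frame['count']
count_method = frame.count

print(same, type(count_column).__name__, callable(count_method))
```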

Assigning to a DataFrame means assigning to a column: select a column of the DataFrame, then assign to it to change the column's values:

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],

'year':[2010,2011,2012,2011,2012],

'sale':[15000,17000,36000,24000,29000]}

frame = DataFrame(data,columns=['sale','fruit','year','price'])

frame

Out[24]:

    sale   fruit  year price

0  15000   Apple  2010   NaN

1  17000   Apple  2011   NaN

2  36000  Orange  2012   NaN

3  24000  Orange  2011   NaN

4  29000  Banana  2012   NaN

frame['price'] = 20

frame

Out[26]:

    sale   fruit  year  price

0  15000   Apple  2010     20

1  17000   Apple  2011     20

2  36000  Orange  2012     20

3  24000  Orange  2011     20

4  29000  Banana  2012     20

frame.price = 40

frame

Out[28]:

    sale   fruit  year  price

0  15000   Apple  2010     40

1  17000   Apple  2011     40

2  36000  Orange  2012     40

3  24000  Orange  2011     40

4  29000  Banana  2012     40

import numpy as np  # needed for np.arange

frame.price = np.arange(5)

frame

Out[30]:

    sale   fruit  year  price

0  15000   Apple  2010      0

1  17000   Apple  2011      1

2  36000  Orange  2012      2

3  24000  Orange  2011      3

4  29000  Banana  2012      4

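frame['price'] or frame.price selects the price column, and frame['price'] = 20 or frame.price = 20 sets every value in that column to 20.

A column can also be filled from an array produced by numpy's arange function, as the code above shows.

A Series can be assigned to a DataFrame column: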

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],

'year':[2010,2011,2012,2011,2012],

'sale':[15000,17000,36000,24000,29000]}

frame = DataFrame(data,columns=['sale','fruit','year','price'])

frame

Out[6]:

    sale   fruit  year price

0  15000   Apple  2010   NaN

1  17000   Apple  2011   NaN

2  36000  Orange  2012   NaN

3  24000  Orange  2011   NaN

4  29000  Banana  2012   NaN

priceSeries = Series([3.4,4.2,2.4],index = [1,2,4])

frame.price = priceSeries

frame

Out[9]:

    sale   fruit  year  price

0  15000   Apple  2010    NaN

1  17000   Apple  2011    3.4

2  36000  Orange  2012    4.2

3  24000  Orange  2011    NaN

4  29000  Banana  2012    2.4

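With this kind of assignment, the index of the Series is matched against the index of the DataFrame automatically. Values are placed at the matching index labels, and positions with no match are filled with the missing value NaN.

The result of assigning a Series that was created without an explicit index (its default index 0, 1, 2 matches the first three rows):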

priceSeries = Series([3.4,4.2,2.4])

frame.price = priceSeries

frame

Out[12]:

    sale   fruit  year  price

0  15000   Apple  2010    3.4

1  17000   Apple  2011    4.2

2  36000  Orange  2012    2.4

3  24000  Orange  2011    NaN

4  29000  Banana  2012    NaN
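The matching is done by label, not by position, which becomes visible when the DataFrame has a non-default index. A minimal sketch, assuming pandas is imported as pd; the labels 'x', 'y', 'z' and the rank column are invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'fruit': ['Apple', 'Orange', 'Banana']},
                     index=['x', 'y', 'z'])

# Labels are matched, so the order of the values in the Series does not matter.
frame['price'] = pd.Series([2.4, 3.4], index=['z', 'x'])

# A Series with the default 0..N-1 index shares no labels with this frame,
# so every value in the new column is missing.
frame['rank'] = pd.Series([1, 2, 3])

print(frame)
```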

A DataFrame column can also be assigned a list or an array, provided its length equals the number of rows in the DataFrame:

priceList=[3.4,2.4,4.6,3.8,7.3]

frame.price=priceList

frame

Out[15]:

    sale   fruit  year  price

0  15000   Apple  2010    3.4

1  17000   Apple  2011    2.4

2  36000  Orange  2012    4.6

3  24000  Orange  2011    3.8

4  29000  Banana  2012    7.3
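What happens when the lengths differ can be sketched as follows, assuming pandas is imported as pd; the three-row frame is a shortened stand-in for the one above.

```python
import pandas as pd

frame = pd.DataFrame({'fruit': ['Apple', 'Orange', 'Banana']})

try:
    frame['price'] = [3.4, 2.4]      # two values for three rows
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# A list of matching length is assigned by position, top to bottom.
frame['price'] = [3.4, 2.4, 4.6]

print(mismatch_rejected)
print(frame)
```

Unlike a Series, a plain list carries no index, so there is nothing to align on and pandas rejects the assignment instead of filling in NaN.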

Assigning to a column that does not exist creates a new column:

frame['total'] = 30000

frame

Out[45]:

    sale   fruit  year  price  total

0  15000   Apple  2010    3.4  30000

1  17000   Apple  2011    2.4  30000

2  36000  Orange  2012    4.6  30000

3  24000  Orange  2011    3.8  30000

4  29000  Banana  2012    7.3  30000

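The example above adds a new column, total, by assigning to a column that does not yet exist. Use the frame['total'] form for this, not frame.total. Assigning through frame.total when no such column exists does not create a column: the value is stored as an ordinary attribute of the DataFrame object, so it can still be read back as frame.total but it does not appear when the DataFrame is printed. The code below assigns to total first with frame.total and then with frame['total'], starting from a frame that has no total column: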

frame

Out[60]:

    sale   fruit  year  price

0  15000   Apple  2010    3.4

1  17000   Apple  2011    2.4

2  36000  Orange  2012    4.6

3  24000  Orange  2011    3.8

4  29000  Banana  2012    7.3

frame.total = 20

frame

Out[62]:

    sale   fruit  year  price

0  15000   Apple  2010    3.4

1  17000   Apple  2011    2.4

2  36000  Orange  2012    4.6

3  24000  Orange  2011    3.8

4  29000  Banana  2012    7.3

frame['total'] = 20

frame

Out[64]:

    sale   fruit  year  price  total

0  15000   Apple  2010    3.4     20

1  17000   Apple  2011    2.4     20

2  36000  Orange  2012    4.6     20

3  24000  Orange  2011    3.8     20

4  29000  Banana  2012    7.3     20

After the frame.total assignment the total column does not appear, whereas after the frame['total'] assignment it does.
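The difference can also be checked without printing the frame, by inspecting frame.columns. This is a minimal sketch, assuming pandas is imported as pd; pandas may emit a UserWarning for this pattern in some cases, so warnings are silenced here for the demonstration.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'sale': [15000, 17000]})

# Attribute assignment on a name that is not yet a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.total = 20

attribute_is_column = 'total' in frame.columns   # no column was created
attribute_value = frame.total                    # the value lives on the object

# Bracket assignment creates a real column.
frame['total'] = 20
bracket_is_column = 'total' in frame.columns

print(attribute_is_column, attribute_value, bracket_is_column)
```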