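Introduction

The pivot method in Pandas reshapes a DataFrame (DF). This article explains pivot and the related reshaping methods in detail.

Using pivot

pivot reshapes a DF: it rebuilds the existing DF from the index, columns, and values that you specify.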

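As a first example, take a DF with columns foo, bar, and zoo. After pivoting, the new DF uses the values in foo as the index, the values in bar as the columns, and the values in zoo as the corresponding cell values.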
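The description above can be sketched in a few lines. Only the column names foo, bar, and zoo come from the text; the data values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the article.
df = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'zoo': ['x', 'y', 'z', 'q', 'w', 't'],
})

# foo -> row index, bar -> column labels, zoo -> cell values
pivoted = df.pivot(index='foo', columns='bar', values='zoo')
print(pivoted)
```

Each (foo, bar) pair occurs exactly once here, which is what pivot expects.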

Next, an example with dates:

In [1]: df

Out[1]:

         date variable     value

0  2000-01-03        A  0.469112

1  2000-01-04        A -0.282863

2  2000-01-05        A -1.509059

3  2000-01-03        B -1.135632

4  2000-01-04        B  1.212112

5  2000-01-05        B -0.173215

6  2000-01-03        C  0.119209

7  2000-01-04        C -1.044236

8  2000-01-05        C -0.861849

9  2000-01-03        D -2.104569

10 2000-01-04        D -0.494929

11 2000-01-05        D  1.071804

In [3]: df.pivot(index='date', columns='variable', values='value')

Out[3]:

variable           A         B         C         D

date

2000-01-03  0.469112 -1.135632  0.119209 -2.104569

2000-01-04 -0.282863  1.212112 -1.044236 -0.494929

2000-01-05 -1.509059 -0.173215 -0.861849  1.071804

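If values is omitted and more than one value column remains, every remaining column is pivoted and the result has hierarchical columns, with the original column name as the top level: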

In [4]: df['value2'] = df['value'] * 2

In [5]: pivoted = df.pivot(index='date', columns='variable')

In [6]: pivoted

Out[6]:

               value                                  value2

variable           A         B         C         D         A         B         C         D

date

2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265  0.238417 -4.209138

2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224 -2.088472 -0.989859

2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429 -1.723698  2.143608

Selecting value2 returns the corresponding subset:

In [7]: pivoted['value2']

Out[7]:

variable           A         B         C         D

date

2000-01-03  0.938225 -2.271265  0.238417 -4.209138

2000-01-04 -0.565727  2.424224 -2.088472 -0.989859

2000-01-05 -3.018117 -0.346429 -1.723698  2.143608
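A small reproducible sketch of the same idea, with made-up values instead of the random data above. It also shows, as an assumption about what readers may want next, how to select by the inner column level with xs.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-03', '2000-01-04'] * 2),
    'variable': ['A', 'A', 'B', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0],
})
df['value2'] = df['value'] * 2

# With values omitted, the result has two column levels:
# (original column name, variable).
pivoted = df.pivot(index='date', columns='variable')
print(pivoted.columns.nlevels)

# Select one outer label ...
outer = pivoted['value2']
print(outer)

# ... or one inner label across all outer labels.
inner = pivoted.xs('A', axis=1, level='variable')
print(inner)
```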

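Using stack

stack reshapes a DF by moving its column labels into a new innermost index level.

For example, stacking a DF with columns A and B turns the labels A and B into the innermost level of the index.

unstack is the inverse of stack: it moves the innermost index level back into the columns.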
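The inverse relationship can be checked on a tiny DF. This is a minimal sketch with made-up data; the variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# stack: column labels A and B become the innermost index level,
# so the result is a Series indexed by (row label, column label).
stacked = df.stack()

# unstack: the innermost index level moves back into the columns.
restored = stacked.unstack()

print(stacked)
print(restored)
```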

A concrete example:

In [8]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',

   ...:                      'foo', 'foo', 'qux', 'qux'],

   ...:                     ['one', 'two', 'one', 'two',

   ...:                      'one', 'two', 'one', 'two']]))

   ...: 

In [9]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [11]: df2 = df[:4]

In [12]: df2

Out[12]:

                     A         B

first second

bar   one     0.721555 -0.706771

      two    -1.039575  0.271860

baz   one    -0.424972  0.567020

      two     0.276232 -1.087401

In [13]: stacked = df2.stack()

In [14]: stacked

Out[14]:

first  second

bar    one     A    0.721555

               B   -0.706771

       two     A   -1.039575

               B    0.271860

baz    one     A   -0.424972

               B    0.567020

       two     A    0.276232

               B   -1.087401

dtype: float64

By default, unstack operates on the last (innermost) index level; you can also pass a specific level:

In [15]: stacked.unstack()

Out[15]:

                     A         B

first second

bar   one     0.721555 -0.706771

      two    -1.039575  0.271860

baz   one    -0.424972  0.567020

      two     0.276232 -1.087401

In [16]: stacked.unstack(1)

Out[16]:

second        one       two

first

bar   A  0.721555 -1.039575

      B -0.706771  0.271860

baz   A -0.424972  0.276232

      B  0.567020 -1.087401

In [17]: stacked.unstack(0)

Out[17]:

first          bar       baz

second

one    A  0.721555 -0.424972

       B -0.706771  0.567020

two    A -1.039575  0.276232

       B  0.271860 -1.087401
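Levels can be given by name as well as by position. The sketch below uses made-up integer data in place of the random values and compares the two spellings.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=index)
stacked = df.stack()

# Position 1 and the name 'second' refer to the same index level.
by_position = stacked.unstack(1)
by_name = stacked.unstack('second')

print(by_name)
print(by_position.equals(by_name))
```

Using names tends to be more robust when the order of the index levels may change.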

By default, stack moves only one level (the innermost column level); you can also pass several levels at once:

In [23]: columns = pd.MultiIndex.from_tuples([

   ....:     ('A', 'cat', 'long'), ('B', 'cat', 'long'),

   ....:     ('A', 'dog', 'short'), ('B', 'dog', 'short')],

   ....:     names=['exp', 'animal', 'hair_length']

   ....: )

   ....: 

In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)

In [25]: df

Out[25]:

exp                 A         B         A         B

animal            cat       cat       dog       dog

hair_length      long      long     short     short

0            1.075770 -0.109050  1.643563 -1.469388

1            0.357021 -0.674600 -1.776904 -0.968914

2           -1.294524  0.413738  0.276662 -0.472035

3           -0.013960 -0.362543 -0.006154 -0.923061

In [26]: df.stack(level=['animal', 'hair_length'])

Out[26]:

exp                          A         B

  animal hair_length

0 cat    long         1.075770 -0.109050

  dog    short        1.643563 -1.469388

1 cat    long         0.357021 -0.674600

  dog    short       -1.776904 -0.968914

2 cat    long        -1.294524  0.413738

  dog    short        0.276662 -0.472035

3 cat    long        -0.013960 -0.362543

  dog    short       -0.006154 -0.923061

The call above is equivalent to:

In [27]: df.stack(level=[1, 2])
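The equivalence can be verified directly. This sketch replaces the random data with np.arange so the result is reproducible; recent pandas 2.x releases may print a FutureWarning about the stack implementation, which does not affect the comparison.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('A', 'cat', 'long'), ('B', 'cat', 'long'),
     ('A', 'dog', 'short'), ('B', 'dog', 'short')],
    names=['exp', 'animal', 'hair_length'])
df = pd.DataFrame(np.arange(16.0).reshape(4, 4), columns=columns)

# Column levels addressed by name and by position.
by_name = df.stack(level=['animal', 'hair_length'])
by_position = df.stack(level=[1, 2])

print(by_name.equals(by_position))
```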

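Using melt

melt keeps the columns you designate as identifier variables (id_vars) and unpivots all other columns into rows, placing them in two new columns: variable and value.

In the example below, first and last are the identifier columns and stay unchanged, while height and weight are converted into row data: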

In [41]: cheese = pd.DataFrame({'first': ['John', 'Mary'],

   ....:                        'last': ['Doe', 'Bo'],

   ....:                        'height': [5.5, 6.0],

   ....:                        'weight': [130, 150]})

   ....: 

In [42]: cheese

Out[42]:

  first last  height  weight

0  John  Doe     5.5     130

1  Mary   Bo     6.0     150

In [43]: cheese.melt(id_vars=['first', 'last'])

Out[43]:

  first last variable  value

0  John  Doe   height    5.5

1  Mary   Bo   height    6.0

2  John  Doe   weight  130.0

3  Mary   Bo   weight  150.0

In [44]: cheese.melt(id_vars=['first', 'last'], var_name='quantity')

Out[44]:

  first last quantity  value

0  John  Doe   height    5.5

1  Mary   Bo   height    6.0

2  John  Doe   weight  130.0

3  Mary   Bo   weight  150.0
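melt also accepts value_vars, to unpivot only some columns, and value_name, to rename the value column. A minimal sketch on the same cheese data; the names quantity and measurement are arbitrary choices for this illustration.

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

# Unpivot only `height`; `weight` does not appear in the result.
melted = cheese.melt(id_vars=['first', 'last'],
                     value_vars=['height'],
                     var_name='quantity',
                     value_name='measurement')
print(melted)
```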

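Using pivot tables

pivot only reshapes a DF. Pandas also provides pivot_table(), which reshapes the data and aggregates the values at the same time.

pivot_table() takes the following parameters:

data: a DF object

values: one or more columns to aggregate

index: the keys to group by along the rows of the result

columns: the keys to group by along the columns of the result

aggfunc: the aggregation function (the mean by default)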
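The practical difference between pivot and pivot_table shows up when an (index, columns) pair occurs more than once. The sketch below uses made-up data with hypothetical column names row, col, and val: pivot refuses to reshape it, while pivot_table aggregates the duplicates.

```python
import pandas as pd

# Made-up data: the pair (row='x', col='a') appears twice.
df = pd.DataFrame({'row': ['x', 'x', 'y'],
                   'col': ['a', 'a', 'b'],
                   'val': [1.0, 3.0, 5.0]})

pivot_failed = False
try:
    df.pivot(index='row', columns='col', values='val')
except ValueError as exc:
    # pivot cannot decide which of the duplicate values to keep.
    pivot_failed = True
    print('pivot failed:', exc)

# pivot_table aggregates the duplicates instead (here with the mean).
table = pd.pivot_table(df, values='val', index='row', columns='col',
                       aggfunc='mean')
print(table)
```

Combinations that never occur in the data, such as ('x', 'b') here, show up as NaN.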

First, create a df:

In [59]: import datetime

In [60]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,

   ....:                    'B': ['A', 'B', 'C'] * 8,

   ....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,

   ....:                    'D': np.random.randn(24),

   ....:                    'E': np.random.randn(24),

   ....:                    'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)]

   ....:                    + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})

   ....: 

In [61]: df

Out[61]:

        A  B    C         D         E          F

0     one  A  foo  0.341734 -0.317441 2013-01-01

1     one  B  foo  0.959726 -1.236269 2013-02-01

2     two  C  foo -1.110336  0.896171 2013-03-01

3   three  A  bar -0.619976 -0.487602 2013-04-01

4     one  B  bar  0.149748 -0.082240 2013-05-01

..    ... ..  ...       ...       ...        ...

19  three  B  foo  0.690579 -2.213588 2013-08-15

20    one  C  foo  0.995761  1.063327 2013-09-15

21    one  A  bar  2.396780  1.266143 2013-10-15

22    two  B  bar  0.014871  0.299368 2013-11-15

23  three  C  bar  3.357427 -0.863838 2013-12-15

[24 rows x 6 columns]

Here are a few aggregation examples:

In [62]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Out[62]:

C             bar       foo

A     B

one   A  1.120915 -0.514058

      B -0.338421  0.002759

      C -0.538846  0.699535

three A -1.181568       NaN

      B       NaN  0.433512

      C  0.588783       NaN

two   A       NaN  1.000985

      B  0.158248       NaN

      C       NaN  0.176180

In [63]: pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)

Out[63]:

A       one               three                 two

C       bar       foo       bar       foo       bar       foo

B

A  2.241830 -1.028115 -2.363137       NaN       NaN  2.001971

B -0.676843  0.005518       NaN  0.867024  0.316495       NaN

C -1.077692  1.399070  1.177566       NaN       NaN  0.352360

In [64]: pd.pivot_table(df, values=['D', 'E'], index=['B'], columns=['A', 'C'],

   ....:                aggfunc=np.sum)

   ....:

Out[64]:

          D                                                           E

A       one               three                 two                 one               three                 two

C       bar       foo       bar       foo       bar       foo       bar       foo       bar       foo       bar       foo

B

A  2.241830 -1.028115 -2.363137       NaN       NaN  2.001971  2.786113 -0.043211  1.922577       NaN       NaN  0.128491

B -0.676843  0.005518       NaN  0.867024  0.316495       NaN  1.368280 -1.103384       NaN -2.128743 -0.194294       NaN

C -1.077692  1.399070  1.177566       NaN       NaN  0.352360 -1.976883  1.495717 -0.263660       NaN       NaN  0.872482

Passing margins=True adds an All column and an All row, which hold the aggregate across all columns and all rows respectively:

In [69]: df.pivot_table(index=['A', 'B'], columns='C', margins=True, aggfunc=np.std)

Out[69]:

                D                             E

C             bar       foo       All       bar       foo       All

A     B

one   A  1.804346  1.210272  1.569879  0.179483  0.418374  0.858005

      B  0.690376  1.353355  0.898998  1.083825  0.968138  1.101401

      C  0.273641  0.418926  0.771139  1.689271  0.446140  1.422136

three A  0.794212       NaN  0.794212  2.049040       NaN  2.049040

      B       NaN  0.363548  0.363548       NaN  1.625237  1.625237

      C  3.915454       NaN  3.915454  1.035215       NaN  1.035215

two   A       NaN  0.442998  0.442998       NaN  0.447104  0.447104

      B  0.202765       NaN  0.202765  0.560757       NaN  0.560757

      C       NaN  1.819408  1.819408       NaN  0.650439  0.650439

All      1.556686  0.952552  1.246608  1.250924  0.899904  1.059389
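The margins are easier to read with a sum on a small table. A minimal sketch with made-up data that reuses the column names A, C, and D:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['bar', 'foo', 'bar', 'foo'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# The All column sums each row; the All row sums each column.
table = df.pivot_table(values='D', index='A', columns='C',
                       aggfunc='sum', margins=True)
print(table)
```

The label of the margin row and column can be changed with the margins_name parameter.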

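Using crosstab

crosstab computes a cross-tabulation of two or more factors. By default it counts how often each combination of values occurs.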

In [70]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'

In [71]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)

In [72]: b = np.array([one, one, two, one, two, one], dtype=object)

In [73]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)

In [74]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])

Out[74]:

b    one        two

c   dull shiny dull shiny

a

bar    1     0    0     1

foo    2     1    1     0

crosstab also accepts two Series:

In [75]: df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],

   ....:                    'C': [1, 1, np.nan, 1, 1]})

   ....: 

In [76]: df

Out[76]:

   A  B    C

0  1  3  1.0

1  2  3  1.0

2  2  4  NaN

3  2  4  1.0

4  2  4  1.0

In [77]: pd.crosstab(df['A'], df['B'])

Out[77]:

B  3  4

A

1  1  0

2  1  3
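One way to read this result is as a groupby count reshaped into a table. The sketch below is an illustration of that reading, not a description of how crosstab is implemented internally.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4]})

counts = pd.crosstab(df['A'], df['B'])

# Same counts by hand: count rows per (A, B) pair, then move B to columns.
manual = df.groupby(['A', 'B']).size().unstack(fill_value=0)

print(counts)
print((counts.values == manual.values).all())
```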

Passing normalize=True returns each count as a proportion of the grand total:

In [82]: pd.crosstab(df['A'], df['B'], normalize=True)

Out[82]:

B    3    4

A

1  0.2  0.0

2  0.2  0.6

You can also normalize over each row (normalize='index') or each column (normalize='columns'):

In [83]: pd.crosstab(df['A'], df['B'], normalize='columns')

Out[83]:

B    3    4

A

1  0.5  0.0

2  0.5  1.0
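For completeness, a sketch of the row-wise variant on the same A and B data. With normalize='index', each row sums to 1.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4]})

# Divide each count by its row total.
by_row = pd.crosstab(df['A'], df['B'], normalize='index')
print(by_row)
```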

Passing values together with aggfunc aggregates a third Series within each cell instead of counting:

In [84]: pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum)

Out[84]:

B    3    4

A

1  1.0  NaN

2  1.0  2.0

Using get_dummies

get_dummies converts a column of a DF that has k distinct values into k indicator columns of 0s and 1s:

df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})

df

Out[9]:

   data1 key

0      0   b

1      1   b

2      2   a

3      3   c

4      4   a

5      5   b

pd.get_dummies(df['key'])

Out[10]:

   a  b  c

0  0  1  0

1  0  1  0

2  1  0  0

3  0  0  1

4  1  0  0

5  0  1  0
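A common follow-up is to prefix the indicator columns and join them back to the remaining data. A minimal sketch; dtype=int is passed explicitly because newer pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})

# prefix='key' names the new columns key_a, key_b, key_c.
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Attach the indicator columns to the rest of the data.
result = df[['data1']].join(dummies)
print(result)
```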

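get_dummies can be combined with cut to indicate which range (bin) each element falls into, which makes it easy to count the elements in each range: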

In [95]: values = np.random.randn(10)

In [96]: values

Out[96]:

array([ 0.4082, -1.0481, -0.0257, -0.9884,  0.0941,  1.2627,  1.29  ,

        0.0824, -0.0558,  0.5366])

In [97]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [98]: pd.get_dummies(pd.cut(values, bins))

Out[98]:

   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]

0           0           0           1           0           0

1           0           0           0           0           0

2           0           0           0           0           0

3           0           0           0           0           0

4           1           0           0           0           0

5           0           0           0           0           0

6           0           0           0           0           0

7           1           0           0           0           0

8           0           0           0           0           0

9           0           0           1           0           0
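Summing the indicator columns gives the count per bin. The sketch below reuses the values printed above, hard-coded so that the result is reproducible.

```python
import numpy as np
import pandas as pd

values = np.array([0.4082, -1.0481, -0.0257, -0.9884, 0.0941,
                   1.2627, 1.29, 0.0824, -0.0558, 0.5366])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

dummies = pd.get_dummies(pd.cut(values, bins), dtype=int)

# Values outside (0, 1] fall into no bin and produce an all-zero row,
# so they are not counted anywhere.
per_bin = dummies.sum()
print(per_bin)
```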

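get_dummies also accepts a whole DF. By default it encodes the string columns and prefixes each new column with the original column name: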

In [99]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],

   ....:                    'C': [1, 2, 3]})

   ....: 

In [100]: pd.get_dummies(df)

Out[100]:

   C  A_a  A_b  B_b  B_c

0  1    1    0    0    1

1  2    0    1    0    1

2  3    1    0    1    0
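To encode only some of the columns, pass them through the columns parameter. A minimal sketch on the same DF; column B is left untouched.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Only column A is replaced by indicator columns.
encoded = pd.get_dummies(df, columns=['A'], dtype=int)
print(encoded)
```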

This article is also available at http://www.flydean.com/05-python-pandas-reshaping-pivot/
