import numpy as np
import pandas as pd

Overview

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.

Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter, combined with reshape operations that use hierarchical indexing.
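
As a minimal sketch of that relationship, assuming a small made-up DataFrame (the toy values below are invented and are not the tips data), a pivot table of means can be rebuilt by hand from groupby followed by unstack:

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Fri"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 4.0],
})

# pivot_table: mean tip, with day on the rows and smoker on the columns.
pivoted = df.pivot_table(values="tip", index="day", columns="smoker")

# The same table built by hand: group, aggregate, then reshape with unstack.
by_hand = df.groupby(["day", "smoker"])["tip"].mean().unstack("smoker")

print(pivoted)
print(by_hand)
```

Both printed tables should contain the same group means; pivot_table simply packages the group, aggregate, and reshape steps into one call.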

DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.

Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows:

tips = pd.read_csv('../examples/tips.csv')

# Add a new column, tip_pct: the tip as a fraction of the total bill

tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips[:6]

   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240
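# The default aggregation is mean. The numeric columns are listed explicitly
# because recent pandas versions may raise a TypeError when asked to average
# non-numeric columns such as time.
tips.pivot_table(values=['size', 'tip', 'tip_pct', 'total_bill'],
                 index=['day', 'smoker'])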

                 size       tip   tip_pct  total_bill
day  smoker
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588

This could have been produced with groupby directly. Now, suppose we want to aggregate only tip_pct and size, and additionally group by time. I'll put smoker in the table columns and time and day in the rows:

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker')

                 size             tip_pct
smoker             No       Yes        No       Yes
time   day
Dinner Fri   2.000000  2.222222  0.139622  0.165347
       Sat   2.555556  2.476190  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113  0.187250
       Thur  2.000000       NaN  0.159744       NaN
Lunch  Fri   3.000000  1.833333  0.187735  0.188937
       Thur  2.500000  2.352941  0.160311  0.163863

We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, with corresponding values being the group statistics for all the data within a single tier:

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker', margins=True)

                 size                       tip_pct
smoker             No       Yes       All        No       Yes       All
time   day
Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803

Here, the All values are means computed without taking into account smoker versus non-smoker (the All columns) or either of the two levels of grouping on the rows (the All row).
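
To make the meaning of the All labels concrete, here is a small sketch on invented data (the DataFrame below is hypothetical, not the tips data). It suggests that each margin is the aggregation applied to the underlying rows, not an average of the already-averaged cells:

```python
import pandas as pd

# Hypothetical toy data; Sat has two non-smoker rows and no smoker rows.
df = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No"],
    "tip": [1.0, 3.0, 2.0, 6.0],
})

table = df.pivot_table(values="tip", index="day", columns="smoker",
                       margins=True)

# "All" column: mean per day, ignoring smoker.
per_day = df.groupby("day")["tip"].mean()
# "All" row: mean per smoker status, ignoring day.
per_smoker = df.groupby("smoker")["tip"].mean()
# Bottom-right corner: mean over every row of the data.
overall = df["tip"].mean()

print(table)
print(per_day, per_smoker, overall, sep="\n")
```

In this toy table the All value for non-smokers is the mean of the three raw tips (1, 2, and 6), which differs from the mean of the two cell means (1 and 4).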

To use a different aggregation function, pass it to aggfunc. For example, count or len will give you a cross-tabulation of group sizes:

tips.pivot_table('tip_pct', index=['time', 'smoker'],
                 columns='day', aggfunc=len, margins=True)

day             Fri   Sat   Sun  Thur    All
time   smoker
Dinner No       3.0  45.0  57.0   1.0  106.0
       Yes      9.0  42.0  19.0   NaN   70.0
Lunch  No       1.0   NaN   NaN  44.0   45.0
       Yes      6.0   NaN   NaN  17.0   23.0
All            19.0  87.0  76.0  62.0  244.0

If some combinations are empty (or otherwise NA), you may wish to pass a fill_value:

tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                 columns='day', aggfunc='mean', fill_value=0)

day                      Fri       Sat       Sun      Thur
time   size smoker
Dinner 1    No      0.000000  0.137931  0.000000  0.000000
            Yes     0.000000  0.325733  0.000000  0.000000
       2    No      0.139622  0.162705  0.168859  0.159744
            Yes     0.171297  0.148668  0.207893  0.000000
       3    No      0.000000  0.154661  0.152663  0.000000
            Yes     0.000000  0.144995  0.152660  0.000000
       4    No      0.000000  0.150096  0.148143  0.000000
            Yes     0.117750  0.124515  0.193370  0.000000
       5    No      0.000000  0.000000  0.206928  0.000000
            Yes     0.000000  0.106572  0.065660  0.000000
       6    No      0.000000  0.000000  0.103799  0.000000
Lunch  1    No      0.000000  0.000000  0.000000  0.181728
            Yes     0.223776  0.000000  0.000000  0.000000
       2    No      0.000000  0.000000  0.000000  0.166005
            Yes     0.181969  0.000000  0.000000  0.158843
       3    No      0.187735  0.000000  0.000000  0.084246
            Yes     0.000000  0.000000  0.000000  0.204952
       4    No      0.000000  0.000000  0.000000  0.138919
            Yes     0.000000  0.000000  0.000000  0.155410
       5    No      0.000000  0.000000  0.000000  0.121389
       6    No      0.000000  0.000000  0.000000  0.173706
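
The effect of fill_value is easiest to see side by side. The following sketch uses a tiny invented DataFrame (hypothetical values, not the tips data) in which one time/day combination has no rows:

```python
import pandas as pd

# Hypothetical toy data: there is no Lunch row for Sat.
df = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch"],
    "day": ["Fri", "Sat", "Fri"],
    "tip_pct": [0.15, 0.20, 0.10],
})

# Without fill_value, the empty combination shows up as NaN.
with_gaps = df.pivot_table(values="tip_pct", index="time", columns="day",
                           aggfunc="mean")

# With fill_value=0, the same cell is replaced by 0.
filled = df.pivot_table(values="tip_pct", index="time", columns="day",
                        aggfunc="mean", fill_value=0)

print(with_gaps)
print(filled)
```

Keep in mind that a filled 0 here means "no observations", not "a mean of zero", so a fill value is best chosen with the downstream analysis in mind.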

See Table 10-2 for a summary of pivot_table options.

Argument    Description
values      Column name or names to aggregate; by default, aggregates all numeric columns
index       Column names or other group keys to group on the rows of the resulting pivot table
columns     Column names or other group keys to group on the columns of the resulting pivot table
aggfunc     Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context
fill_value  Replace missing values in the result table
dropna      If True, do not include columns whose entries are all NA
margins     Add row/column subtotals and grand total (False by default)
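
As an illustration of the "list of functions" form of aggfunc, here is a short sketch on invented data (the DataFrame is hypothetical). Passing a list should produce one block of columns per function:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat"],
    "tip": [1.0, 3.0, 2.0],
})

# One column block per aggregation; the result has two-level column labels
# of the form (function name, value column).
summary = df.pivot_table(values="tip", index="day", aggfunc=["mean", "count"])

print(summary)
print(summary[("mean", "tip")])
```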

Cross-Tabulations: Crosstab

  • A crosstab is a special case of a pivot table: effectively just aggfunc=count

    A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.

As part of some survey analysis, we might want to summarize a dataset by nationality and handedness. Suppose data is a DataFrame of survey responses with Nationality and Handedness columns. You could use pivot_table to do this, but the pandas.crosstab function can be more convenient:

pd.crosstab(data.Nationality, data.Handedness, margins=True)
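
Because data is not defined in the snippet above, here is a self-contained sketch with an invented survey table. The column names follow the call above, but the rows are made up purely for illustration:

```python
import pandas as pd

# Hypothetical survey responses; the values are invented for illustration.
data = pd.DataFrame({
    "Sample": range(1, 7),
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan", "USA"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Left-handed"],
})

# Frequency of each (Nationality, Handedness) pair, plus All totals.
counts = pd.crosstab(data["Nationality"], data["Handedness"], margins=True)

print(counts)
```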

The first two arguments to crosstab can each be an array, a Series, or a list of arrays. For example, with the tips data:

# Count smokers and non-smokers for each (time, day) combination
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
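
To tie crosstab back to pivot_table, the sketch below (on a hypothetical toy DataFrame, not the tips data) computes the same frequency table both ways. The pivot_table version needs some value column to count and a fill_value to turn empty combinations into 0:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# crosstab counts the rows for each (day, smoker) pair directly.
freq = pd.crosstab(df["day"], df["smoker"])

# The pivot_table equivalent counts non-missing tip values per pair.
via_pivot = df.pivot_table(values="tip", index="day", columns="smoker",
                           aggfunc="count", fill_value=0)

print(freq)
print(via_pivot)
```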

Summary

Mastering pandas's data grouping tools can help with data cleaning as well as with modeling and statistical analysis work.
