pandas 之交叉表-透视表

import numpy as np

import pandas as pd

认识

A pivot table is a data summarization tool(数据汇总工具) frequently found in spreadsheet programs and other data analysis software(广泛应用于数据分析中). It aggregates a table of data by one or more keys, arranging the data in a rectangle(矩形) with some of the group keys along the rows and some along the columns.

Pivot tables in Python with pandas are made possible through the groupby facility(促进) described in this chapter combined with reshape operations utilizing hierarchical indexing.

DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals , also known as margins.

Returning to the tipping dataset, suppose you wanted to compute a table of group means(the default pivot_table aggregation type) arranged by day and smoker on the rows: (对分组计算组内平均)

tips = pd.read_csv('../examples/tips.csv')

"新增一列 tip_pct"

tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips[:6]

'新增一列 tip_pct'

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

	total_bill	tip	smoker	day	time	size	tip_pct
0	16.99	1.01	No	Sun	Dinner	2	0.059447
1	10.34	1.66	No	Sun	Dinner	3	0.160542
2	21.01	3.50	No	Sun	Dinner	3	0.166587
3	23.68	3.31	No	Sun	Dinner	2	0.139780
4	24.59	3.61	No	Sun	Dinner	4	0.146808
5	25.29	4.71	No	Sun	Dinner	4	0.186240

"默认的aggregation 是 mean"

tips.pivot_table(index=['day', 'smoker'])

'默认的aggregation 是 mean'

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

		size	tip	tip_pct	total_bill
day	smoker
Fri	No	2.250000	2.812500	0.151650	18.420000
Fri	Yes	2.066667	2.714000	0.174783	16.813333
Sat	No	2.555556	3.102889	0.158048	19.661778
Sat	Yes	2.476190	2.875476	0.147906	21.276667
Sun	No	2.929825	3.167895	0.160113	20.506667
Sun	Yes	2.578947	3.516842	0.187250	24.120000
Thur	No	2.488889	2.673778	0.160298	17.113111
Thur	Yes	2.352941	3.030000	0.163863	19.190588

This could have been produced with groupby directly. Now, suppose we want to aggregate only tip_pct and size, and additionally group by time. I'll put smoker in the table columns and day in the rows:

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],

                 columns='smoker')

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead tr th {

    text-align: left;

}

.dataframe thead tr:last-of-type th {

    text-align: right;

}

		size		tip_pct
	smoker	No	Yes	No	Yes
time	day
Dinner	Fri	2.000000	2.222222	0.139622	0.165347
	Sat	2.555556	2.476190	0.158048	0.147906
	Sun	2.929825	2.578947	0.160113	0.187250
	Thur	2.000000	NaN	0.159744	NaN
Lunch	Fri	3.000000	1.833333	0.187735	0.188937
Lunch	Thur	2.500000	2.352941	0.160311	0.163863

We could augment this table to include partial totals by passing margins=True. This has the effect of adding all row and column labels, with corresponding values being the group statistics for all the data within a single tier:

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],

                columns='smoker', margins=True)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead tr th {

    text-align: left;

}

.dataframe thead tr:last-of-type th {

    text-align: right;

}

		size			tip_pct
	smoker	No	Yes	All	No	Yes	All
time	day
Dinner	Fri	2.000000	2.222222	2.166667	0.139622	0.165347	0.158916
	Sat	2.555556	2.476190	2.517241	0.158048	0.147906	0.153152
	Sun	2.929825	2.578947	2.842105	0.160113	0.187250	0.166897
	Thur	2.000000	NaN	2.000000	0.159744	NaN	0.159744
Lunch	Fri	3.000000	1.833333	2.000000	0.187735	0.188937	0.188765
Lunch	Thur	2.500000	2.352941	2.459016	0.160311	0.163863	0.161301
All		2.668874	2.408602	2.569672	0.159328	0.163196	0.160803

Here, the All values are means without taking into account smoker versus non-smoker or any of the two levels of grouping on the rows.

To use a different aggregation function, pass it to aggfunc. For example, count or len will give you a cross-tabulation of group sizes:

tips.pivot_table('tip_pct', index=['time', 'smoker'],

                columns='day', aggfunc=len, margins=True)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

	day	Fri	Sat	Sun	Thur	All
time	smoker
Dinner	No	3.0	45.0	57.0	1.0	106.0
Dinner	Yes	9.0	42.0	19.0	NaN	70.0
Lunch	No	1.0	NaN	NaN	44.0	45.0
Lunch	Yes	6.0	NaN	NaN	17.0	23.0
All		19.0	87.0	76.0	62.0	244.0

If some combinations are empty, you may wish to pass a fill_value

tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],

                columns='day', aggfunc='mean', fill_value=0)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

		day	Fri	Sat	Sun	Thur
time	size	smoker
Dinner	1	No	0.000000	0.137931	0.000000	0.000000
	1	Yes	0.000000	0.325733	0.000000	0.000000
	2	No	0.139622	0.162705	0.168859	0.159744
	2	Yes	0.171297	0.148668	0.207893	0.000000
	3	No	0.000000	0.154661	0.152663	0.000000
	3	Yes	0.000000	0.144995	0.152660	0.000000
	4	No	0.000000	0.150096	0.148143	0.000000
	4	Yes	0.117750	0.124515	0.193370	0.000000
	5	No	0.000000	0.000000	0.206928	0.000000
	5	Yes	0.000000	0.106572	0.065660	0.000000
	6	No	0.000000	0.000000	0.103799	0.000000
Lunch	1	No	0.000000	0.000000	0.000000	0.181728
	1	Yes	0.223776	0.000000	0.000000	0.000000
	2	No	0.000000	0.000000	0.000000	0.166005
	2	Yes	0.181969	0.000000	0.000000	0.158843
	3	No	0.187735	0.000000	0.000000	0.084246
	3	Yes	0.000000	0.000000	0.000000	0.204952
	4	No	0.000000	0.000000	0.000000	0.138919
	4	Yes	0.000000	0.000000	0.000000	0.155410
	5	No	0.000000	0.000000	0.000000	0.121389
	6	No	0.000000	0.000000	0.000000	0.173706

See Table 10-2 for a summary of pivot_table methods.

function anme	Description
values	Column name or names to aggregate; 默认聚合所有的数值列
index	Column names or other group keys to group on the rows of the resulting pivot table
columns	Column names or other group keys to group on the columns of the result pivot table
aggfunc	Aggregation function or list of function(默认是mean); can be any function valid in a groupby context
fill_value	Replace missing values in result table
dropna	If True, do not include columns whose entries are all NA
margins	Add row/column subtotals and grand total

交叉表: Crosstab

是透视表的一部分, aggfunc=count而已

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.Here is an example:

As part of some survey analysis, we might want to summarize this data nationality and handedness. You could use pivot_table to do this, but the pandas.crosstab function can be more convenient:

pd.crosstab(data.Nationality, data.Handedness, margins=True)

The first two arguments to crosstab can each either be an array or Series or a list of arrays. As in the tips data:

"根据 day, time 对 smoker 进行统计"

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

'根据 day, time 对 smoker 进行统计'

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

	smoker	No	Yes	All
time	day
Dinner	Fri	3	9	12
	Sat	45	42	87
	Sun	57	19	76
	Thur	1	0	1
Lunch	Fri	1	6	7
Lunch	Thur	44	17	61
All		151	93	244

小结

Mastering pandas's data grouping tools can help both with data cleaning as well as modeling or statistical analysis work.

(熟练掌握 groupby 对数据清洗, 建模统计等都是有认识和实操方面的帮助的.)

pandas 之交叉表-透视表的更多相关文章

pandas-10 pd.pivot_table()透视表功能
pandas-10 pd.pivot_table()透视表功能和excel一样,pandas也有一个透视表的功能,具体demo如下: import numpy as np import pandas ...
Pandas透视表和交叉表
透视表参数名说明 values 待聚合的列的名称.默认聚合所有数值列 index 用于分组的列名或其他分组键,出现在结果透视表的行 columns 用于分组的列表或其他分组键,出现在结果透视表的列 ...
pandas交叉表和透视表及案例分析
一.交叉表: 作用: 交叉表是一种用于计算分组频率的特殊透视图,对数据进行汇总考察预测数据和正式数据的对比情况,一个作为行,一个作为列案例: 医院预测病人病情: 真实病情如下数组(B:有病,M:没 ...
04. Pandas 3| 数值计算与统计、合并连接去重分组透视表文件读取
1.数值计算和统计基础常用数学.统计方法数值计算和统计基础基本参数:axis.skipna df.mean(axis=1,skipna=False) -->> axis=1是按行来 ...
pandas_使用透视表与交叉表查看业绩汇总数据
# 使用透视表与交叉表查看业绩汇总数据 import pandas as pd import numpy as np import copy # 设置列对齐 pd.set_option("d ...
【转载】使用Pandas创建数据透视表
使用Pandas创建数据透视表本文转载自:蓝鲸的网站分析笔记原文链接:使用Pandas创建数据透视表目录 pandas.pivot_table() 创建简单的数据透视表增加一个行维度(inde ...
Pandas透视表（pivot_table）详解
介绍也许大多数人都有在Excel中使用数据透视表的经历,其实Pandas也提供了一个类似的功能,名为pivot_table.虽然pivot_table非常有用,但是我发现为了格式化输出我所需要的内容 ...
pandas实现excel中的数据透视表和Vlookup函数功能
在孩子王实习中做的一个小工作,方便整理数据. 目前这几行代码是实现了一个数据透视表和匹配的功能,但是将做好的结果写入了不同的excel中, 如何实现将结果连续保存到同一个Excel的同一个工作表中?还 ...
python pandas使用数据透视表
1) 官网啰嗦这一堆, pandas.pivot_table函数中包含四个主要的变量,以及一些可选择使用的参数.四个主要的变量分别是数据源data,行索引index,列columns,和数值value ...

随机推荐

SQLyog使用期限（治标不治本的，治本的还没找到）
在注册表中找到 HKEY_CURRENT_USER\Software 选中其中的类似下列文件名的文件 HKEY_CURRENT_USER\Software\{d58cb4b1-47f3-45cb ...
Linux 解决Deepin深度系统无法在root用户启动Google Chrome浏览器的问题
解决Deepin无法在root用户启动Google Chrome浏览器的问题,步骤如下. 前提:如何用root用户登录系统?编辑 vim /etc/lightdm/lightdm.conf , 找到并 ...
解决汉化pycharme之后设置打不开的问题
首先进入安装pycharme目录下lib目录下,将汉化包移出去,只留下英文包然后打开pycharme即可打开设置在你改完设置之后,可以再将汉化包放进来英文包:https://pan.baidu. ...
Python常见异常及常用单词翻译
Python常见异常及常用单词意思 AttributeError 试图访问一个对象没有的树形,比如foo.x,但是foo没有属性x IOError 输入/输出异常:基本上是无法打开文件 ImportE ...
Vue v-for使用 vue中循环输出数据
v-for的使用: 代码: <!doctype html> <html lang="en"> <head> <meta charset=& ...
1130不允许连接到MySql server
连接远程服务器mysql时,报错: 1130-host ... is not allowed to connect to this MySql server 这是因为默认只让localhost的主机连 ...
解决 cannot find reference 'LSHForest' in '__init__.py'
from sklearn.neighbors import LSHForest cannot find reference 'LSHForest' in '__init__.py'报错 pip3 li ...
Loj #2554. 「CTSC2018」青蕈领主
Loj #2554. 「CTSC2018」青蕈领主题目描述 "也许,我的生命也已经如同风中残烛了吧."小绿如是说. 小绿同学因为微积分这门课,对"连续"这一概 ...
flink solt，并行度
转自:https://www.jianshu.com/p/3598f23031e6 简介 Flink运行时主要角色有两个:JobManager和TaskManager,无论是standalone集群, ...
VMware 中安装kvm虚拟机
环境准备: 安装vmware时需要自定义安装-开启虚拟化技术安装成功之后就可以继续进行了. 1 查看CPU是否支持KVM egrep 'vmx|svm' /proc/cpuinfo --colo ...

pandas 之 交叉表-透视表

认识

交叉表: Crosstab

小结

pandas 之 交叉表-透视表的更多相关文章

随机推荐

热门专题

pandas 之交叉表-透视表

pandas 之交叉表-透视表的更多相关文章