pandas: Cross-Tabulations and Pivot Tables
import numpy as np
import pandas as pd
Overview
A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter, combined with reshape operations that use hierarchical indexing.
DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows:
tips = pd.read_csv('../examples/tips.csv')
# Add a new column, tip_pct (tip as a fraction of the total bill)
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]
| | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |
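| 5 | 25.29 | 4.71 | No | Sun | Dinner | 4 | 0.186240 |
# The default aggregation is mean.
# Note: on recent pandas versions the non-numeric 'time' column may raise an error here;
# if so, pass values=['size', 'tip', 'tip_pct', 'total_bill'] explicitly.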
tips.pivot_table(index=['day', 'smoker'])
| day | smoker | size | tip | tip_pct | total_bill |
|---|---|---|---|---|---|
| Fri | No | 2.250000 | 2.812500 | 0.151650 | 18.420000 |
| | Yes | 2.066667 | 2.714000 | 0.174783 | 16.813333 |
| Sat | No | 2.555556 | 3.102889 | 0.158048 | 19.661778 |
| | Yes | 2.476190 | 2.875476 | 0.147906 | 21.276667 |
| Sun | No | 2.929825 | 3.167895 | 0.160113 | 20.506667 |
| | Yes | 2.578947 | 3.516842 | 0.187250 | 24.120000 |
| Thur | No | 2.488889 | 2.673778 | 0.160298 | 17.113111 |
| | Yes | 2.352941 | 3.030000 | 0.163863 | 19.190588 |
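The table above is just a grouped mean, so the same result can be sketched with groupby. The snippet below is a minimal illustration, not the original notebook code: the small `tips` frame is a made-up stand-in for tips.csv (same column names, invented values), and `num_cols` is a helper name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: invented rows, same column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 5.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 2, 3, 4, 2, 3],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# Numeric columns to aggregate, listed explicitly so 'time' is left out.
num_cols = ["size", "tip", "tip_pct", "total_bill"]

via_pivot = tips.pivot_table(values=num_cols, index=["day", "smoker"])
via_groupby = tips.groupby(["day", "smoker"])[num_cols].mean()

# Raises AssertionError if the two results differ (dtype differences are ignored).
pd.testing.assert_frame_equal(via_pivot, via_groupby, check_dtype=False)
print(via_groupby)
```

Listing the numeric columns explicitly keeps the comparison independent of how a given pandas version treats non-numeric columns.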
This could have been produced with groupby directly. Now, suppose we want to aggregate only tip_pct and size, and additionally group by time. I'll put smoker in the table columns and time and day in the rows:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker')
| time | day | size (smoker=No) | size (smoker=Yes) | tip_pct (smoker=No) | tip_pct (smoker=Yes) |
|---|---|---|---|---|---|
| Dinner | Fri | 2.000000 | 2.222222 | 0.139622 | 0.165347 |
| | Sat | 2.555556 | 2.476190 | 0.158048 | 0.147906 |
| | Sun | 2.929825 | 2.578947 | 0.160113 | 0.187250 |
| | Thur | 2.000000 | NaN | 0.159744 | NaN |
| Lunch | Fri | 3.000000 | 1.833333 | 0.187735 | 0.188937 |
| | Thur | 2.500000 | 2.352941 | 0.160311 | 0.163863 |
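A pivot table with a `columns` key can be thought of as a groupby followed by a reshape. The sketch below, again on a made-up stand-in for the tips data, groups by all three keys and then moves `smoker` into the columns with `unstack`; it is meant as an illustration of the idea, not as a description of how pandas implements pivot_table internally.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: invented rows, same column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 5.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 2, 3, 4, 2, 3],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

via_pivot = tips.pivot_table(values=["tip_pct", "size"],
                             index=["time", "day"], columns="smoker")

# Group by every key, aggregate, then pivot the 'smoker' level into the columns.
via_groupby = (
    tips.groupby(["time", "day", "smoker"])[["size", "tip_pct"]]
    .mean()
    .unstack("smoker")
)

# Raises AssertionError if the two results differ (dtype differences are ignored).
pd.testing.assert_frame_equal(via_pivot, via_groupby, check_dtype=False)
print(via_groupby)
```

Combinations that never occur in the data, such as smokers at Sunday dinner in this toy frame, show up as NaN in both versions.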
We could augment this table to include partial totals by passing margins=True. This adds All row and column labels, whose values are the group statistics for all the data within a single tier:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker', margins=True)
| time | day | size (smoker=No) | size (smoker=Yes) | size (All) | tip_pct (smoker=No) | tip_pct (smoker=Yes) | tip_pct (All) |
|---|---|---|---|---|---|---|---|
| Dinner | Fri | 2.000000 | 2.222222 | 2.166667 | 0.139622 | 0.165347 | 0.158916 |
| | Sat | 2.555556 | 2.476190 | 2.517241 | 0.158048 | 0.147906 | 0.153152 |
| | Sun | 2.929825 | 2.578947 | 2.842105 | 0.160113 | 0.187250 | 0.166897 |
| | Thur | 2.000000 | NaN | 2.000000 | 0.159744 | NaN | 0.159744 |
| Lunch | Fri | 3.000000 | 1.833333 | 2.000000 | 0.187735 | 0.188937 | 0.188765 |
| | Thur | 2.500000 | 2.352941 | 2.459016 | 0.160311 | 0.163863 | 0.161301 |
| All | | 2.668874 | 2.408602 | 2.569672 | 0.159328 | 0.163196 | 0.160803 |
Here, the All values are means without taking into account smoker versus non-smoker or any of the two levels of grouping on the rows.
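To make the meaning of the All entries concrete, the following sketch checks them against plain means on a made-up stand-in for the tips data (invented rows, one row key instead of two to keep it short). The names `table` and `row_means` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: invented rows, same column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 5.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

table = tips.pivot_table(values="tip_pct", index="day",
                         columns="smoker", margins=True)
print(table)

# The All column is the per-day mean, ignoring smoker status.
row_means = tips.groupby("day")["tip_pct"].mean()
print((table["All"].drop("All") - row_means).abs().max())

# The bottom-right cell is the mean over every row of the data.
print(abs(table.loc["All", "All"] - tips["tip_pct"].mean()))
```

Both printed differences should be zero up to floating-point rounding.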
To use a different aggregation function, pass it to aggfunc. For example, count or len will give you a cross-tabulation of group sizes:
tips.pivot_table('tip_pct', index=['time', 'smoker'],
                 columns='day', aggfunc=len, margins=True)
| time | smoker | day=Fri | day=Sat | day=Sun | day=Thur | All |
|---|---|---|---|---|---|---|
| Dinner | No | 3.0 | 45.0 | 57.0 | 1.0 | 106.0 |
| | Yes | 9.0 | 42.0 | 19.0 | NaN | 70.0 |
| Lunch | No | 1.0 | NaN | NaN | 44.0 | 45.0 |
| | Yes | 6.0 | NaN | NaN | 17.0 | 23.0 |
| All | | 19.0 | 87.0 | 76.0 | 62.0 | 244.0 |
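As a small illustration of counting with aggfunc, the sketch below compares `len` with the string `"count"` on a made-up stand-in for the tips data. The two agree here because the toy frame has no missing values; in general `"count"` skips NaN entries in the aggregated column while `len` counts every row in the group.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: invented rows, same column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 5.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

by_len = tips.pivot_table(values="tip_pct", index="smoker",
                          columns="day", aggfunc=len)
by_count = tips.pivot_table(values="tip_pct", index="smoker",
                            columns="day", aggfunc="count")

# Raises AssertionError if the two results differ (dtype differences are ignored).
pd.testing.assert_frame_equal(by_len, by_count, check_dtype=False)
print(by_count)
```

The (Yes, Sun) cell is NaN rather than 0 because that combination never occurs, which is also why the counts in the table above are displayed as floats.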
If some combinations are empty (and would otherwise show up as NaN), you may wish to pass a fill_value:
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                 columns='day', aggfunc='mean', fill_value=0)
| time | size | smoker | day=Fri | day=Sat | day=Sun | day=Thur |
|---|---|---|---|---|---|---|
| Dinner | 1 | No | 0.000000 | 0.137931 | 0.000000 | 0.000000 |
| | | Yes | 0.000000 | 0.325733 | 0.000000 | 0.000000 |
| | 2 | No | 0.139622 | 0.162705 | 0.168859 | 0.159744 |
| | | Yes | 0.171297 | 0.148668 | 0.207893 | 0.000000 |
| | 3 | No | 0.000000 | 0.154661 | 0.152663 | 0.000000 |
| | | Yes | 0.000000 | 0.144995 | 0.152660 | 0.000000 |
| | 4 | No | 0.000000 | 0.150096 | 0.148143 | 0.000000 |
| | | Yes | 0.117750 | 0.124515 | 0.193370 | 0.000000 |
| | 5 | No | 0.000000 | 0.000000 | 0.206928 | 0.000000 |
| | | Yes | 0.000000 | 0.106572 | 0.065660 | 0.000000 |
| | 6 | No | 0.000000 | 0.000000 | 0.103799 | 0.000000 |
| Lunch | 1 | No | 0.000000 | 0.000000 | 0.000000 | 0.181728 |
| | | Yes | 0.223776 | 0.000000 | 0.000000 | 0.000000 |
| | 2 | No | 0.000000 | 0.000000 | 0.000000 | 0.166005 |
| | | Yes | 0.181969 | 0.000000 | 0.000000 | 0.158843 |
| | 3 | No | 0.187735 | 0.000000 | 0.000000 | 0.084246 |
| | | Yes | 0.000000 | 0.000000 | 0.000000 | 0.204952 |
| | 4 | No | 0.000000 | 0.000000 | 0.000000 | 0.138919 |
| | | Yes | 0.000000 | 0.000000 | 0.000000 | 0.155410 |
| | 5 | No | 0.000000 | 0.000000 | 0.000000 | 0.121389 |
| | 6 | No | 0.000000 | 0.000000 | 0.000000 | 0.173706 |
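The effect of fill_value is easiest to see side by side. This sketch builds the same pivot table with and without it on a made-up stand-in for the tips data; `sparse` and `filled` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: invented rows, same column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 5.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# Without fill_value, combinations that never occur are NaN.
sparse = tips.pivot_table(values="tip_pct", index="smoker",
                          columns="day", aggfunc="mean")

# With fill_value=0, those same cells are reported as 0.
filled = tips.pivot_table(values="tip_pct", index="smoker",
                          columns="day", aggfunc="mean", fill_value=0)

print(sparse)
print(filled)
print("empty cells before:", int(sparse.isna().sum().sum()))
print("empty cells after: ", int(filled.isna().sum().sum()))
```

One thing to keep in mind when reading a filled table of means: a 0 can then stand for "no observations" rather than an actual average of zero.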
See Table 10-2 for a summary of pivot_table options.
| Argument | Description |
|---|---|
| values | Column name or names to aggregate; by default, aggregates all numeric columns |
| index | Column names or other group keys to group on the rows of the resulting pivot table |
| columns | Column names or other group keys to group on the columns of the resulting pivot table |
| aggfunc | Aggregation function or list of functions (mean by default); can be any function valid in a groupby context |
| fill_value | Replace missing values in the result table |
| dropna | If True, do not include columns whose entries are all NA |
| margins | Add row/column subtotals and grand total |
Cross-Tabulations: Crosstab
- A crosstab is a special case of a pivot table: the aggregation is simply a count.
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.
As part of some survey analysis, suppose data is a DataFrame of survey responses with Nationality and Handedness columns, and we want to summarize it by nationality and handedness. You could use pivot_table to do this, but the pandas.crosstab function can be more convenient:
pd.crosstab(data.Nationality, data.Handedness, margins=True)
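The `data` frame used in the call above is not shown in this post. A minimal sketch with invented survey rows is given below; the column names Nationality and Handedness come from the call itself, while the Sample column and all values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical survey data: the rows are invented for illustration only.
data = pd.DataFrame({
    "Sample": range(1, 7),
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan", "USA"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Left-handed"],
})

# Rows are nationalities, columns are handedness; margins=True adds All totals.
counts = pd.crosstab(data.Nationality, data.Handedness, margins=True)
print(counts)
```

Each cell counts the respondents with that combination, and the All row and column hold the subtotals and the grand total.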
The first two arguments to crosstab can each be an array, a Series, or a list of arrays. For example, with the tips data:
# Count smokers and non-smokers for each combination of time and day
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
| time | day | smoker=No | smoker=Yes | All |
|---|---|---|---|---|
| Dinner | Fri | 3 | 9 | 12 |
| | Sat | 45 | 42 | 87 |
| | Sun | 57 | 19 | 76 |
| | Thur | 1 | 0 | 1 |
| Lunch | Fri | 1 | 6 | 7 |
| | Thur | 44 | 17 | 61 |
| All | | 151 | 93 | 244 |
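Since a crosstab is a pivot table that counts, the same frequencies can be sketched with pivot_table. The snippet below compares the two on a made-up stand-in for the tips data; the choice of `tip` as the counted column is arbitrary (any column without missing values would do), and `fill_value=0` is needed so that combinations that never occur are reported as 0, as crosstab does.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: invented rows, same column names.
tips = pd.DataFrame({
    "tip": [1.0, 4.0, 3.0, 8.0, 5.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Frequencies via crosstab.
ct = pd.crosstab([tips["time"], tips["day"]], tips["smoker"])

# The same frequencies via pivot_table with a count aggregation.
pt = tips.pivot_table(values="tip", index=["time", "day"],
                      columns="smoker", aggfunc="count", fill_value=0)

print(ct)
print(pt)
# True when both tables hold the same counts in the same positions.
print(bool((ct.to_numpy() == pt.to_numpy()).all()))
```

crosstab is the shorter spelling when only counts are needed, and it does not require picking a column to count.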
Summary
Mastering pandas's data grouping tools can help with data cleaning as well as with modeling or statistical analysis work.
(In other words, fluency with groupby pays off both conceptually and in practice for data cleaning, modeling, and statistics.)