pandas 之 索引重塑
import numpy as np
import pandas as pd
There are a number of basic operations for rearanging tabular data. These are alternatingly referred to as reshape or pivot operations.
多层索引重塑
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack - 列拉长index
This "rotates" or pivots from the columns in the data to the rows.
unstack
This pivots from the rows into the columns.
I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'],
name='number'))
data
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| number | one | two | three |
|---|---|---|---|
| state | |||
| Ohio | 0 | 1 | 2 |
| Colorado | 3 | 4 | 5 |
Using the stack method on this data pivots the columns into the rows, producing a Series.
"stack 将每一行, 叠成一个Series, 堆起来"
result = data.stack()
result
'stack 将每一行, 叠成一个Series, 堆起来'
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
From a hierarchically indexed Series, you can rearrage the data back into a DataFrame with unstack
"unstack 将叠起来的Series, 变回DF"
result.unstack()
'unstack 将叠起来的Series, 变回DF'
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| number | one | two | three |
|---|---|---|---|
| state | |||
| Ohio | 0 | 1 | 2 |
| Colorado | 3 | 4 | 5 |
By default the innermost level is unstacked(same with stack). You can unstack a different level by passing a level number or name.
result.unstack(level=0)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| state | Ohio | Colorado |
|---|---|---|
| number | ||
| one | 0 | 3 |
| two | 1 | 4 |
| three | 2 | 5 |
result.unstack(level='state')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| state | Ohio | Colorado |
|---|---|---|
| number | ||
| one | 0 | 3 |
| two | 1 | 4 |
| three | 2 | 5 |
Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack() # 外连接哦
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| a | b | c | d | e | |
|---|---|---|---|---|---|
| one | 0.0 | 1.0 | 2.0 | 3.0 | NaN |
| two | NaN | NaN | 4.0 | 5.0 | 6.0 |
%time data2.unstack().stack()
Wall time: 5 ms
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
df = pd.DataFrame({'left': result, 'right': result + 5},
columns=pd.Index(['left', 'right'], name='side'))
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| side | left | right | |
|---|---|---|---|
| state | number | ||
| Ohio | one | 0 | 5 |
| two | 1 | 6 | |
| three | 2 | 7 | |
| Colorado | one | 3 | 8 |
| two | 4 | 9 | |
| three | 5 | 10 |
df.unstack("state")
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| side | left | right | ||
|---|---|---|---|---|
| state | Ohio | Colorado | Ohio | Colorado |
| number | ||||
| one | 0 | 3 | 5 | 8 |
| two | 1 | 4 | 6 | 9 |
| three | 2 | 5 | 7 | 10 |
When calling stack, we can indicate the name of the axis to stack:
%time df.unstack('state').stack('side')
Wall time: 118 ms
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| state | Colorado | Ohio | |
|---|---|---|---|
| number | side | ||
| one | left | 3 | 0 |
| right | 8 | 5 | |
| two | left | 4 | 1 |
| right | 9 | 6 | |
| three | left | 5 | 2 |
| right | 10 | 7 |
长转宽形
A common way to store multiple time series in databases and CSV is in so-called long or stacked format. Let's load some example data and do a small amonut of time series wrangling and other data cleaning:
%%time
data = pd.read_csv("../examples/macrodata.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year 203 non-null float64
quarter 203 non-null float64
realgdp 203 non-null float64
realcons 203 non-null float64
realinv 203 non-null float64
realgovt 203 non-null float64
realdpi 203 non-null float64
cpi 203 non-null float64
m1 203 non-null float64
tbilrate 203 non-null float64
unemp 203 non-null float64
pop 203 non-null float64
infl 203 non-null float64
realint 203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.98 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
| 1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.15 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
| 2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.35 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
| 3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.37 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
| 4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.54 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
# 修改列索引名
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:10]
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| date | item | value | |
|---|---|---|---|
| 0 | 1959-03-31 | realgdp | 2710.349 |
| 1 | 1959-03-31 | infl | 0.000 |
| 2 | 1959-03-31 | unemp | 5.800 |
| 3 | 1959-06-30 | realgdp | 2778.801 |
| 4 | 1959-06-30 | infl | 2.340 |
| 5 | 1959-06-30 | unemp | 5.100 |
| 6 | 1959-09-30 | realgdp | 2775.488 |
| 7 | 1959-09-30 | infl | 2.740 |
| 8 | 1959-09-30 | unemp | 5.300 |
| 9 | 1959-12-31 | realgdp | 2785.204 |
This is so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item columns to change as data is added to the table. In the previous example, date and keys offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:
pivoted = ldata.pivot('date', 'item', 'value')
pivoted[:5]
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| item | infl | realgdp | unemp |
|---|---|---|---|
| date | |||
| 1959-03-31 | 0.00 | 2710.349 | 5.8 |
| 1959-06-30 | 2.34 | 2778.801 | 5.1 |
| 1959-09-30 | 2.74 | 2775.488 | 5.3 |
| 1959-12-31 | 0.27 | 2785.204 | 5.6 |
| 1960-03-31 | 2.31 | 2847.699 | 5.2 |
The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
ldata['valu2'] = np.random.randn(len(ldata))
ldata[:10]
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| date | item | value | valu2 | |
|---|---|---|---|---|
| 0 | 1959-03-31 | realgdp | 2710.349 | -0.143460 |
| 1 | 1959-03-31 | infl | 0.000 | -0.422318 |
| 2 | 1959-03-31 | unemp | 5.800 | 0.389872 |
| 3 | 1959-06-30 | realgdp | 2778.801 | -0.208526 |
| 4 | 1959-06-30 | infl | 2.340 | -1.538956 |
| 5 | 1959-06-30 | unemp | 5.100 | -0.143273 |
| 6 | 1959-09-30 | realgdp | 2775.488 | 0.385763 |
| 7 | 1959-09-30 | infl | 2.740 | 0.564365 |
| 8 | 1959-09-30 | unemp | 5.300 | 0.266295 |
| 9 | 1959-12-31 | realgdp | 2785.204 | -1.267871 |
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
Wide to Long
An inverse operation to pivot for DataFrame is pandas.melt. Rather than transroming one columns into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input, Let's look at an example:
df = pd.DataFrame({
'key': ['foo', 'bar', 'baz'],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]
})
df
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| key | A | B | C | |
|---|---|---|---|---|
| 0 | foo | 1 | 4 | 7 |
| 1 | bar | 2 | 5 | 8 |
| 2 | baz | 3 | 6 | 9 |
The 'key' columns may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which colmuns are group indicators Let's use 'key' as the only group indicator here:
melted = pd.melt(df, ['key'])
melted
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| key | variable | value | |
|---|---|---|---|
| 0 | foo | A | 1 |
| 1 | bar | A | 2 |
| 2 | baz | A | 3 |
| 3 | foo | B | 4 |
| 4 | bar | B | 5 |
| 5 | baz | B | 6 |
| 6 | foo | C | 7 |
| 7 | bar | C | 8 |
| 8 | baz | C | 9 |
Using pivot, we can reshape back to the original layout:(布局)
reshaped = melted.pivot('key', 'variable', 'value')
reshaped
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| variable | A | B | C |
|---|---|---|---|
| key | |||
| bar | 2 | 5 | 8 |
| baz | 3 | 6 | 9 |
| foo | 1 | 4 | 7 |
Since the result of pivot creats an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:
reshaped.reset_index()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| variable | key | A | B | C |
|---|---|---|---|---|
| 0 | bar | 2 | 5 | 8 |
| 1 | baz | 3 | 6 | 9 |
| 2 | foo | 1 | 4 | 7 |
You can also specify a subset of columns to use as value columns:
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| key | variable | value | |
|---|---|---|---|
| 0 | foo | A | 1 |
| 1 | bar | A | 2 |
| 2 | baz | A | 3 |
| 3 | foo | B | 4 |
| 4 | bar | B | 5 |
| 5 | baz | B | 6 |
pandas.melt can be used without any group identifiers, too:
pd.melt(df, value_vars=['A', 'B', 'C'])
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| variable | value | |
|---|---|---|
| 0 | A | 1 |
| 1 | A | 2 |
| 2 | A | 3 |
| 3 | B | 4 |
| 4 | B | 5 |
| 5 | B | 6 |
| 6 | C | 7 |
| 7 | C | 8 |
| 8 | C | 9 |
pd.melt(df, value_vars=['key', 'A', 'B'])
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| variable | value | |
|---|---|---|
| 0 | key | foo |
| 1 | key | bar |
| 2 | key | baz |
| 3 | A | 1 |
| 4 | A | 2 |
| 5 | A | 3 |
| 6 | B | 4 |
| 7 | B | 5 |
| 8 | B | 6 |
小结
Now that you have some pandas basics for data import, clearning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advance analytics.
pandas 之 索引重塑的更多相关文章
- pandas重置索引的几种方法探究
pandas重置索引的几种方法探究 reset_index() reindex() set_index() 函数名字看起来非常有趣吧! 不仅如此. 需要探究. http://nbviewer.jupy ...
- (三)pandas 层次化索引
pandas层次化索引 1. 创建多层行索引 1) 隐式构造 最常见的方法是给DataFrame构造函数的index参数传递两个或更多的数组 Series也可以创建多层索引 import numpy ...
- pandas 数据索引与选取
我们对 DataFrame 进行选择,大抵从这三个层次考虑:行列.区域.单元格.其对应使用的方法如下:一. 行,列 --> df[]二. 区域 --> df.loc[], df.ilo ...
- Pandas之索引
Pandas的标签处理需要分成多种情况来处理,Series和DataFrame根据标签索引数据的操作方法是不同的,单列索引和双列索引的操作方法也是不同的. 单列索引 In [2]: import pa ...
- pandas重新索引
#重新索引会更改DataFrame的行标签和列标签.重新索引意味着符合数据以匹配特定轴上的一组给定的标签. #可以通过索引来实现多个操作 - #重新排序现有数据以匹配一组新的标签. #在没有标签数据的 ...
- pandas DataFrame 索引(iloc 与 loc 的区别)
Pandas--ix vs loc vs iloc区别 0. DataFrame DataFrame 的构造主要依赖如下三个参数: data:表格数据: index:行索引: columns:列名: ...
- Pandas重建索引
重新索引会更改DataFrame的行标签和列标签.重新索引意味着符合数据以匹配特定轴上的一组给定的标签. 可以通过索引来实现多个操作 - 重新排序现有数据以匹配一组新的标签. 在没有标签数据的标签位置 ...
- pandas层级索引1
层级索引(hierarchical indexing) 下面创建一个Series, 在输入索引Index时,输入了由两个子list组成的list,第一个子list是外层索引,第二个list是内层索引. ...
- pandas层级索引
层级索引(hierarchical indexing) 下面创建一个Series, 在输入索引Index时,输入了由两个子list组成的list,第一个子list是外层索引,第二个list是内层索引. ...
随机推荐
- 【主席树启发式合并】【P3302】[SDOI2013]森林
Description 给定一个 \(n\) 个节点的森林,有 \(Q\) 次操作,每次要么将森林中某两点联通,保证操作后还是个森林,要么查询两点间权值第 \(k\) 小,保证两点联通.强制在线. L ...
- vue中使用element-ui自定义主题后,vue-cli跑不起来了
环境:vue-cli 2.x版本 自己在官网配置了主题并放到了项目中https://element.eleme.cn/#/zh-CN/theme 然后,我的脚手架在我的电脑中休息了几天,就跑不通了呢! ...
- java 中 public default protected private 的区别
对于继承自己的class,父类可以认为他们都是自己的子女,而对于和自己都在同一个目录下的class,可以认为都是自己的朋友. public:对所有用户开发,所有用户都可以直接调用 private:自己 ...
- Java调用api使用企业邮箱账户发送邮件
package cn.ucmed.otaka.healthcare.rubik.common; import lombok.extern.slf4j.Slf4j; import javax.mail. ...
- git bash 乱码问题之解决方案
解决办法:右击左上方git标识,然后进入到如图中,点击Text,进行操作. 操作完毕后,关闭git bash,然后再重新打开,执行ls或ll命令,查看对应的以中文作为目录或文件名是否显示乱码,如果之前 ...
- prometheus、node_exporter、cAdvisor常用参数
本节将介绍一下我在使用过程中用到的promethues.node_exporter.cAdvisor的常用参数,做一个总结 一.prometheus prometheus分为容器安装和二进制文件安装, ...
- python虚拟环境切换无效问题
使用pycharm创建新项目,使用虚拟环境,但是进入到项目的cainiao_guoguo_health\venv\Scripts目录启动虚拟环境后,安装第三方库,却还是安装到其他环境中去了, 检查ac ...
- 二叉树遍历(前序、中序、后序)-Java实现
一.前序遍历 访问顺序:先根节点,再左子树,最后右子树:上图的访问结果为:GDAFEMHZ. 1)递归实现 public void preOrderTraverse1(TreeNode root) { ...
- c++篇 vc++2010设置和c#一样的代码段,vs2010 两下tab设置
设置vs2010 tab敲两下出 for 片段,因为vs2010的代码片段是在番茄助手里设置的...代码片段管理器中不能设置c++ 所以我只能安装一个番茄助手了... 然后就是修改番茄助手内的[提示] ...
- Android中getprop命令的使用
(1)getprop 在Android系统中,使用getprop命令可以从系统中读取一些设备信息,属性的文件例如: init.rc default.prop /system/build.prop 查询 ...