pandas 之 多层索引
In many applications, data may be spread across a number of files or datasets or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine, and rearrange data.
(在许多应用中,数据可以分布在多个文件或数据集中,或者以不易分析的形式排列。 本章重点介绍帮助组合和重新排列数据的工具.)
import numpy as np
import pandas as pd
多层索引
Hierarchical indexing is an important featuer of pandas that enables you to have multiple(two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to to work with higher dimensional data in a lower dimensional form.(通过多层索引的方式去从低维看待高维数据). Let's start with a simple example; create a Series with a list of lists(or arrays) as the index:
data = pd.Series(np.random.randn(9),
index=['a,a,a,b,b,c,c,d,d'.split(','),
[1,2,3,1,3,1,2,2,3]])
data
a 1 0.874880
2 1.424326
3 -2.028509
b 1 -1.081833
3 -0.072116
c 1 0.575918
2 -1.246831
d 2 -1.008064
3 0.988234
dtype: float64
What you're seeing is a prettified view of a Series with a MultiIndex as its index. The 'gaps' in the index display mean "use the lable directly above":
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
With a hierarchically indexed object(分层索引对象), so-called partial indexing is possible, enabling you to concisely(便捷地) select subsets of the data.
data['b'] # 1 3
1 -1.081833
3 -0.072116
dtype: float64
data['b':'c'] # 1 3 1 2
b 1 -1.081833
3 -0.072116
c 1 0.575918
2 -1.246831
dtype: float64
data.loc[['b', 'd']] # loc 通常按名字取, iloc 按下标取
b 1 -1.081833
3 -0.072116
d 2 -1.008064
3 0.988234
dtype: float64
"Selection is even possible from an inner level"
data.loc[:, 2]
'Selection is even possible from an inner level'
a 1.424326
c -1.246831
d -1.008064
dtype: float64
Hierarchical indexing plays an important role in reshapeing data and group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method:
data.unstack()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| 1 | 2 | 3 | |
|---|---|---|---|
| a | 0.874880 | 1.424326 | -2.028509 |
| b | -1.081833 | NaN | -0.072116 |
| c | 0.575918 | -1.246831 | NaN |
| d | NaN | -1.008064 | 0.988234 |
The inverse operation of unstack is stack:
data.unstack().stack() # 相当于没变
a 1 0.874880
2 1.424326
3 -2.028509
b 1 -1.081833
3 -0.072116
c 1 0.575918
2 -1.246831
d 2 -1.008064
3 0.988234
dtype: float64
stack and unstack will be explored more detail later in this chapter.
With a DataFrame, either axis can have a hierarchical index:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index=[['a','a','b','b'], [1,2,1,2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']]
)
frame
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
| Ohio | Colorado | |||
|---|---|---|---|---|
| Green | Red | Green | ||
| a | 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 | |
| b | 1 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | |
The hierarchical levels can have names(as strings or any Python objects). If so, these will show up in the console output:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
"可设置行列索引的名字呢"
frame
'可设置行列索引的名字呢'
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key1 | key2 | |||
| a | 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 | |
| b | 1 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | |
Be careful to distinguish(分辨) the index names 'state' and 'color'
Wiht partial column indexing you can similarly select groups of columns:
(使用部分列索引, 可以相应地使用列组)
frame['Ohio']
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| color | Green | Red | |
|---|---|---|---|
| key1 | key2 | ||
| a | 1 | 0 | 1 |
| 2 | 3 | 4 | |
| b | 1 | 6 | 7 |
| 2 | 9 | 10 |
A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this.
tmp = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
names=['state', 'color'])
tmp
MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
labels=[[1, 1, 0], [0, 1, 0]],
names=['state', 'color'])
重排列和Level排序
At times you will need to rearange the order of the levels on an axis or sort the data by the value in one specific level. The swaplevel takes two levle numbers or names and return a new object with the levels interchanged(but the data is otherwise unaltered):
frame
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key1 | key2 | |||
| a | 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 | |
| b | 1 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | |
frame.swaplevel('key1', 'key2') # 交换索引level
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key2 | key1 | |||
| 1 | a | 0 | 1 | 2 |
| 2 | a | 3 | 4 | 5 |
| 1 | b | 6 | 7 | 8 |
| 2 | b | 9 | 10 | 11 |
sort_index, on the other hand, sorts the data using only the values in a single level. When swapping levels, it's not uncommon to also use sort_index so that the result is lexicographically(词典的) sorted by the indicated level:
frame.sort_index(level=1)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key1 | key2 | |||
| a | 1 | 0 | 1 | 2 |
| b | 1 | 6 | 7 | 8 |
| a | 2 | 3 | 4 | 5 |
| b | 2 | 9 | 10 | 11 |
# cj
frame.sort_index(level=0)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key1 | key2 | |||
| a | 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 | |
| b | 1 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | |
"先交换轴索引, 再按照轴0排序"
frame.swaplevel(0, 1).sort_index(level=0)
'先交换轴索引, 再按照轴0排序'
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key2 | key1 | |||
| 1 | a | 0 | 1 | 2 |
| b | 6 | 7 | 8 | |
| 2 | a | 3 | 4 | 5 |
| b | 9 | 10 | 11 | |
Data selection performance is much better on hierarchically indexed if the index is lexicographically sorted starting with the outermost level-that is the result of calling sort_index()
如果索引从最外层开始按字典顺序排序,则在分层索引上,>数据选择性能要好得多——这是调用sort index()的结果
按level描述性统计
Many descriptive and summary statistic on DataFrame and Series have a level option in which you can specify the level you want to aggregate by on a particular axis. Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:
frame
frame.sum(level='key2')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | ||
|---|---|---|---|---|
| color | Green | Red | Green | |
| key1 | key2 | |||
| a | 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 | |
| b | 1 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | |
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
| state | Ohio | Colorado | |
|---|---|---|---|
| color | Green | Red | Green |
| key2 | |||
| 1 | 6 | 8 | 10 |
| 2 | 12 | 14 | 16 |
frame.sum(level='color', axis=1)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| color | Green | Red | |
|---|---|---|---|
| key1 | key2 | ||
| a | 1 | 2 | 1 |
| 2 | 8 | 4 | |
| b | 1 | 14 | 7 |
| 2 | 20 | 10 |
Under the hood, this utilizes(利用) pandas's groupby machinery, which will be discussed in more detail later in the book.
将DF某列值作为行索引
It's not unusual(不寻常的) to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns. Here' an example DataFrame:
想要使用DataFrame中的一个或多个列作为行索引并不罕见; 或者,您可能希望将行索引移动到DataFrame的列中。 这是一个示例DataFrame:
frame = pd.DataFrame({
'a': range(7),
'b': range(7, 0, -1),
'c':"one,one,one,two,two,two,two".split(','), # cj
'd':[0, 1, 2, 0, 1, 2, 3]
})
frame
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0 | 7 | one | 0 |
| 1 | 1 | 6 | one | 1 |
| 2 | 2 | 5 | one | 2 |
| 3 | 3 | 4 | two | 0 |
| 4 | 4 | 3 | two | 1 |
| 5 | 5 | 2 | two | 2 |
| 6 | 6 | 1 | two | 3 |
DataFrame's set_index function will create a new DataFrame using one or more of its columns as the index:
"将 c, d 列作为index, 同时去掉c, d"
frame2 = frame.set_index(['c', 'd'])
frame2
'将 c, d 列作为index, 同时去掉c, d'
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| a | b | ||
|---|---|---|---|
| c | d | ||
| one | 0 | 0 | 7 |
| 1 | 1 | 6 | |
| 2 | 2 | 5 | |
| two | 0 | 3 | 4 |
| 1 | 4 | 3 | |
| 2 | 5 | 2 | |
| 3 | 6 | 1 |
By default the columns are removed from the DataFrame, though you can leave them in:
frame.set_index(['c', 'd'], drop=False)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| a | b | c | d | ||
|---|---|---|---|---|---|
| c | d | ||||
| one | 0 | 0 | 7 | one | 0 |
| 1 | 1 | 6 | one | 1 | |
| 2 | 2 | 5 | one | 2 | |
| two | 0 | 3 | 4 | two | 0 |
| 1 | 4 | 3 | two | 1 | |
| 2 | 5 | 2 | two | 2 | |
| 3 | 6 | 1 | two | 3 |
reset_index, on the other hand, does the opposite of set_index; the hierachical index levels are moved into the columns:
frame2
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| a | b | ||
|---|---|---|---|
| c | d | ||
| one | 0 | 0 | 7 |
| 1 | 1 | 6 | |
| 2 | 2 | 5 | |
| two | 0 | 3 | 4 |
| 1 | 4 | 3 | |
| 2 | 5 | 2 | |
| 3 | 6 | 1 |
"将多层index给还原到列去..."
frame2.reset_index()
'将多层index给还原到列去...'
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| c | d | a | b | |
|---|---|---|---|---|
| 0 | one | 0 | 0 | 7 |
| 1 | one | 1 | 1 | 6 |
| 2 | one | 2 | 2 | 5 |
| 3 | two | 0 | 3 | 4 |
| 4 | two | 1 | 4 | 3 |
| 5 | two | 2 | 5 | 2 |
| 6 | two | 3 | 6 | 1 |
# cj test
time.clock()
6e-07
def f(x, l=[]):
for i in range(x):
l.append(i*i)
print(l)
f(2)
f(3, [3,2,1])
f(3)
[0, 1]
[3, 2, 1, 0, 1, 4]
[0, 1, 0, 1, 4]
pandas 之 多层索引的更多相关文章
- pandas:多层索引
多层索引是指在行或者列轴上有两个及以上级别的索引,一般表示一个数据的几个分项. 1.创建多层索引 1.1通过分组产生多层索引 1.2由序列创建 1.3由元组创建 1.4可迭代对象的笛卡尔积 1.5将D ...
- pandas学习(创建多层索引、数据重塑与轴向旋转)
pandas学习(创建多层索引.数据重塑与轴向旋转) 目录 创建多层索引 数据重塑与轴向旋转 创建多层索引 隐式构造 Series 最常见的方法是给DataFrame构造函数的index参数传递两个或 ...
- 8 pandas模块,多层索引
1 创建多层索引 1)隐式构造 最常见的方法是给DataFrame构造函数的index参数传递两个或更多的数组 · Series也可以创建多层索引 ...
- pandas中层次化索引与切片
Pandas层次化索引 1. 创建多层索引 隐式索引: 常见的方式是给dataframe构造函数的index参数传递两个或是多个数组 Series也可以创建多层索引 Series多层索引 B =Ser ...
- pandas基础用法——索引
# -*- coding: utf-8 -*- # Time : 2016/11/28 15:14 # Author : XiaoDeng # version : python3.5 # Softwa ...
- pandas 之 时间序列索引
import numpy as np import pandas as pd 引入 A basic kind of time series object in pandas is a Series i ...
- Pandas | 08 重建索引
重新索引会更改DataFrame的行标签和列标签. 可以通过索引来实现多个操作: 重新排序现有数据以匹配一组新的标签. 在没有标签数据的标签位置插入缺失值(NA)标记. import pandas a ...
- numpy和pandas的基础索引切片
Numpy的索引切片 索引 In [72]: arr = np.array([[[1,1,1],[2,2,2]],[[3,3,3],[4,4,4]]]) In [73]: arr Out[73]: a ...
- Lesson8——Pandas reindex重置索引
pandas目录 1 简介 重置索引(reindex)可以更改原 DataFrame 的行标签或列标签,并使更改后的行.列标签与 DataFrame 中的数据逐一匹配.通过重置索引操作,您可以完成对现 ...
随机推荐
- 【BZOJ3711】Druzyny
[BZOJ3711]Druzyny 题面 bzoj 题解 首先我们有一个\(O(n^2)\)的\(dp\): 设\(f_i\)表示现在已经分好了\(1...i\)的组,且\(i\)作为一组的结尾的最大 ...
- git 学习网站
GitBook :https://git-scm.com/book/zh/v2 Git 教程 廖雪峰 :https://www.liaoxuefeng.com/wiki/89604348802960 ...
- mysql 主从 数据不一致
用pt-table-checksum校验数据一致性 Jun 4th, 2013 主从数据的一致性校验是个头疼的问题,偶尔被业务投诉主从数据不一致,或者几个从库之间的 数据不一致,这会令人沮丧.通常我们 ...
- java基础之 hashmap
Hashmap是一种非常常用的.应用广泛的数据类型,最近研究到相关的内容,就正好复习一下.网上关于hashmap的文章很多,但到底是自己学习的总结,就发出来跟大家一起分享,一起讨论. 1.hashma ...
- 【Gamma】Scrum Meeting 10
目录 写在前面 任务进度表 燃尽图 照片 写在前面 例会时间:6.8 22:30-23.00 例会地点:微信群语音通话 代码进度记录github在这里 任务进度表 注:点击链接跳转至相应的issue ...
- [转帖]OLAP引擎这么多,为什么苏宁选择用Druid?
OLAP引擎这么多,为什么苏宁选择用Druid? 原创 51CTO 2018-12-21 11:24:12 [51CTO.com原创稿件]随着公司业务增长迅速,数据量越来越大,数据的种类也越来越丰富, ...
- idea(2018.3.5)破解
第一步:下载idea,https://www.jetbrains.com/idea/download/#section=windows,双击进行安装 第二步:下载破解的jar包:链接:https:// ...
- 宝塔webhook配合码云,本地git push 服务器自动pull
emmmm,这其实是一个很简单的一件事情,但是有很多坑,记录一下 先大概讲一下原理吧,就是每次您 push 代码后,都会给远程 HTTP URL 发送一个 POST 请求 更多说明 » 然后在宝塔这边 ...
- 【题解】保安站岗[P2458]皇宫看守[LOJ10157][SDOI2006]
[题解]保安站岗[P2458]皇宫看守[LOJ10157][SDOI2006] 传送门:皇宫看守\([LOJ10157]\) 保安站岗 \([P2458]\) \([SDOI2006]\) [题目描述 ...
- Replication:The transaction log for database 'tempdb' is full due to 'ACTIVE_TRANSACTION'
今天早上,Dev跟我说,执行query statement时出现一个error,detail info是: “The transaction log for database 'tempdb' is ...