pandas: Merging and Combining Data
import numpy as np
import pandas as pd
Data contained in pandas objects can be combined together in a number of ways:
- pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- pandas.concat concatenates or "stacks" together objects along an axis.
- The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.
I will address each of these and give a number of examples. They'll be utilized in examples throughout the rest of the book.
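Since combine_first gets no worked example in this post, here is a minimal sketch of what "filling in missing values in one object with values from another" looks like. The Series names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two Series sharing the same index labels; `a` has gaps (NaN).
a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, 9.9, 3.0], index=['x', 'y', 'z'])

# Values present in `a` win; only the holes in `a` are patched from `b`.
patched = a.combine_first(b)
print(patched)
```

Note that the value 9.9 in `b` is ignored, because `a` already has a value at label 'y'.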
SQL-style joins
Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these algorithms on your data.
Let's start with a simple example:
df1 = pd.DataFrame({
    # split on whitespace so the keys carry no stray spaces
    'key': 'b b a c a a b'.split(),
    'data1': range(7)
})
df2 = pd.DataFrame({
    'key': ['a', 'b', 'd'],
    'data2': range(3)
})
df1
df2
| | key | data1 |
|---|---|---|
| 0 | b | 0 |
| 1 | b | 1 |
| 2 | a | 2 |
| 3 | c | 3 |
| 4 | a | 4 |
| 5 | a | 5 |
| 6 | b | 6 |
| | key | data2 |
|---|---|---|
| 0 | a | 0 |
| 1 | b | 1 |
| 2 | d | 2 |
This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:
# merge defaults to an inner join; with no key given, it joins on the shared column names
pd.merge(df1, df2)
| | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0 | 1 |
| 1 | b | 1 | 1 |
| 2 | b | 6 | 1 |
| 3 | a | 2 | 0 |
| 4 | a | 4 | 0 |
| 5 | a | 5 | 0 |
(Row order in inner and outer results may differ between pandas versions; the tables here follow the ordering used in the rest of this post.)
Note that I didn't specify which column to join on. If that information is not specified, merge uses the overlapping column names as keys. It's good practice to specify them explicitly, though:
# inner join: only keys found in both frames are kept
pd.merge(df2, df1, on='key')
| | key | data2 | data1 |
|---|---|---|---|
| 0 | a | 0 | 2 |
| 1 | a | 0 | 4 |
| 2 | a | 0 | 5 |
| 3 | b | 1 | 0 |
| 4 | b | 1 | 1 |
| 5 | b | 1 | 6 |
# left join: keep every row of df1; data2 is NaN where the key has no match in df2
pd.merge(df1, df2, on='key', how='left')
| | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0 | 1.0 |
| 1 | b | 1 | 1.0 |
| 2 | a | 2 | 0.0 |
| 3 | c | 3 | NaN |
| 4 | a | 4 | 0.0 |
| 5 | a | 5 | 0.0 |
| 6 | b | 6 | 1.0 |
If the column names are different in each object, you can specify them separately:
df3 = pd.DataFrame({
    'lkey': 'a b a c a a b'.split(),
    'data1': range(7)
})
df4 = pd.DataFrame({
    'rkey': ['a', 'b', 'd'],
    'data2': range(3)
})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
| | lkey | data1 | rkey | data2 |
|---|---|---|---|---|
| 0 | a | 0 | a | 0 |
| 1 | a | 2 | a | 0 |
| 2 | a | 4 | a | 0 |
| 3 | a | 5 | a | 0 |
| 4 | b | 1 | b | 1 |
| 5 | b | 6 | b | 1 |
You may notice that the 'c' and 'd' values and associated data are missing from the result. By default, merge does an inner join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are left, right, and outer. The outer join takes the union of the keys, combining the effect of applying both left and right joins.
# outer join: keep the union of the keys from both frames
pd.merge(df1, df2, how='outer')
| | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0.0 | 1.0 |
| 1 | b | 1.0 | 1.0 |
| 2 | b | 6.0 | 1.0 |
| 3 | a | 2.0 | 0.0 |
| 4 | a | 4.0 | 0.0 |
| 5 | a | 5.0 | 0.0 |
| 6 | c | 3.0 | NaN |
| 7 | d | NaN | 2.0 |
See Table 8-1 for a summary of the options for how.
| Option | Behavior |
|---|---|
| 'inner' | Use only the key combinations observed in both tables |
| 'left' | Use all key combinations found in the left table |
| 'right' | Use all key combinations found in the right table |
| 'outer' | Use all key combinations observed in both tables together |
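To make the table concrete, the following sketch runs all four how options on one small pair of frames and records which keys survive each join. The frames `left_df` and `right_df` and their columns are hypothetical, chosen only so that each side has one key the other lacks.

```python
import pandas as pd

# 'c' exists only on the left, 'd' only on the right.
left_df = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'lval': range(4)})
right_df = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': range(3)})

key_sets = {}
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left_df, right_df, on='key', how=how)
    key_sets[how] = sorted(merged['key'].unique())
    row_counts[how] = len(merged)
    print(how, key_sets[how], row_counts[how])
```

The inner join keeps only 'a' and 'b', the left join adds 'c', the right join adds 'd', and the outer join keeps all four keys.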
Many-to-many merges have well-defined, though not necessarily intuitive, behavior. Here's an example:
df1 = pd.DataFrame({
'key': 'b b a c a b'.split(),
'data1': range(6)
})
df2 = pd.DataFrame({
'key': 'a b a b d'.split(),
'data2': range(5)
})
df1
df2
| | key | data1 |
|---|---|---|
| 0 | b | 0 |
| 1 | b | 1 |
| 2 | a | 2 |
| 3 | c | 3 |
| 4 | a | 4 |
| 5 | b | 5 |
| | key | data2 |
|---|---|---|
| 0 | a | 0 |
| 1 | b | 1 |
| 2 | a | 2 |
| 3 | b | 3 |
| 4 | d | 4 |
pd.merge(df1, df2, how='inner')
| | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0 | 1 |
| 1 | b | 0 | 3 |
| 2 | b | 1 | 1 |
| 3 | b | 1 | 3 |
| 4 | b | 5 | 1 |
| 5 | b | 5 | 3 |
| 6 | a | 2 | 0 |
| 7 | a | 2 | 2 |
| 8 | a | 4 | 0 |
| 9 | a | 4 | 2 |
pd.merge(df1, df2, on='key', how='left')
| | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 0 | 1.0 |
| 1 | b | 0 | 3.0 |
| 2 | b | 1 | 1.0 |
| 3 | b | 1 | 3.0 |
| 4 | a | 2 | 0.0 |
| 5 | a | 2 | 2.0 |
| 6 | c | 3 | NaN |
| 7 | a | 4 | 0.0 |
| 8 | a | 4 | 2.0 |
| 9 | b | 5 | 1.0 |
| 10 | b | 5 | 3.0 |
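A many-to-many join forms the Cartesian product of the matching rows: three 'b' rows on the left times two on the right give six 'b' rows in the result. The sketch below checks that arithmetic on the same data, under the hypothetical names `m_left` and `m_right`. It also shows the validate option of merge, which the post does not use, as one way to catch an unintended many-to-many join.

```python
import pandas as pd

# Same data as the many-to-many example, under illustrative names.
m_left = pd.DataFrame({'key': 'b b a c a b'.split(), 'data1': range(6)})
m_right = pd.DataFrame({'key': 'a b a b d'.split(), 'data2': range(5)})

merged = pd.merge(m_left, m_right, on='key', how='inner')

# Rows per key on each side; their product predicts the merged row count.
left_counts = m_left['key'].value_counts()
right_counts = m_right['key'].value_counts()
predicted = (left_counts * right_counts).dropna().astype(int).to_dict()
observed = merged['key'].value_counts().to_dict()
print(predicted, observed)

# validate= makes merge raise when the key relationship is not the one
# you expected; here the right-hand keys are not unique.
try:
    pd.merge(m_left, m_right, on='key', validate='many_to_one')
    unique_right_keys = True
except pd.errors.MergeError:
    unique_right_keys = False
print(unique_right_keys)
```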
To merge with multiple keys, pass a list of column names:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
left
right
| | key1 | key2 | lval |
|---|---|---|---|
| 0 | foo | one | 1 |
| 1 | foo | two | 2 |
| 2 | bar | one | 3 |
| | key1 | key2 | rval |
|---|---|---|---|
| 0 | foo | one | 4 |
| 1 | foo | one | 5 |
| 2 | bar | one | 6 |
| 3 | bar | two | 7 |
# outer join on several keys: every (key1, key2) combination from either frame is kept
pd.merge(left, right, on=['key1', 'key2'], how='outer')
| | key1 | key2 | lval | rval |
|---|---|---|---|---|
| 0 | foo | one | 1.0 | 4.0 |
| 1 | foo | one | 1.0 | 5.0 |
| 2 | foo | two | 2.0 | NaN |
| 3 | bar | one | 3.0 | 6.0 |
| 4 | bar | two | NaN | 7.0 |
To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key.
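The "array of tuples" mental model can be checked directly. This sketch rebuilds the left and right frames from the example, forms the (key1, key2) tuples with plain Python sets, and compares set union and intersection with what merge returns. The set-based check is only an illustration of the idea, not how pandas implements the join.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Treat each (key1, key2) pair as a single composite key.
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))
inner_keys = left_keys & right_keys   # what how='inner' should keep
outer_keys = left_keys | right_keys   # what how='outer' should keep

merged_inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
merged_outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')
inner_result = set(zip(merged_inner['key1'], merged_inner['key2']))
outer_result = set(zip(merged_outer['key1'], merged_outer['key2']))
print(inner_result == inner_keys, outer_result == outer_keys)
```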
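When you are joining columns-on-columns, the indexes on the passed DataFrame objects are discarded. A last point to consider is overlapping column names: when both frames share a non-key column, as key2 is here, merge appends suffixes to tell them apart (_x and _y by default), and the suffixes option lets you choose your own: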
pd.merge(left, right, on='key1')
| | key1 | key2_x | lval | key2_y | rval |
|---|---|---|---|---|---|
| 0 | foo | one | 1 | one | 4 |
| 1 | foo | one | 1 | one | 5 |
| 2 | foo | two | 2 | one | 4 |
| 3 | foo | two | 2 | one | 5 |
| 4 | bar | one | 3 | one | 6 |
| 5 | bar | one | 3 | two | 7 |
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
| | key1 | key2_left | lval | key2_right | rval |
|---|---|---|---|---|---|
| 0 | foo | one | 1 | one | 4 |
| 1 | foo | one | 1 | one | 5 |
| 2 | foo | two | 2 | one | 4 |
| 3 | foo | two | 2 | one | 5 |
| 4 | bar | one | 3 | one | 6 |
| 5 | bar | one | 3 | two | 7 |
See Table 8-2 for an argument reference on merge. Joining using the DataFrame's row index is the subject of the next section.
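- left: DataFrame to be merged on the left side
- right: DataFrame to be merged on the right side
- how: one of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'
- on: column names to join on; they must be found in both DataFrames
- left_on: columns in left to use as join keys
- right_on: columns in right to use as join keys
- left_index: use the row index of left as its join key
- right_index: use the row index of right as its join key
- sort: sort the merged data by the join keys
- suffixes: tuple of strings appended to overlapping column names; defaults to ('_x', '_y')
- copy: controls whether data is copied into the result; the behavior depends on the pandas version
- indicator: adds a special column, _merge, that records the source of each row ('left_only', 'right_only', or 'both')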
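Of these arguments, indicator is the one the examples in this post never exercise, so here is a small sketch of it. The frames `a` and `b` are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
b = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# indicator=True adds a categorical column named '_merge' to the result.
tagged = pd.merge(a, b, on='key', how='outer', indicator=True)
origin = dict(zip(tagged['key'], tagged['_merge'].astype(str)))
print(origin)
```

Filtering on `_merge == 'left_only'` is a common way to find rows in one table that have no partner in the other.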
Merging on index
In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:
left1 = pd.DataFrame({
'key': ['a', 'b', 'a', 'a', 'b', 'c'],
'value': range(6)
})
right1 = pd.DataFrame({'group_val':[3.5, 7]}, index=['a', 'b'])
left1
right1
| | key | value |
|---|---|---|
| 0 | a | 0 |
| 1 | b | 1 |
| 2 | a | 2 |
| 3 | a | 3 |
| 4 | b | 4 |
| 5 | c | 5 |
| | group_val |
|---|---|
| a | 3.5 |
| b | 7.0 |
pd.merge(left1, right1, left_on='key', right_index=True)
| | key | value | group_val |
|---|---|---|---|
| 0 | a | 0 | 3.5 |
| 2 | a | 2 | 3.5 |
| 3 | a | 3 | 3.5 |
| 1 | b | 1 | 7.0 |
| 4 | b | 4 | 7.0 |
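The result above drops the 'c' row because the default is an inner join. As a sketch of the alternative, the code below rebuilds left1 and right1, asks for a left join so that 'c' is kept with a missing group_val, and compares it with the DataFrame.join instance method. The post itself does not show join; it is included here as a convenient shorthand for the same index-based merge.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Left join: every row of left1 is kept; 'c' has no match in right1's index.
via_merge = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

# join() does a left join by default, matching left1['key'] against right1's index.
via_join = left1.join(right1, on='key')

same = via_merge.sort_index().equals(via_join.sort_index())
print(via_join)
print(same)
```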
Concatenating along an axis
Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function can do this with NumPy arrays:
arr = np.arange(12).reshape((3,4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
# concatenate horizontally (along axis=1)
np.concatenate([arr, arr], axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
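For comparison, here is a brief sketch of the related stacking helpers: np.vstack and np.hstack for arrays, and pd.concat for labeled pandas objects. The Series `s1` and `s2` are made-up examples. Unlike the NumPy functions, pd.concat along axis=1 aligns on the index and fills the gaps with NaN.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))

stacked_rows = np.vstack([arr, arr])   # vertical stacking, shape (6, 4)
stacked_cols = np.hstack([arr, arr])   # same as concatenate(..., axis=1)

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

long = pd.concat([s1, s2])             # axis=0: one longer Series
wide = pd.concat([s1, s2], axis=1)     # axis=1: DataFrame on the union of labels
print(long)
print(wide)
```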
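I won't go any further here. In my own work, the tools I use most are still merge and join, for VLOOKUP-style table lookups, along with pd.concat(), np.vstack(), and np.hstack() for vertical and horizontal stacking. Combined with SQL, they are very flexible and efficient.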