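Combining data in pandas

# Key topic

pandas offers two families of combining operations:

Concatenation: pd.concat, DataFrame.append
Merging: pd.merge, DataFrame.join

0. Review: concatenation in NumPy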

import numpy as np

import pandas as pd

from pandas import Series,DataFrame

============================================

Exercise 12:

Create two 3*3 matrices and concatenate them along each of the two axes.

============================================

nd1 =np.array([1,2,3])

nd2 =np.array([-1,-2,-3,-4])

np.concatenate([nd1,nd2])

array([ 1,  2,  3, -1, -2, -3, -4])

nd3 = np.array([[-1,-2,-3],[0,2,4]])

nd1 + nd3

array([[0, 0, 0],

       [1, 4, 7]])

nd1.shape

(3,)

nd3.shape

(2, 3)

nd1 + nd2

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-10-cffcceec071c> in <module>()

----> 1 nd1 + nd2

ValueError: operands could not be broadcast together with shapes (3,) (4,)
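
The review above only concatenates 1-D arrays. As a minimal sketch of what Exercise 12 asks for, assuming two arbitrary 3x3 matrices (m1 and m2 are illustrative names that do not appear in the original), concatenation along each axis might look like this:

```python
import numpy as np

# Two hypothetical 3x3 matrices, used only for illustration.
m1 = np.arange(9).reshape(3, 3)
m2 = np.arange(9, 18).reshape(3, 3)

# axis=0 stacks the rows: the result has 6 rows and 3 columns.
rows = np.concatenate([m1, m2], axis=0)

# axis=1 places the matrices side by side: 3 rows and 6 columns.
cols = np.concatenate([m1, m2], axis=1)

print(rows.shape, cols.shape)
```

Unlike the + operator, np.concatenate does not broadcast: every dimension except the one being joined must already match.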

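To make the examples easier to follow, we first define a helper that builds a DataFrame:

def make_df(cols,inds):

    data = {c:[c+str(i) for i in inds] for c in cols}

    return DataFrame(data,index = inds)

# when c = 'a', the column holds a1, a2, a3

# when c = 'b', the column holds b1, b2, b3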

df1 = make_df(list("abc"),[1,2,3])

df1


	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3

df2 = make_df(list('abc'),[4,5,6])

df2


	a	b	c
4	a4	b4	c4
5	a5	b5	c5
6	a6	b6	c6

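1. Concatenating with pd.concat()

pandas provides pd.concat, which works like np.concatenate but takes more parameters. The signature below comes from the older pandas release used in this post; join_axes was removed in pandas 1.0.

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,

          keys=None, levels=None, names=None, verify_integrity=False,

          copy=True)

1) Simple concatenation

As with np.concatenate, rows are added by default (axis=0).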

pd.concat([df1,df2])

# When concatenating, always pay attention to the axis!

	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3
4	a4	b4	c4
5	a5	b5	c5
6	a6	b6	c6

df1


	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3

df3 =make_df(list("def"),[1,2,3])

df3


	d	e	f
1	d1	e1	f1
2	d2	e2	f2
3	d3	e3	f3

df1 + df3


	a	b	c	d	e	f
1	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN

pd.concat([df1, df3], axis = 1)


	a	b	c	d	e	f
1	a1	b1	c1	d1	e1	f1
2	a2	b2	c2	d2	e2	f2
3	a3	b3	c3	d3	e3	f3

pd.concat([df1,df2],axis = 1)


	a	b	c	a	b	c
1	a1	b1	c1	NaN	NaN	NaN
2	a2	b2	c2	NaN	NaN	NaN
3	a3	b3	c3	NaN	NaN	NaN
4	NaN	NaN	NaN	a4	b4	c4
5	NaN	NaN	NaN	a5	b5	c5
6	NaN	NaN	NaN	a6	b6	c6

Set axis to change the direction of concatenation.

Note that index labels may repeat after concatenation.

df1


	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3

df4 = make_df(list('abc'),[2,3,4])

df4


	a	b	c
2	a2	b2	c2
3	a3	b3	c3
4	a4	b4	c4

pd.concat([df1,df4])


	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3
2	a2	b2	c2
3	a3	b3	c3
4	a4	b4	c4

You can also pass ignore_index=True to discard the original index and renumber the rows.

pd.concat([df1,df4],ignore_index=True)


	a	b	c
0	a1	b1	c1
1	a2	b2	c2
2	a3	b3	c3
3	a2	b2	c2
4	a3	b3	c3
5	a4	b4	c4

Or use keys to build a hierarchical (multi-level) index

concat([x,y],keys=['x','y'])

pd.concat([df1,df4],keys = ["三班","四班"])


		a	b	c
三班	1	a1	b1	c1
	2	a2	b2	c2
	3	a3	b3	c3
四班	2	a2	b2	c2
	3	a3	b3	c3
	4	a4	b4	c4
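
The three ways of handling a repeated index shown above can be compared side by side. The sketch below rebuilds make_df so that it runs on its own; the keys "x" and "y" and the variable names are illustrative assumptions. It also uses verify_integrity=True, a pd.concat parameter listed in the signature earlier, to turn overlapping labels into an error instead of silently keeping them.

```python
import pandas as pd

def make_df(cols, inds):
    # Same helper as in the post: cell values are column name + index label.
    data = {c: [c + str(i) for i in inds] for c in cols}
    return pd.DataFrame(data, index=inds)

df1 = make_df(list("abc"), [1, 2, 3])
df4 = make_df(list("abc"), [2, 3, 4])

# Option 1: throw the old labels away and renumber from 0.
renumbered = pd.concat([df1, df4], ignore_index=True)

# Option 2: keep the old labels but add an outer level naming the source.
labeled = pd.concat([df1, df4], keys=["x", "y"])
second_block = labeled.loc["y"]  # rows that came from df4

# Option 3: refuse to concatenate when labels overlap.
try:
    pd.concat([df1, df4], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(renumbered.index.tolist(), overlap_detected)
```

Which option fits depends on whether the original labels carry meaning: student IDs are worth keeping under keys, while a plain row counter can usually be renumbered.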

============================================

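Exercise 13:

Think about where concatenation is useful.
Using what you learned yesterday, build a midterm score table ddd for Zhang San and Li Si.
Suppose a new exam subject, "Computer Science", is added. How would you add it?
How would you add the scores of a new student, Wang Laowu?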

============================================

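2) Concatenating mismatched frames

"Mismatched" means the labels on the other axis do not line up: the column labels differ when concatenating vertically, or the row labels differ when concatenating horizontally.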

df1


	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3

df5 = make_df(list("abcd"),[3,4,5,6])

df5


	a	b	c	d
3	a3	b3	c3	d3
4	a4	b4	c4	d4
5	a5	b5	c5	d5
6	a6	b6	c6	d6

pd.concat([df1,df5])


	a	b	c	d
1	a1	b1	c1	NaN
2	a2	b2	c2	NaN
3	a3	b3	c3	NaN
3	a3	b3	c3	d3
4	a4	b4	c4	d4
5	a5	b5	c5	d5
6	a6	b6	c6	d6

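There are three ways to join:

Outer join: fill the gaps with NaN (the default)

# This is what the example above did: the default behavior

# join='outer'

Inner join: keep only the labels that match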

pd.concat([df1,df5],join = "inner")

# Only the columns present in every input are kept

	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3
3	a3	b3	c3
4	a4	b4	c4
5	a5	b5	c5
6	a6	b6	c6

Join on specified axes with join_axes (available only in pandas releases before 1.0)

df6 = make_df(list("abcz"), [3,4,7,8])

df6


	a	b	c	z
3	a3	b3	c3	z3
4	a4	b4	c4	z4
7	a7	b7	c7	z7
8	a8	b8	c8	z8

type(df6.columns)

pandas.core.indexes.base.Index

df6.columns

Index(['a', 'b', 'c', 'z'], dtype='object')

pd.concat([df6,df5,df2,df1], join_axes=[df6.columns])

# axis: a single axis; axes: the plural form

# join_axes takes a list of Index objects

	a	b	c	z
3	a3	b3	c3	z3
4	a4	b4	c4	z4
7	a7	b7	c7	z7
8	a8	b8	c8	z8
3	a3	b3	c3	NaN
4	a4	b4	c4	NaN
5	a5	b5	c5	NaN
6	a6	b6	c6	NaN
4	a4	b4	c4	NaN
5	a5	b5	c5	NaN
6	a6	b6	c6	NaN
1	a1	b1	c1	NaN
2	a2	b2	c2	NaN
3	a3	b3	c3	NaN
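
Because join_axes no longer exists in pandas 1.0 and later, the call above fails on current releases. A minimal sketch of one replacement, assuming the goal is simply "keep exactly the columns of df6", is to concatenate with the default outer join and then reindex the columns. The frames are rebuilt here so the snippet runs on its own.

```python
import pandas as pd

def make_df(cols, inds):
    # Same helper as in the post: cell values are column name + index label.
    data = {c: [c + str(i) for i in inds] for c in cols}
    return pd.DataFrame(data, index=inds)

df5 = make_df(list("abcd"), [3, 4, 5, 6])
df6 = make_df(list("abcz"), [3, 4, 7, 8])

# Outer concat keeps the union of columns (a, b, c, z, d);
# reindex then keeps only df6's columns, which drops d.
result = pd.concat([df6, df5]).reindex(columns=df6.columns)

print(result.columns.tolist())
print(result["z"].isna().sum())  # rows from df5 have no z value
```

Rows that came from df5 get NaN in column z, matching the table above.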

============================================

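Exercise 14:

Suppose the final-exam table ddd2 has no scores for Zhang San, only for Li Si, Wang Laowu and Zhao Xiaoliu. Concatenate it in several different ways.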

============================================

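3) Appending with append()

Concatenating at the end is so common that there is a dedicated append method for it. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current releases use pd.concat instead.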

s1 = ["123"]

s1.append('456')

s1

['123', '456']

# append behaves much like concat

df1.append(df2)


	a	b	c
1	a1	b1	c1
2	a2	b2	c2
3	a3	b3	c3
4	a4	b4	c4
5	a5	b5	c5
6	a6	b6	c6

df5


	a	b	c	d
3	a3	b3	c3	d3
4	a4	b4	c4	d4
5	a5	b5	c5	d5
6	a6	b6	c6	d6

df5.append(df1)


	a	b	c	d
3	a3	b3	c3	d3
4	a4	b4	c4	d4
5	a5	b5	c5	d5
6	a6	b6	c6	d6
1	a1	b1	c1	NaN
2	a2	b2	c2	NaN
3	a3	b3	c3	NaN
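
On pandas 2.0 and later, where DataFrame.append is gone, the same result can be obtained with pd.concat. The following is a sketch under that assumption; new_row and its label 9 are made up for illustration.

```python
import pandas as pd

def make_df(cols, inds):
    # Same helper as in the post: cell values are column name + index label.
    data = {c: [c + str(i) for i in inds] for c in cols}
    return pd.DataFrame(data, index=inds)

df1 = make_df(list("abc"), [1, 2, 3])
df5 = make_df(list("abcd"), [3, 4, 5, 6])

# Equivalent of df5.append(df1): df1 has no column d, so it is filled with NaN.
appended = pd.concat([df5, df1])

# Appending a single row: wrap the row in a one-row DataFrame first.
new_row = pd.DataFrame([{"a": "a9", "b": "b9", "c": "c9"}], index=[9])
with_row = pd.concat([df1, new_row])

print(appended.shape, with_row.index.tolist())
```

Calling pd.concat once on a list of frames is also cheaper than appending in a loop, since every call copies the data.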

============================================

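Exercise 15:

Create a final-exam score table ddd3 containing only Zhang San, Li Si and Wang Laowu, and use append() to concatenate it with the midterm table ddd.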

============================================

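2. Merging with pd.merge()

# Key topic

# By default, the two DataFrames must share at least one column name for merge to work

merge differs from concat in that it combines rows based on the values of a shared row or column.

When you call pd.merge() without naming a key, it uses the columns whose names appear in both frames as the key.

Note that the values in the key column do not need to be in the same order in the two frames.

1) One-to-one merge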

df1 = DataFrame({"age":[30,22,36],"work":['tech',"accounting","sell"],"sex":["男","女","女"]}, index = list("abc"))

df1


	age	sex	work
a	30	男	tech
b	22	女	accounting
c	36	女	sell

df2 = DataFrame({"home":["上海","安徽","山东"],"work":['tech',"accounting","sell"],"weight":[60,50,45]},

                index = list("abc"))

df2


	home	weight	work
a	上海	60	tech
b	安徽	50	accounting
c	山东	45	sell

pd.concat([df1,df2],axis = 1)


	age	sex	work	home	weight	work
a	30	男	tech	上海	60	tech
b	22	女	accounting	安徽	50	accounting
c	36	女	sell	山东	45	sell

df1.merge(df2)


	age	sex	work	home	weight
0	30	男	tech	上海	60
1	22	女	accounting	安徽	50
2	36	女	sell	山东	45

2) Many-to-one merge

df1


	age	sex	work
a	30	男	tech
b	22	女	accounting
c	36	女	sell

df3 = DataFrame({"home":["深圳","北京","上海","安徽","山东"],

                "work":["tech","tech","tech","accounting","sell"],

                "weight":[60,75,80,54,63]},index = list("abcde"))

df3


	home	weight	work
a	深圳	60	tech
b	北京	75	tech
c	上海	80	tech
d	安徽	54	accounting
e	山东	63	sell

df1.merge(df3)


	age	sex	work	home	weight
0	30	男	tech	深圳	60
1	30	男	tech	北京	75
2	30	男	tech	上海	80
3	22	女	accounting	安徽	54
4	36	女	sell	山东	63

3) Many-to-many merge

df5 = DataFrame({"age":[28,30,22,36], "work":['tech',"tech","accounting","sell"],"sex":["女","男","女","女"]}, index = list("abce"))

df5


	age	sex	work
a	28	女	tech
b	30	男	tech
c	22	女	accounting
e	36	女	sell

df3


	home	weight	work
a	深圳	60	tech
b	北京	75	tech
c	上海	80	tech
d	安徽	54	accounting
e	山东	63	sell

df3.merge(df5)


	home	weight	work	age	sex
0	深圳	60	tech	28	女
1	深圳	60	tech	30	男
2	北京	75	tech	28	女
3	北京	75	tech	30	男
4	上海	80	tech	28	女
5	上海	80	tech	30	男
6	安徽	54	accounting	22	女
7	山东	63	sell	36	女
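
A many-to-many merge multiplies rows: three "tech" rows on one side and two on the other produce six. When that is not intended, the validate parameter of merge can check the relationship up front. The sketch below uses simplified stand-ins for df3 and df5 (homes and staff are illustrative names, and the city names are written in English for readability).

```python
import pandas as pd

# Simplified stand-ins for df3 and df5 from the post.
homes = pd.DataFrame({
    "home": ["Shenzhen", "Beijing", "Shanghai", "Anhui", "Shandong"],
    "work": ["tech", "tech", "tech", "accounting", "sell"],
})
staff = pd.DataFrame({
    "age": [28, 30, 22, 36],
    "work": ["tech", "tech", "accounting", "sell"],
})

# Many-to-many: 3 x 2 tech rows + 1 accounting + 1 sell = 8 rows.
many = homes.merge(staff, on="work")

# Ask pandas to verify that the right-hand keys are unique.
# "tech" appears twice in staff, so this raises MergeError.
try:
    homes.merge(staff, on="work", validate="many_to_one")
    error = None
except pd.errors.MergeError as exc:
    error = exc

print(len(many), type(error).__name__)
```

Checking the relationship this way is cheap insurance before merging large tables, where an accidental row explosion is easy to miss.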

4) Normalizing the key

Use on= to state explicitly which column is the key when the two frames share more than one column name.

df5


	age	sex	work
a	28	女	tech
b	30	男	tech
c	22	女	accounting
e	36	女	sell

df6 = DataFrame({"age":[30,27,36],"work":["tech","leader","sell"],"hoppy":["sixdog","diaofish","playcat"]}, index = list("abc"))

df6


	age	hoppy	work
a	30	sixdog	tech
b	27	diaofish	leader
c	36	playcat	sell

df5.merge(df6, on = "age", suffixes=["_总部","_分部"])


	age	sex	work_总部	hoppy	work_分部
0	30	男	tech	sixdog	tech
1	36	女	sell	playcat	sell

df5.merge(df6,on = "work")


	age_x	sex	work	age_y	hoppy
0	28	女	tech	30	sixdog
1	30	男	tech	30	sixdog
2	36	女	sell	36	playcat

Use left_on and right_on to name the key column on each side when the key columns have different names in the two frames.

df5


	age	sex	work
a	28	女	tech
b	30	男	tech
c	22	女	accounting
e	36	女	sell

df7 = DataFrame({"年龄":[30,22,36],"工作":["tech","accounting","sell"],"性别":["男","女","女"]},index = list("abc"))

df7


	工作	年龄	性别
a	tech	30	男
b	accounting	22	女
c	sell	36	女

df5.merge(df7,left_on = "work", right_on = "工作")


	age	sex	work	工作	年龄	性别
0	28	女	tech	tech	30	男
1	30	男	tech	tech	30	男
2	22	女	accounting	accounting	22	女
3	36	女	sell	sell	36	女

df5


	age	sex	work
a	28	女	tech
b	30	男	tech
c	22	女	accounting
e	36	女	sell

s = df5[["age"]]*1000

s.columns = ["salary"]

s

# Column names can be changed by assigning to .columns

	salary
a	28000
b	30000
c	22000
e	36000

df5.merge(s, left_index = True,right_index=True)


	age	sex	work	salary
a	28	女	tech	28000
b	30	男	tech	30000
c	22	女	accounting	22000
e	36	女	sell	36000

pd.concat([df5,s],axis = 1)


	age	sex	work	salary
a	28	女	tech	28000
b	30	男	tech	30000
c	22	女	accounting	22000
e	36	女	sell	36000
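
The introduction lists DataFrame.join next to pd.merge, but the post never shows it. As a minimal sketch, join is a shorthand for merging on the index, so for the index-aligned case above it gives the same table as merge with left_index=True and right_index=True. The frame below is a reduced copy of df5 (the sex column is omitted for brevity).

```python
import pandas as pd

# Reduced copy of df5 from the post.
df5 = pd.DataFrame(
    {"age": [28, 30, 22, 36], "work": ["tech", "tech", "accounting", "sell"]},
    index=list("abce"),
)

# Derive a one-column salary frame that shares df5's index.
s = (df5[["age"]] * 1000).rename(columns={"age": "salary"})

# join aligns on the index; the default is a left join.
joined = df5.join(s)

# The explicit merge form from the post, for comparison.
merged = df5.merge(s, left_index=True, right_index=True)

print(joined["salary"].tolist())
```

The defaults differ, though: join keeps every row of the left frame (how='left'), while merge keeps only matching keys (how='inner'). They agree here only because both frames have the same index.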

============================================

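Exercise 16:

Suppose there are two score tables: ddd covers Zhang San, Li Si and Wang Laowu, and ddd4 covers Zhang San and Zhao Xiaoliu. How would you merge them?
What if Zhang San's name is misspelled in ddd4 as Zhang Shisan?
Practice the many-to-one and many-to-many cases on your own.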

============================================

5) Inner and outer merges

Inner merge: keep only the keys present in both frames (the default)

df3


	home	weight	work
a	深圳	60	tech
b	北京	75	tech
c	上海	80	tech
d	安徽	54	accounting
e	山东	63	sell

df5


	age	sex	work
a	28	女	tech
b	30	男	tech
c	22	女	accounting
e	36	女	sell

df6


	age	hoppy	work
a	30	sixdog	tech
b	27	diaofish	leader
c	36	playcat	sell

df3


	home	weight	work
a	深圳	60	tech
b	北京	75	tech
c	上海	80	tech
d	安徽	54	accounting
e	山东	63	sell

df3.merge(df6)


	home	weight	work	age	hoppy
0	深圳	60	tech	30	sixdog
1	北京	75	tech	30	sixdog
2	上海	80	tech	30	sixdog
3	山东	63	sell	36	playcat

Outer merge, how='outer': keep all keys and fill the gaps with NaN

df3.merge(df6,how = "outer")


	home	weight	work	age	hoppy
0	深圳	60.0	tech	30.0	sixdog
1	北京	75.0	tech	30.0	sixdog
2	上海	80.0	tech	30.0	sixdog
3	安徽	54.0	accounting	NaN	NaN
4	山东	63.0	sell	36.0	playcat
5	NaN	NaN	leader	27.0	diaofish

Left merge and right merge: how='left', how='right'

df3


	home	weight	work
a	深圳	60	tech
b	北京	75	tech
c	上海	80	tech
d	安徽	54	accounting
e	山东	63	sell

df6


	age	hoppy	work
a	30	sixdog	tech
b	27	diaofish	leader
c	36	playcat	sell

df3.merge(df6, how = "left")


	home	weight	work	age	hoppy
0	深圳	60	tech	30.0	sixdog
1	北京	75	tech	30.0	sixdog
2	上海	80	tech	30.0	sixdog
3	安徽	54	accounting	NaN	NaN
4	山东	63	sell	36.0	playcat

df3.merge(df6, how = "right")


	home	weight	work	age	hoppy
0	深圳	60.0	tech	30	sixdog
1	北京	75.0	tech	30	sixdog
2	上海	80.0	tech	30	sixdog
3	山东	63.0	sell	36	playcat
4	NaN	NaN	leader	27	diaofish
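
To see where each row of an outer merge came from, merge accepts indicator=True, which adds a _merge column holding "left_only", "right_only" or "both". The sketch below uses reduced copies of df3 and df6 (only the columns needed for the point, with city names written in English) and should be read as an illustration, not as part of the original notebook.

```python
import pandas as pd

# Reduced copies of df3 and df6 from the post.
df3 = pd.DataFrame({
    "home": ["Shenzhen", "Beijing", "Shanghai", "Anhui", "Shandong"],
    "work": ["tech", "tech", "tech", "accounting", "sell"],
})
df6 = pd.DataFrame({
    "age": [30, 27, 36],
    "work": ["tech", "leader", "sell"],
})

# indicator=True records the origin of every row in a "_merge" column.
outer = df3.merge(df6, how="outer", indicator=True)
counts = outer["_merge"].astype(str).value_counts().to_dict()

# Keys that exist only on one side are the rows an inner merge would drop.
left_only = outer.loc[outer["_merge"] == "left_only", "work"].tolist()
right_only = outer.loc[outer["_merge"] == "right_only", "work"].tolist()

print(counts, left_only, right_only)
```

Filtering on _merge is a quick way to audit which keys failed to match before deciding between how='inner', 'left', 'right' and 'outer'.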

============================================

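Exercise 17:

If you only have Zhang San's and Zhao Xiaoliu's scores in three subjects (Chinese, math and English), how would you merge them?
Considering realistic scenarios, merge ddd and ddd4 in several different ways.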

============================================

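6) Resolving column conflicts

When columns conflict, that is, when several columns share a name, use on= to choose which column is the key; the remaining conflicting columns are renamed with suffixes.

The default suffixes are _x and _y; pass suffixes= to choose your own.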

============================================

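Exercise 18:

Suppose two students are both named Li Si, and ddd5 and ddd6 are both score tables for Zhang San and Li Si. How would you merge them?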

============================================

Homework

3. Case study: analyzing US state population data

First load the files and look at a sample of the data.

pop = pd.read_csv("./state-population.csv")

pop.head(20)


	state/region	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0
2	AL	under18	2010	1130966.0
3	AL	total	2010	4785570.0
4	AL	under18	2011	1125763.0
5	AL	total	2011	4801627.0
6	AL	total	2009	4757938.0
7	AL	under18	2009	1134192.0
8	AL	under18	2013	1111481.0
9	AL	total	2013	4833722.0
10	AL	total	2007	4672840.0
11	AL	under18	2007	1132296.0
12	AL	total	2008	4718206.0
13	AL	under18	2008	1134927.0
14	AL	total	2005	4569805.0
15	AL	under18	2005	1117229.0
16	AL	total	2006	4628981.0
17	AL	under18	2006	1126798.0
18	AL	total	2004	4530729.0
19	AL	under18	2004	1113662.0

pop.shape

(2544, 4)

areas = pd.read_csv("./state-areas.csv")

areas


	state	area (sq. mi)
0	Alabama	52423
1	Alaska	656425
2	Arizona	114006
3	Arkansas	53182
4	California	163707
5	Colorado	104100
6	Connecticut	5544
7	Delaware	1954
8	Florida	65758
9	Georgia	59441
10	Hawaii	10932
11	Idaho	83574
12	Illinois	57918
13	Indiana	36420
14	Iowa	56276
15	Kansas	82282
16	Kentucky	40411
17	Louisiana	51843
18	Maine	35387
19	Maryland	12407
20	Massachusetts	10555
21	Michigan	96810
22	Minnesota	86943
23	Mississippi	48434
24	Missouri	69709
25	Montana	147046
26	Nebraska	77358
27	Nevada	110567
28	New Hampshire	9351
29	New Jersey	8722
30	New Mexico	121593
31	New York	54475
32	North Carolina	53821
33	North Dakota	70704
34	Ohio	44828
35	Oklahoma	69903
36	Oregon	98386
37	Pennsylvania	46058
38	Rhode Island	1545
39	South Carolina	32007
40	South Dakota	77121
41	Tennessee	42146
42	Texas	268601
43	Utah	84904
44	Vermont	9615
45	Virginia	42769
46	Washington	71303
47	West Virginia	24231
48	Wisconsin	65503
49	Wyoming	97818
50	District of Columbia	68
51	Puerto Rico	3515

areas.shape

(52, 2)

abbr = pd.read_csv("./state-abbrevs.csv")

abbr.head()


	state	abbreviation
0	Alabama	AL
1	Alaska	AK
2	Arizona	AZ
3	Arkansas	AR
4	California	CA

abbr.shape

(51, 2)

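Merge the pop and abbr DataFrames, using the state/region column of pop and the abbreviation column of abbr as the keys.

To keep every row of pop, including rows whose code has no match in abbr, the code below uses a left merge (how = "left").

# pop has 2544 rows; abbr has 51 rows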

pop2 = pop.merge(abbr,left_on = "state/region", right_on = "abbreviation", how = "left")

pop2.head()


	state/region	ages	year	population	state	abbreviation
0	AL	under18	2012	1117489.0	Alabama	AL
1	AL	total	2012	4817528.0	Alabama	AL
2	AL	under18	2010	1130966.0	Alabama	AL
3	AL	total	2010	4785570.0	Alabama	AL
4	AL	under18	2011	1125763.0	Alabama	AL

Drop the abbreviation column (axis=1)

pop2.drop("abbreviation", axis = 1,inplace=True)

pop2


	state/region	ages	year	population	state
0	AL	under18	2012	1117489.0	Alabama
1	AL	total	2012	4817528.0	Alabama
2	AL	under18	2010	1130966.0	Alabama
3	AL	total	2010	4785570.0	Alabama
4	AL	under18	2011	1125763.0	Alabama
5	AL	total	2011	4801627.0	Alabama
6	AL	total	2009	4757938.0	Alabama
7	AL	under18	2009	1134192.0	Alabama
8	AL	under18	2013	1111481.0	Alabama
9	AL	total	2013	4833722.0	Alabama
10	AL	total	2007	4672840.0	Alabama
11	AL	under18	2007	1132296.0	Alabama
12	AL	total	2008	4718206.0	Alabama
13	AL	under18	2008	1134927.0	Alabama
14	AL	total	2005	4569805.0	Alabama
15	AL	under18	2005	1117229.0	Alabama
16	AL	total	2006	4628981.0	Alabama
17	AL	under18	2006	1126798.0	Alabama
18	AL	total	2004	4530729.0	Alabama
19	AL	under18	2004	1113662.0	Alabama
20	AL	total	2003	4503491.0	Alabama
21	AL	under18	2003	1113083.0	Alabama
22	AL	total	2001	4467634.0	Alabama
23	AL	under18	2001	1120409.0	Alabama
24	AL	total	2002	4480089.0	Alabama
25	AL	under18	2002	1116590.0	Alabama
26	AL	under18	1999	1121287.0	Alabama
27	AL	total	1999	4430141.0	Alabama
28	AL	total	2000	4452173.0	Alabama
29	AL	under18	2000	1122273.0	Alabama
...	...	...	...	...	...
2514	USA	under18	1999	71946051.0	NaN
2515	USA	total	2000	282162411.0	NaN
2516	USA	under18	2000	72376189.0	NaN
2517	USA	total	1999	279040181.0	NaN
2518	USA	total	2001	284968955.0	NaN
2519	USA	under18	2001	72671175.0	NaN
2520	USA	total	2002	287625193.0	NaN
2521	USA	under18	2002	72936457.0	NaN
2522	USA	total	2003	290107933.0	NaN
2523	USA	under18	2003	73100758.0	NaN
2524	USA	total	2004	292805298.0	NaN
2525	USA	under18	2004	73297735.0	NaN
2526	USA	total	2005	295516599.0	NaN
2527	USA	under18	2005	73523669.0	NaN
2528	USA	total	2006	298379912.0	NaN
2529	USA	under18	2006	73757714.0	NaN
2530	USA	total	2007	301231207.0	NaN
2531	USA	under18	2007	74019405.0	NaN
2532	USA	total	2008	304093966.0	NaN
2533	USA	under18	2008	74104602.0	NaN
2534	USA	under18	2013	73585872.0	NaN
2535	USA	total	2013	316128839.0	NaN
2536	USA	total	2009	306771529.0	NaN
2537	USA	under18	2009	74134167.0	NaN
2538	USA	under18	2010	74119556.0	NaN
2539	USA	total	2010	309326295.0	NaN
2540	USA	under18	2011	73902222.0	NaN
2541	USA	total	2011	311582564.0	NaN
2542	USA	under18	2012	73708179.0	NaN
2543	USA	total	2012	313873685.0	NaN

2544 rows × 5 columns

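Check for missing data.

.isnull().any() returns True for every column that contains at least one missing value; with axis = 1, as below, it instead flags every row that contains a missing value.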

cond = pop2.isnull().any(axis = 1)

pop2[cond]


	state/region	ages	year	population	state
2448	PR	under18	1990	NaN	NaN
2449	PR	total	1990	NaN	NaN
2450	PR	total	1991	NaN	NaN
2451	PR	under18	1991	NaN	NaN
2452	PR	total	1993	NaN	NaN
2453	PR	under18	1993	NaN	NaN
2454	PR	under18	1992	NaN	NaN
2455	PR	total	1992	NaN	NaN
2456	PR	under18	1994	NaN	NaN
2457	PR	total	1994	NaN	NaN
2458	PR	total	1995	NaN	NaN
2459	PR	under18	1995	NaN	NaN
2460	PR	under18	1996	NaN	NaN
2461	PR	total	1996	NaN	NaN
2462	PR	under18	1998	NaN	NaN
2463	PR	total	1998	NaN	NaN
2464	PR	total	1997	NaN	NaN
2465	PR	under18	1997	NaN	NaN
2466	PR	total	1999	NaN	NaN
2467	PR	under18	1999	NaN	NaN
2468	PR	total	2000	3810605.0	NaN
2469	PR	under18	2000	1089063.0	NaN
2470	PR	total	2001	3818774.0	NaN
2471	PR	under18	2001	1077566.0	NaN
2472	PR	total	2002	3823701.0	NaN
2473	PR	under18	2002	1065051.0	NaN
2474	PR	total	2004	3826878.0	NaN
2475	PR	under18	2004	1035919.0	NaN
2476	PR	total	2003	3826095.0	NaN
2477	PR	under18	2003	1050615.0	NaN
...	...	...	...	...	...
2514	USA	under18	1999	71946051.0	NaN
2515	USA	total	2000	282162411.0	NaN
2516	USA	under18	2000	72376189.0	NaN
2517	USA	total	1999	279040181.0	NaN
2518	USA	total	2001	284968955.0	NaN
2519	USA	under18	2001	72671175.0	NaN
2520	USA	total	2002	287625193.0	NaN
2521	USA	under18	2002	72936457.0	NaN
2522	USA	total	2003	290107933.0	NaN
2523	USA	under18	2003	73100758.0	NaN
2524	USA	total	2004	292805298.0	NaN
2525	USA	under18	2004	73297735.0	NaN
2526	USA	total	2005	295516599.0	NaN
2527	USA	under18	2005	73523669.0	NaN
2528	USA	total	2006	298379912.0	NaN
2529	USA	under18	2006	73757714.0	NaN
2530	USA	total	2007	301231207.0	NaN
2531	USA	under18	2007	74019405.0	NaN
2532	USA	total	2008	304093966.0	NaN
2533	USA	under18	2008	74104602.0	NaN
2534	USA	under18	2013	73585872.0	NaN
2535	USA	total	2013	316128839.0	NaN
2536	USA	total	2009	306771529.0	NaN
2537	USA	under18	2009	74134167.0	NaN
2538	USA	under18	2010	74119556.0	NaN
2539	USA	total	2010	309326295.0	NaN
2540	USA	under18	2011	73902222.0	NaN
2541	USA	total	2011	311582564.0	NaN
2542	USA	under18	2012	73708179.0	NaN
2543	USA	total	2012	313873685.0	NaN

96 rows × 5 columns

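Inspect the missing data

Use the boolean mask to select rows: a row is shown when its mask value is True.

Find which state/region codes have NaN in the state column; use unique() to list the distinct values.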

pop2.head()


	state/region	ages	year	population	state
0	AL	under18	2012	1117489.0	Alabama
1	AL	total	2012	4817528.0	Alabama
2	AL	under18	2010	1130966.0	Alabama
3	AL	total	2010	4785570.0	Alabama
4	AL	under18	2011	1125763.0	Alabama

# Find the state abbreviations whose state value is missing

cond_state = pop2["state"].isnull()

cond_state

0       False

1       False

2       False

3       False

4       False

5       False

6       False

7       False

8       False

9       False

10      False

11      False

12      False

13      False

14      False

15      False

16      False

17      False

18      False

19      False

20      False

21      False

22      False

23      False

24      False

25      False

26      False

27      False

28      False

29      False

        ...

2514     True

2515     True

2516     True

2517     True

2518     True

2519     True

2520     True

2521     True

2522     True

2523     True

2524     True

2525     True

2526     True

2527     True

2528     True

2529     True

2530     True

2531     True

2532     True

2533     True

2534     True

2535     True

2536     True

2537     True

2538     True

2539     True

2540     True

2541     True

2542     True

2543     True

Name: state, Length: 2544, dtype: bool

pop2[cond_state]["state/region"].unique()

array(['PR', 'USA'], dtype=object)

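Fill in the correct state value for each of these state/region codes so that the state column no longer contains NaN.

Remember this technique for cleaning up missing values.

Merge in the state area data, areas, using a left merge.

Think about why a left or outer merge is used here rather than an inner merge.

Continue looking for columns with missing data.

The area (sq. mi) column turns out to have missing values; to locate the affected rows, find which state has no area data.

Drop the rows that contain missing data.

Check again whether any data is missing.

Select the total population figures for 2010 with df.query(query string).

Post-process the query result by making the state column the new row index with set_index.

Compute the population density. Note that this is a Series divided by a Series, so the result is also a Series.

Sort with sort_values() and find the five states with the highest population density.

Find the five states with the lowest population density.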
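
The post lists these steps without code. The sketch below walks through them on tiny synthetic stand-ins for the three CSV files, so it should be read as one possible approach and not as the author's solution. The Alabama and USA figures are copied from the tables above; the California and Puerto Rico populations, and the name "United States" used to fill in the USA rows, are made up for illustration.

```python
import pandas as pd

# Tiny synthetic stand-ins for state-population.csv, state-abbrevs.csv
# and state-areas.csv. CA and PR populations are illustrative only.
pop = pd.DataFrame({
    "state/region": ["AL", "AL", "CA", "PR", "USA"],
    "ages": ["total", "under18", "total", "total", "total"],
    "year": [2010, 2010, 2010, 2010, 2010],
    "population": [4785570.0, 1130966.0, 37000000.0, 3700000.0, 309326295.0],
})
abbr = pd.DataFrame({"state": ["Alabama", "California"],
                     "abbreviation": ["AL", "CA"]})
areas = pd.DataFrame({"state": ["Alabama", "California", "Puerto Rico"],
                      "area (sq. mi)": [52423, 163707, 3515]})

# Attach full state names; the left merge keeps every population row.
pop2 = pop.merge(abbr, left_on="state/region", right_on="abbreviation", how="left")
pop2 = pop2.drop("abbreviation", axis=1)

# Fill in the names that the abbreviation table lacks, using loc.
pop2.loc[pop2["state/region"] == "PR", "state"] = "Puerto Rico"
pop2.loc[pop2["state/region"] == "USA", "state"] = "United States"

# Attach areas; "United States" has no area row, so its area is NaN.
pop3 = pop2.merge(areas, on="state", how="left")
missing_area = pop3.loc[pop3["area (sq. mi)"].isnull(), "state"].unique()

# Drop incomplete rows, then keep the 2010 totals indexed by state.
pop3 = pop3.dropna()
pop_2010 = pop3.query("year == 2010 and ages == 'total'").set_index("state")

# Series / Series aligns on the state index and returns a Series.
density = pop_2010["population"] / pop_2010["area (sq. mi)"]
density = density.sort_values(ascending=False)

print(density.head())  # highest densities first
print(density.tail())  # lowest densities last
```

On the real data, head(5) and tail(5) of the sorted Series give the five densest and the five least dense states.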

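Key points:

Index consistently with loc[].
Use .isnull().any() to find the columns that contain NaN.
Use .unique() to determine which keys in a column need attention.
Prefer outer or left merges: it is better to have NaN in a column than to discard information from the other columns.

Review: how Series/DataFrame arithmetic differs from ndarray arithmetic

Series and DataFrame arithmetic aligns on index labels instead of broadcasting by position; a label missing from either operand yields NaN, unless you use add with fill_value to supply a default.
ndarray arithmetic broadcasts: existing values are repeated to make the shapes match.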
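
The difference can be seen in a few lines. This is a minimal sketch with made-up values; s1, s2 and the array shapes are illustrative only.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([10, 20, 30], index=list("bcd"))

# Labels are matched by name: "a" and "d" exist on one side only, so NaN.
aligned = s1 + s2

# fill_value substitutes 0 for the missing side before adding.
filled = s1.add(s2, fill_value=0)

# ndarray arithmetic matches by position and broadcasts shapes:
# (3,) + (2, 1) -> (2, 3)
broadcast = np.array([1, 2, 3]) + np.array([[10], [20]])

print(aligned.isna().sum(), filled.tolist(), broadcast.shape)
```

With fill_value, a result is NaN only when the label is missing (or NaN) on both sides.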
