Python Pandas - Join with missing columns as NaN
Imagine two DataFrames:
import pandas as pd

X = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
Y = pd.DataFrame([10, 20, 30], columns=["a"])
>>> X
   a  b
0  1  2
1  3  4
2  5  6
>>> Y
    a
0  10
1  20
2  30
Overall, I would like my final output to look like this:
   a_X  b_X  a_Y  b_Y  sum_a  sum_b
0    1    2   10  NaN     11      2
1    3    4   20  NaN     23      4
2    5    6   30  NaN     35      6
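I tried to achieve this with:
merged = X.join(Y, lsuffix="_X", rsuffix="_Y")
merged['sum_a'] = merged['a_X'] + merged['a_Y']  # works
merged['sum_b'] = merged['b_X'] + merged['b_Y']  # fails with a KeyError
Obviously, the sum_b line fails because Y has no b column: join only applies the suffixes to column names that appear in both frames, so the merged frame contains a plain b column and neither b_X nor b_Y. Y may contain a b column, but my data gives no guarantee that it does. It looks like I cannot use the built-in join to add a "NaN" column there.
You can do the following: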
import numpy as np
Y['b'] = np.nan
merged = X.join(Y, lsuffix="_X", rsuffix="_Y")
merged['sum_a'] = merged['a_X'] + merged['a_Y']
merged['sum_b'] = merged['b_X'] + merged['b_Y'].fillna(0)  # treat the missing b_Y values as 0
#>>> merged
# a_X b_X a_Y b_Y sum_a sum_b
#0 1 2 10 NaN 11 2.0
#1 3 4 20 NaN 23 4.0
#2 5 6 30 NaN 35 6.0
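Since the question says there is no guarantee about which columns Y contains, the same idea can be sketched for any number of missing columns. This is only an illustration under the assumption that every column of interest exists in X; the names Y_full and col are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
Y = pd.DataFrame([10, 20, 30], columns=["a"])

# Work on a copy so the caller's Y is left untouched (Y_full is a hypothetical name).
Y_full = Y.copy()

# Add every column that X has and Y lacks, filled with NaN.
for col in X.columns.difference(Y.columns):
    Y_full[col] = np.nan

# Both frames now share all column names, so every column receives a suffix.
merged = X.join(Y_full, lsuffix="_X", rsuffix="_Y")

# Sum each pair of columns, treating missing values as 0.
for col in X.columns:
    merged[f"sum_{col}"] = merged[f"{col}_X"].fillna(0) + merged[f"{col}_Y"].fillna(0)

print(merged)
```

Because Y_full always has the same column names as X after the first loop, the suffixed lookups in the second loop cannot raise a KeyError.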
Concatenate with pd.concat:
k = ['X', 'Y']
df = pd.concat([X, Y], keys=k, axis=1)
df
   X      Y
   a  b   a
0  1  2  10
1  3  4  20
2  5  6  30
Generate a MultiIndex containing every (frame, column) pair and reindex with it:
idx = pd.MultiIndex.from_product([k, df.columns.levels[1].unique()])
df = df.reindex(columns=idx)
df
   X        Y
   a  b   a   b
0  1  2  10 NaN
1  3  4  20 NaN
2  5  6  30 NaN
Flatten the column names by joining the two levels with an underscore:
df.columns = df.columns.map('_'.join)
df
X_a X_b Y_a Y_b
0 1 2 10 NaN
1 3 4 20 NaN
2 5 6 30 NaN
Now you can group by the suffix and take the sum:
v = df.groupby(by=lambda x: x.split('_')[1], axis=1).sum().add_prefix('sum_')
v
sum_a sum_b
0 11.0 2.0
1 23.0 4.0
2 35.0 6.0
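In recent pandas releases (to my knowledge, 2.1 and later), passing axis=1 to groupby emits a deprecation warning. A minimal sketch of an equivalent computation, assuming the same flattened df as above, is to transpose, group on the index, and transpose back:

```python
import pandas as pd

X = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
Y = pd.DataFrame([10, 20, 30], columns=["a"])

# Rebuild the flattened frame from the steps above.
k = ["X", "Y"]
df = pd.concat([X, Y], keys=k, axis=1)
idx = pd.MultiIndex.from_product([k, df.columns.levels[1]])
df = df.reindex(columns=idx)
df.columns = df.columns.map("_".join)

# Transpose so the column names become the index, group by the part after
# the underscore, sum (NaN is skipped by default), then transpose back.
v = df.T.groupby(lambda name: name.split("_")[1]).sum().T.add_prefix("sum_")

print(v)
```

The transpose upcasts the values to float, so the sums come back as floats, the same as in the output shown above.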
Concatenate this result with the original frame:
pd.concat([df, v], axis=1)
X_a X_b Y_a Y_b sum_a sum_b
0 1 2 10 NaN 11.0 2.0
1 3 4 20 NaN 23.0 4.0
2 5 6 30 NaN 35.0 6.0
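An alternative that is closer to what you are already doing: since Y need not have the same columns as X, you can call reindex on Y and then perform the addition with the fill_value option: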
Y = Y.reindex(columns=X.columns)
>>> Y
# a b
#0 10 NaN
#1 20 NaN
#2 30 NaN
merged = X.join(Y, lsuffix="_X", rsuffix="_Y")
merged['sum_a'] = merged['a_X'].add(merged['a_Y'], fill_value=0)
merged['sum_b'] = merged['b_X'].add(merged['b_Y'], fill_value=0)
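If only the sums matter, the per-column additions can probably be collapsed into a single frame-level call, because DataFrame.add aligns on column names and accepts fill_value as well. The following is a sketch rather than part of the original answer; the names sums and result are hypothetical.

```python
import pandas as pd

X = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
Y = pd.DataFrame([10, 20, 30], columns=["a"])

# Align X and Y on column names; a value missing on one side counts as 0.
sums = X.add(Y, fill_value=0).add_prefix("sum_")

# Build the suffixed columns explicitly instead of relying on join,
# which only suffixes names that occur in both frames.
result = pd.concat(
    [
        X.add_suffix("_X"),
        Y.reindex(columns=X.columns).add_suffix("_Y"),
        sums,
    ],
    axis=1,
)

print(result)
```

Note that X.add(Y, fill_value=0) works on the union of the two column sets, so a column present only in Y would also get a sum_ column even though the reindex step drops it from the _Y part.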