Python: How can I compare the values of certain columns in one DataFrame with the same set of columns in another DataFrame?
I have three DataFrames, df1, df2 and df3, defined as follows:
df1 =
   A  B   C
0  1  a  a1
1  2  b  b2
2  3  c  c3
3  4  d  d4
4  5  e  e5
5  6  f  f6
df2 =
   A  B  C
0  1  a  X
1  2  b  Y
2  3  c  Z
df3 =
   A  B  C
3  4  d  P
4  5  e  Q
5  6  f  R
I have defined a list of primary keys, PK = ["A", "B"].
Now I take a fourth DataFrame, df4 = df1.sample(n=2), which gives something like the following:
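For anyone who wants to reproduce the question, the frames above can be rebuilt roughly as follows. This is only a sketch: the index labels of df3 (3, 4, 5) are taken from the printed table, and the column dtypes are assumed to be plain integers and strings.

```python
import pandas as pd

# Rebuild the example frames shown in the question.
df1 = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["a", "b", "c", "d", "e", "f"],
        "C": ["a1", "b2", "c3", "d4", "e5", "f6"],
    }
)
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"], "C": ["X", "Y", "Z"]})
# df3 keeps the index labels 3, 4, 5 as printed in the question.
df3 = pd.DataFrame(
    {"A": [4, 5, 6], "B": ["d", "e", "f"], "C": ["P", "Q", "R"]},
    index=[3, 4, 5],
)

# Columns that together identify a row.
PK = ["A", "B"]
```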
df4 =
   A  B   C
4  5  e  e5
1  2  b  b2
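Now I want to select the rows from df2 and df3 that match the primary-key values of df4.
For example, in this case I need to get the row with index = 4 from df3 and the row with index = 1 from df2.
If possible, I would like to obtain a DataFrame like the following: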
df =
   A  B   C  A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
4  5  e  e5                               5       e       Q
1  2  b  b2       2       b       Y
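Any ideas on how to solve this would be very helpful.

Here is how I would solve this on the whole dataset. If you want to sample first, either replace df1 with df4 in the first merge call at the end, or simply sample t afterwards.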
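import pandas as pd

PK = ["A", "B"]

# Duplicate df2's columns, suffix the copies, and keep unsuffixed A and B as merge keys.
df2 = pd.concat([df2, df2], axis=1)
df2.columns = ['A', 'B', 'C', 'A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)

# Same treatment for df3.
df3 = pd.concat([df3, df3], axis=1)
df3.columns = ['A', 'B', 'C', 'A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)

# Left merges keep every row of df1; rows without a match get NaN.
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')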
Output:
   A  B   C  A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
0  1  a  a1     1.0       a       X     NaN     NaN     NaN
1  2  b  b2     2.0       b       Y     NaN     NaN     NaN
2  3  c  c3     3.0       c       Z     NaN     NaN     NaN
3  4  d  d4     NaN     NaN     NaN     4.0       d       P
4  5  e  e5     NaN     NaN     NaN     5.0       e       Q
5  6  f  f6     NaN     NaN     NaN     6.0       f       R
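The "sample first" variant mentioned above could look roughly like this. It is a sketch, not the answerer's code: with_key_copies is a hypothetical helper introduced here so that df2 and df3 are not modified in place, and df4 is fixed to the two rows shown in the question instead of calling df1.sample(n=2), so that the result is reproducible.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["a", "b", "c", "d", "e", "f"],
        "C": ["a1", "b2", "c3", "d4", "e5", "f6"],
    }
)
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"], "C": ["X", "Y", "Z"]})
df3 = pd.DataFrame(
    {"A": [4, 5, 6], "B": ["d", "e", "f"], "C": ["P", "Q", "R"]},
    index=[3, 4, 5],
)
PK = ["A", "B"]


def with_key_copies(frame, label, keys):
    """Hypothetical helper: suffix every column with "(label)" and
    re-attach the unsuffixed key columns so they can be merged on."""
    suffixed = frame.add_suffix(f"({label})")  # returns a new DataFrame
    for key in keys:
        suffixed[key] = frame[key]
    return suffixed


# Fixed stand-in for df1.sample(n=2): the two rows shown in the question.
df4 = df1.loc[[4, 1]]

# Same two left merges as above, but starting from the sample.
t = df4.merge(with_key_copies(df2, "df2", PK), on=PK, how="left")
t = t.merge(with_key_copies(df3, "df3", PK), on=PK, how="left")
print(t)
```

Because the helper returns new frames, the original df2 and df3 keep their three columns and the snippet can be re-run without the column renaming failing on a second pass.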
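
Use two consecutive merge operations, applying add_suffix to the right-hand DataFrame each time, to merge df4, df2 and df3. Finally, use fillna to replace the missing values with empty strings: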
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
   A  B   C  A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
0  5  e  e5                               5       e       Q
1  2  b  b2       2       b       Y
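Note that merge returns a fresh 0..n-1 index, whereas the desired output in the question keeps the original labels 4 and 1. One possible way to carry them through, shown here as a sketch rather than as part of the answer above, is to move the index into a column before merging and restore it afterwards. The fixed df4 is an assumption standing in for df1.sample(n=2), and the column name "index" is simply what reset_index produces for an unnamed index.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["a", "b", "c", "d", "e", "f"],
        "C": ["a1", "b2", "c3", "d4", "e5", "f6"],
    }
)
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"], "C": ["X", "Y", "Z"]})
df3 = pd.DataFrame(
    {"A": [4, 5, 6], "B": ["d", "e", "f"], "C": ["P", "Q", "R"]},
    index=[3, 4, 5],
)
PK = ["A", "B"]

# Fixed stand-in for df1.sample(n=2): the two rows shown in the question.
df4 = df1.loc[[4, 1]]

df = (
    df4.reset_index()  # original labels move into a column named "index"
    .merge(df2.add_suffix("(df2)"), left_on=PK, right_on=["A(df2)", "B(df2)"], how="left")
    .merge(df3.add_suffix("(df3)"), left_on=PK, right_on=["A(df3)", "B(df3)"], how="left")
    .set_index("index")  # restore the original labels
    .rename_axis(None)  # drop the leftover "index" axis name
)
print(df)
```

The .fillna('') step from the answer can be appended to the end of the chain in the same way if empty strings are preferred over NaN.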