Python: map dataframe columns based on priority


I have a main dataframe (main_df) like:

     A     B       X    Y     Id1
0  cat  cat1  catabc  0.1  uuid01
1  cat  cat1  catxyz  0.4  uuid02
2  cat  cat2  catpqr  0.5  uuid01
3  dog  dog1  dogxyz  0.3  uuid03
4  dog  dog2  dogpqr  0.2  uuid02
5  dog  dog2  dogabc  0.8  uuid01

Another dataframe (map_df) will help with the mapping:

     A     B       X    Y   Id2
0  cat  cat1  catxyz  0.4  nx01
1  cat  cat1     NaN  NaN  nx02
2  cat   NaN     NaN  NaN  nx03
3  dog  dog1  dogxyz  NaN  nx04
4  dog   NaN     NaN  NaN  nx05
The priority order for mapping the columns is: ['A', 'B', 'X', 'Y']

So if all the cells of a row of map_df match any row of main_df, the corresponding element of Id2 should be added to main_df. If Y is absent, the mapping should proceed with ['A', 'B', 'X']; if X is also absent, it should carry on with ['A', 'B'], and so on.

I was able to get a partial result with a dataframe merge. For example:

main_df = main_df.merge(map_df, how='left', on=['A', 'B', 'X', 'Y'])

But I can't work out the priority-based angle.
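
To make the intended fallback concrete, one possible shape is a left merge per priority level, from the longest key down, filling Id2 only where it is still empty. A minimal sketch (the level filtering and the `_lvl` suffix are illustrative, and it assumes each key combination occurs at most once in map_df):

import pandas as pd

priority = ['A', 'B', 'X', 'Y']
result = main_df.copy()
result['Id2'] = pd.NA
# try the longest key first, then progressively shorter prefixes
for k in range(len(priority), 0, -1):
    keys = priority[:k]
    # a map row belongs to this level when it has values for `keys`
    # and NaN for every lower-priority column
    level = map_df[map_df[priority[k:]].isna().all(axis=1)].dropna(subset=keys)
    matched = result.merge(level[keys + ['Id2']], on=keys,
                           how='left', suffixes=('', '_lvl'))
    # fill Id2 only where no higher-priority level matched already
    result['Id2'] = result['Id2'].fillna(matched['Id2_lvl'])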

As a result of the mapping, the dataframe (result_df) should look like this:

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogpqr  0.2  uuid02  nx05
5  dog  dog2  dogabc  0.8  uuid01  nx05

This will get the job done, though it's not pretty.

It is an iterative algorithm: a forward pass over the data, then a backward pass.

The forward pass resolves merges at progressively lower priority, eliminating the matched rows from the data structure. It uses an inner merge strategy.

The backward pass updates the output from lowest priority up to highest. I noticed the final update call is a bit slow.

merge_on = ['A', 'B', 'X', 'Y']
tot_cols = len(merge_on)
# keep the original row id so the backward pass can align on it
operation_main_df = main_df.reset_index()
outputs = []
# forward pass on progressively smaller key sets
for first_idx in range(len(merge_on)):
    merged_id2 = operation_main_df.merge(map_df, how='inner',
        on=merge_on[0:tot_cols - first_idx], suffixes=("", "_y"))
    # the relevant output has the right number of NaN columns; the rest is garbage
    outputs.append(merged_id2[merged_id2.isna().sum(axis=1) == first_idx]
                   .set_index('index'))

# backward update pass
reversed_it = iter(outputs[::-1])
out = next(reversed_it)[['A', 'B', 'X', 'Y', 'Id1', 'Id2']]
for el in reversed_it:
    out.update(el)

Output:


     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogpqr  0.2  uuid02  nx05
5  dog  dog2  dogabc  0.8  uuid01  nx05

EDIT: a 10x speedup on my machine:

# changed
out = next(reversed_it)
# into
out = next(reversed_it)[['A', 'B', 'X', 'Y', 'Id1', 'Id2']]
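
For reference, a small harness like this (not part of the answer) could time just the backward pass, given the outputs list built by the forward pass above:

import timeit

def backward_pass():
    it = iter(outputs[::-1])
    out = next(it)[['A', 'B', 'X', 'Y', 'Id1', 'Id2']]
    for el in it:
        out.update(el)
    return out

# each run works on a fresh column-selected copy, so repeats are safe
print(timeit.timeit(backward_pass, number=100))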

This uses the concept of how a SQL engine deals with a complex query: generate a partial Cartesian product, then filter the result down to the required set.

The filter here is the quality of the match: in the first version below it is the longest available key, and in the updated version further down it is a computed match weight.

import io
import pandas as pd

main_df = pd.read_csv(io.StringIO("""     A     B       X    Y     Id1
0  cat  cat1  catabc  0.1  uuid01
1  cat  cat1  catxyz  0.4  uuid02
2  cat  cat2  catpqr  0.5  uuid01
3  dog  dog1  dogxyz  0.3  uuid03
4  dog  dog2  dogpqr  0.2  uuid02
5  dog  dog2  dogabc  0.8  uuid01"""), sep=r"\s+")
map_df = pd.read_csv(io.StringIO("""     A     B       X    Y   Id2
0  cat  cat1  catxyz  0.4  nx01
1  cat  cat1     NaN  NaN  nx02
2  cat   NaN     NaN  NaN  nx03
3  dog  dog1  dogxyz  NaN  nx04
4  dog   NaN     NaN  NaN  nx05"""), sep=r"\s+")

result_df = (main_df
 # minimum match - just A, effectively a partial cartesian product
 .merge(map_df, on=["A"], suffixes=("","_m"))
 # simpler to use empty string for NaN
 .fillna("")
 # synthetic column of rest of key and its length
 .assign(c=lambda dfa: dfa["B_m"]+dfa["X_m"]+dfa["Y_m"].astype(str),
         l=lambda dfa: dfa["c"].str.len())
 # longest synthetic key is the one we want
 .sort_values(["A","B","X","Y","l"], ascending=[True,True,True,True,False])
 .groupby(["A","B","X","Y"]).first()
 # cleanup and don't show workings...
 .reset_index()
 .loc[:,["A","B","X","Y","Id1","Id2"]]
)

From the comments:

- The inelegant way would be for me to do 4 merges? So you do ['A','B','C','D'], then ['A','B','C'], and so on. I'd like that solution if the resulting dataframe comes out as expected.
- This solution doesn't use an iterative mechanism and is very fast on large dataframes as far as I can see. However, if you look at the generated dataframe, it doesn't match the expected output.
- @AsifIqbal I only work on a laptop, not enough screen space :-). Updated with the same approach but a better calculation to sort the match column; it shows the match level: 1110 for 4 cols, 1100 for 3 cols, etc.
- The solution works, thank you very much. But I think this answer could be improved with the help of Rob Raymond's answer; for very large dataframes the iteration is very slow.
- @AsifIqbal, edited with a 10x speedup on my machine. See the change?

As those comments note, ranking by key length alone never penalises a mismatched cell, so the first version can promote a wrong row. The updated version therefore computes a match weight: NaN is neutral, a different value penalises, an equal value rewards:
import numpy as np

result_df = (main_df
 # minimum match - just A, effectively a partial cartesian product
 .merge(map_df, on=["A"], suffixes=("","_m"))
 # calculate weight of match: NaN is neutral, different penalises, same rewards
 .assign(
     B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"]==dfa["B_m"],1,-1)*1000),
     X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"]==dfa["X_m"],1,-1)*100),
     Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"]==dfa["Y_m"],1,-1)*10),
     w=lambda dfa: dfa.loc[:,["B_w","X_w","Y_w"]].sum(axis=1)
 )
 # biggest weight is the one we want
 .sort_values(["A","B","X","Y","w"], ascending=[True,True,True,True,False])
 .groupby(["A","B","X","Y"]).first()
 # cleanup and don't show workings...
 .reset_index()
 .loc[:,["A","B","X","Y","Id1","Id2"]]
)
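
A quick sanity check, not from the original answer, that the weighted result reproduces the expected mapping from the question (keyed on the full row key):

# expected Id2 per (A, B, X, Y) row, taken from result_df in the question
expected = {
    ("cat", "cat1", "catabc", 0.1): "nx02",
    ("cat", "cat1", "catxyz", 0.4): "nx01",
    ("cat", "cat2", "catpqr", 0.5): "nx03",
    ("dog", "dog1", "dogxyz", 0.3): "nx04",
    ("dog", "dog2", "dogpqr", 0.2): "nx05",
    ("dog", "dog2", "dogabc", 0.8): "nx05",
}
got = {tuple(row): id2
       for row, id2 in zip(result_df[["A", "B", "X", "Y"]].to_numpy(),
                           result_df["Id2"])}
assert got == expected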
Output (using the weighted approach):

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogabc  0.8  uuid01  nx05
5  dog  dog2  dogpqr  0.2  uuid02  nx05