Python: find similar elements across rows to create a new column when two given columns match

My DataFrame looks like this:

df = 
     query    subject     HPSame
0    cat      dog         HPS_1
1    cat      horse       HPS_2
2    king     queen       HPS_3
3    queen    people      HPS_4
4    CAR      VAN         HPS_5
5    dog      tiger       HPS_6
6    CAR      TRUCK       HPS_7
7    horse    deer        HPS_8
8    CAR      JEEP        HPS_9
9    TRUCK    LORRY       HPS_10
10   VAN      TRAIN       HPS_11
11   people   children    HPS_12
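In df, query is similar to subject, i.e. cat is similar to dog, so that row is labeled HPS_1. Cat and horse are also similar, and dog and tiger are similar, so all of these rows should share the same match label, HPS_1. I want to find such similar elements, as in a = b = c = d, and give them the same label in a new column.

I have simplified my problem. The subject and query values are actually alphanumeric identifiers, e.g. WP_020314852.1 = WP_004217899.1 = WP_150395973.1 means they are the same type. The expected result is: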

df = 

     query    subject     HPSame   match
0    cat      dog         HPS_1    HPS_1
1    cat      horse       HPS_2    HPS_1
2    king     queen       HPS_3    HPS_3
3    queen    people      HPS_4    HPS_3
4    CAR      VAN         HPS_5    HPS_5
5    dog      tiger       HPS_6    HPS_1
6    CAR      TRUCK       HPS_7    HPS_5
7    horse    deer        HPS_8    HPS_1
8    CAR      JEEP        HPS_9    HPS_5
9    TRUCK    LORRY       HPS_10   HPS_5
10   VAN      TRAIN       HPS_11   HPS_5
11   people   children    HPS_12   HPS_3  
I tried:

df['query_s'] = df['query'].shift(-1)
df['HPSame_s'] = df['HPSame'].shift(-1)
condition = [(df['query'] == df['query_s'])]
ifTrue = df['HPSame']
ifFalse = df['HPSame_s']
df['match'] = np.where(condition, ifTrue, ifFalse)
This throws ValueError: Length of values does not match length of index
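The error most likely comes from wrapping the condition in a list, which makes np.where return a 2-D array with a single row instead of one value per DataFrame row. The sketch below reproduces the shape difference on a four-row sample (the sample rows and the names next_query, next_label, wrapped and flat are chosen here for illustration). Note that even with the list removed, comparing each row only with the next one cannot follow chains such as cat = dog = tiger, so this fixes the exception but not the underlying problem.

```python
import numpy as np
import pandas as pd

# Small illustrative sample (first four rows of the question's data).
df = pd.DataFrame({
    "query":  ["cat", "cat", "king", "queen"],
    "HPSame": ["HPS_1", "HPS_2", "HPS_3", "HPS_4"],
})

next_query = df["query"].shift(-1)
next_label = df["HPSame"].shift(-1)

# Wrapping the mask in a list adds a leading axis: shape (1, 4) instead of (4,).
wrapped = np.where([df["query"] == next_query], df["HPSame"], next_label)

# Passing the boolean Series directly keeps one value per row.
flat = np.where(df["query"] == next_query, df["HPSame"], next_label)

print(wrapped.shape, flat.shape)
```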

We can do this by building a graph from the two columns with networkx:
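The code for this step is not shown here, so what follows is only a minimal sketch of one way it could look, not necessarily the answerer's exact method. It assumes each (query, subject) row is an undirected edge, that rows in the same connected component belong together, and that each group takes the HPSame value of its first row. The names node_to_group and group are introduced for illustration.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    "query":   ["cat", "cat", "king", "queen", "CAR", "dog",
                "CAR", "horse", "CAR", "TRUCK", "VAN", "people"],
    "subject": ["dog", "horse", "queen", "people", "VAN", "tiger",
                "TRUCK", "deer", "JEEP", "LORRY", "TRAIN", "children"],
    "HPSame":  [f"HPS_{i}" for i in range(1, 13)],
})

# Treat each (query, subject) pair as an undirected edge.
G = nx.from_pandas_edgelist(df, source="query", target="subject")

# Map every node to the id of the connected component it belongs to.
node_to_group = {
    node: group_id
    for group_id, component in enumerate(nx.connected_components(G))
    for node in component
}

# Both ends of an edge share a component, so looking up `query` is enough.
group = df["query"].map(node_to_group)

# Assumption: a group is labeled with the HPSame of its first row.
df["match"] = df.groupby(group)["HPSame"].transform("first")
print(df)
```

Connected components handle the transitive case (a = b, b = c, so a = c) directly, which a row-by-row comparison with shift cannot do.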

Output:

     query   subject  HPSame  match
0      cat       dog   HPS_1  HPS_1
1      cat     horse   HPS_2  HPS_1
2     king     queen   HPS_3  HPS_3
3    queen    people   HPS_4  HPS_3
4      CAR       VAN   HPS_5  HPS_5
5      dog     tiger   HPS_6  HPS_1
6      CAR     TRUCK   HPS_7  HPS_5
7    horse      deer   HPS_8  HPS_1
8      CAR      JEEP   HPS_9  HPS_5
9    TRUCK     LORRY  HPS_10  HPS_5
10     VAN     TRAIN  HPS_11  HPS_5
11  people  children  HPS_12  HPS_3
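To draw the graph network built from the DataFrame:

import matplotlib.pyplot as plt
import networkx as nx
fig, ax = plt.subplots(figsize=(10, 8))
nx.draw_networkx(G, node_color='y', ax=ax)  # G: graph built from the query/subject pairs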

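I may be slow, but I don't understand "cat is similar to dog, so it is labeled HPS_1. Also, cat and horse are similar, and dog and tiger are similar, so they should have the same match label, HPS_1." How do you define similarity? How is the "match" column calculated?

Say cat is a protein with ID WP_120314582.1, and dog is another protein with ID WP_13242761.5. The two proteins are 100% similar, so they should have the same name even though they have different IDs.

Thank you very much!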