Python Pandas: count overlapping words between rows only if the values in another column match


I have a dataframe that looks like the following, but with many more rows:

import pandas as pd

data = {'intent':  ['order_food', 'order_food','order_taxi','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['new','taxi']]}

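df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])

I calculated the Jaccard similarity using the following code (not my own solution):

and modified the given code to compare the overlapping words between every pair of rows, building a dataframe from the result: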

from itertools import combinations

overlapping_word_list = []

# compare every pair of rows, addressed by position
for i, j in combinations(range(len(data_new)), 2):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[i, 0]} and {data_new.iloc[j, 0]} sentences are: {lexical_overlap(data_new.iloc[i, 1], data_new.iloc[j, 1])}")

# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])
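
The lexical_overlap function itself does not appear in the post. As a minimal sketch, assuming it is a Jaccard similarity over Python sets expressed as a percentage, it might look like the following. This is a reconstruction, not the original code. Note that calling set() on a plain string collects characters rather than words; under that assumption the sketch gives the same percentages as the printed results quoted further down in this thread.

```python
def lexical_overlap(doc1, doc2):
    """Jaccard similarity of two iterables, as a percentage.

    Hypothetical reconstruction: with plain strings this compares the
    sets of characters (spaces included), not the sets of words.
    """
    items1, items2 = set(doc1), set(doc2)
    union = items1 | items2
    if not union:  # both inputs empty: define the overlap as 0
        return 0.0
    return len(items1 & items2) / len(union) * 100


# character-level comparison, because the arguments are strings
print(round(lexical_overlap('i need hamburger', 'she wants sushi'), 2))  # 46.67

# word-level comparison: split the sentences into tokens first
print(round(lexical_overlap('call me at 6'.split(), 'she called me'.split()), 2))  # 16.67
```

Splitting the sentences (or passing the key_words lists) before the call is what turns this into an overlap of words rather than characters.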

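Because my dataset is large, running this code to compare all rows takes a very long time. So I would like to compare only sentences that have the same intent, and skip pairs with different intents. I am not sure how to go about doing just that.

Simply iterate over the unique values in the intent column, then use loc to grab only the corresponding rows. If an intent has more than two rows, you will still need combinations to get the unique pairs within that intent.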

from itertools import combinations

for intent in df.intent.unique():
    # select only the Sent values of the rows with this intent
    rows = df.loc[df.intent == intent, "Sent"].to_list()
    combos = combinations(rows, 2)
    for combo in combos:
        # unpack the current pair (not rows, which fails for more than two rows)
        x, y = combo
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

#  Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
#  Overlap for (i need a cab) and (i would like a new taxi) is 40.0
#  Overlap for (call me at 6) and (she called me) is 54.54545454545454
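
To see why restricting the comparison to rows with the same intent helps with run time, it can be useful to count the pairs involved. The sketch below only uses the intent values from the sample data in the question; the variable names are illustrative.

```python
from collections import Counter
from math import comb

intents = ['order_food', 'order_food', 'order_taxi',
           'order_call', 'order_call', 'order_taxi']

# every row compared with every other row: n * (n - 1) / 2 pairs
all_pairs = comb(len(intents), 2)

# only rows sharing an intent: sum of k * (k - 1) / 2 over the group sizes
within_intent_pairs = sum(comb(k, 2) for k in Counter(intents).values())

print(all_pairs)            # 15
print(within_intent_pairs)  # 3
```

The saving grows with the number of distinct intents, since pairs are only formed inside each group.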

OK, so based on @gold_cy's answer, I figured out how to get the desired output mentioned in the comments:

for intent in df.intent.unique():
    # keep the intent, key_words and Sent values of the rows with this intent
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        # unpack the current pair (not rows, which fails for more than two rows)
        x, y = combo
        # x[0] is the intent, x[1] the key_words list, x[2] the sentence
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")

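Thank you very much for your reply. Could you tell me how I can get output like this, based on the key_words column, if I change the lexical_overlap function to output only the intersection? For example: Overlap of intent (order_call) for (call me at 6) and (she called me) is {'call'}. Thanks a lot.

Sorry, I don't follow your question. Your lexical_overlap function only outputs the intersection and nothing else. As for what you want to print, that is up to you.

Sorry if my question was not clear. I would like to get output like the following from the function, for example: Overlap of intent (order_call) for (call me at 6) and (she called me) is {'call'}, and of course the rest would be empty sets. So I thought I could make the following change in the code: df.loc[df.intent==intent,['intent','key_words','Sent']].values.tolist(), but I don't know how to continue from there to get the output mentioned above.

My only problem is that it does not work for cases where there are more instances of an intent.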
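
For the case raised in the last comment, where an intent has more than two rows, the following is a minimal sketch that unpacks each pair produced by combinations and collects the shared keywords into a dataframe. It uses groupby instead of looping over unique(), and plain set intersection in place of lexical_overlap; the seventh row and the names records and overlap_df are made up for illustration.

```python
from itertools import combinations

import pandas as pd

data = {
    'intent': ['order_food', 'order_food', 'order_taxi',
               'order_call', 'order_call', 'order_taxi', 'order_call'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab',
             'call me at 6', 'she called me', 'i would like a new taxi',
             'please call back'],  # last row is hypothetical, added for the demo
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'],
                  ['call', '6'], ['call'], ['new', 'taxi'], ['call', 'back']],
}
df = pd.DataFrame(data)

records = []
# sort=False keeps the intents in order of first appearance
for intent, group in df.groupby('intent', sort=False):
    rows = list(zip(group['Sent'], group['key_words']))
    # each combination is one pair of (sentence, keywords) tuples
    for (sent_x, kw_x), (sent_y, kw_y) in combinations(rows, 2):
        records.append({
            'intent': intent,
            'sent_x': sent_x,
            'sent_y': sent_y,
            'shared_keywords': set(kw_x) & set(kw_y),
        })

overlap_df = pd.DataFrame(records)
print(overlap_df)
```

With three order_call rows this yields three order_call pairs, each sharing {'call'}, while the order_food and order_taxi pairs share an empty set.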