Python 熊猫:仅当另一列中的值匹配时,才计算行之间的重叠字(多个实例的问题)
我有一个数据框,看起来如下所示,但有许多行:Python 熊猫:仅当另一列中的值匹配时,才计算行之间的重叠字(多个实例的问题),python,pandas,list,combinations,sentence-similarity,Python,Pandas,List,Combinations,Sentence Similarity,我有一个数据框,看起来如下所示,但有许多行: import pandas as pd data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_call','order_taxi'], 'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me',
import pandas as pd
data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','order call','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['order','call'],['new','taxi']]}
df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])
我已使用以下代码计算了jaccard相似性(不是我的解决方案):
并修改了给出的代码,以比较每两行之间的重叠字,并由此创建了一个数据帧:
overlapping_word_list=[]
for val in list(combinations(range(len(data_new)), 2)):
overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
#creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list,columns=['overlapping_list'])
的答案帮助了我,我对它做了一些更改,以获得我喜欢的输出:
for intent in df.intent.unique():
# loc returns a DataFrame but we need just the column
rows = df.loc[df.intent == intent,['intent','key_words','Sent']].values.tolist()
combos = combinations(rows, 2)
for combo in combos:
x, y = rows
overlap = lexical_overlap(x[1], y[1])
print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
问题是,当有更多相同意图的实例时,我会遇到以下错误:
ValueError:要解压缩的值太多(应为2个)
我不知道如何处理这个问题,因为我的数据集中还有很多例子你想要这个吗
from itertools import combinations
from operator import itemgetter
items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set,df.key_words.values)),2)):
keywords = (list(map(itemgetter(1),item)))
intersect = keywords[0].intersection(keywords[1])
if len(intersect) > 0:
str_list = list(map(itemgetter(0),item))
str_list.append(intersect)
items_to_consider.append(str_list)
for i in items_to_consider:
for item in i[2]:
if item in i[0] and item in i[1]:
print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")
IIUC,在您的循环中,
对于组合…
解包应该是x,y=combo
,而不是x,y=rows
?对于给定的inp,您的预期输出df是多少?
from itertools import combinations
from operator import itemgetter
items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set,df.key_words.values)),2)):
keywords = (list(map(itemgetter(1),item)))
intersect = keywords[0].intersection(keywords[1])
if len(intersect) > 0:
str_list = list(map(itemgetter(0),item))
str_list.append(intersect)
items_to_consider.append(str_list)
for i in items_to_consider:
for item in i[2]:
if item in i[0] and item in i[1]:
print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")