Python: fuzzy-match strings within one column and create a new DataFrame using fuzzywuzzy
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {'id': [1, 2, 3, 4, 5, 6],
     'fruits': ['apple', 'apples', 'orange', 'apple tree', 'oranges', 'mango']
    })
   id      fruits
0   1       apple
1   2      apples
2   3      orange
3   4  apple tree
4   5     oranges
5   6       mango
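I want to find fuzzy-matching strings in the column fruits and get a new DataFrame, as shown below, that keeps the pairs whose ratio_score is higher than 80.
How can I do this in Python with the fuzzywuzzy package? Thanks. Please note that the ratio_score values below are made up as an example.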
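My attempt so far:
from fuzzywuzzy import fuzz

# Note: this compares each row with its own copy, so every score is 100
df.loc[:, 'fruits_copy'] = df['fruits']
df['ratio_score'] = df[['fruits', 'fruits_copy']].apply(lambda row: fuzz.ratio(row['fruits'], row['fruits_copy']), axis=1)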
Expected result:
   id  fruits  matched_id matched_fruits  ratio_score
0   1   apple           2         apples           95
1   1   apple           4     apple tree           85
2   2  apples           4     apple tree           80
3   3  orange           5        oranges           95
4   6   mango
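One way to get rows shaped like the expected result is to score every unordered pair of fruits once and keep the pairs above the threshold. The following is a minimal sketch, not a definitive implementation: `match_rows` and `records` are hypothetical names, and the `difflib` fallback is an assumption added so the snippet still runs when fuzzywuzzy is not installed (fuzzywuzzy itself falls back to `difflib.SequenceMatcher` when python-Levenshtein is missing).

```python
from itertools import combinations

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in for fuzz.ratio when fuzzywuzzy is unavailable.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def match_rows(records, threshold=80):
    """Score each unordered pair of (id, name) records once.

    Returns (id, name, matched_id, matched_name, score) tuples for pairs
    scoring above `threshold`, followed by one blank-match row for every
    record that did not take part in any pair.
    """
    rows, matched = [], set()
    for (id_a, a), (id_b, b) in combinations(records, 2):
        score = ratio(a, b)
        if score > threshold:
            rows.append((id_a, a, id_b, b, score))
            matched.update((id_a, id_b))
    # Records without any match keep empty match columns.
    rows.extend((i, name, None, None, None)
                for i, name in records if i not in matched)
    return rows


records = [(1, 'apple'), (2, 'apples'), (3, 'orange'),
           (4, 'apple tree'), (5, 'oranges'), (6, 'mango')]
for row in match_rows(records):
    print(row)
```

With the question's `df`, `records` could be built as `list(zip(df['id'], df['fruits']))`, and the rows wrapped with `pd.DataFrame(rows, columns=['id', 'fruits', 'matched_id', 'matched_fruits', 'ratio_score'])`. Because the scores in the expected result are made up, the real output differs: `fuzz.ratio` gives 91 for apple/apples and 92 for orange/oranges, while apple/apple tree scores only 67 and falls below the threshold.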
My attempt, based on a related reference:
df.loc[:, 'fruits_copy'] = df['fruits']
compare = pd.MultiIndex.from_product([df['fruits'],
                                      df['fruits_copy']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)
                       ratio  token
apple      apple         100    100
           apples         91     91
           orange         36     36
           apple tree     67     67
           oranges        33     33
           mango          20     20
apples     apple          91     91
           apples        100    100
           orange         33     33
           apple tree     62     62
           oranges        46     46
           mango          18     18
orange     apple          36     36
           apples         33     33
           orange        100    100
           apple tree     25     25
           oranges        92     92
           mango          55     55
apple tree apple          67     67
           apples         62     62
           orange         25     25
           apple tree    100    100
           oranges        24     24
           mango          13     13
oranges    apple          33     33
           apples         46     46
           orange         92     92
           apple tree     24     24
           oranges       100    100
           mango          50     50
mango      apple          20     20
           apples         18     18
           orange         55     55
           apple tree     13     13
           oranges        50     50
           mango         100    100
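The cross product above contains every pair twice plus the self-matches, and it loses the id column. A possible way to turn the same idea into the long format is a cross merge that carries the ids along, followed by a filter. This is a sketch under stated assumptions: it swaps `MultiIndex.from_product` for `DataFrame.merge(how='cross')`, which needs pandas 1.2 or newer; the names `right`, `pairs`, and `result` are hypothetical; and the `difflib` fallback is only a stand-in for `fuzz.ratio` when fuzzywuzzy is not installed.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in for fuzz.ratio when fuzzywuzzy is unavailable.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'fruits': ['apple', 'apples', 'orange', 'apple tree', 'oranges', 'mango'],
})

# Pair every row with every row, keeping both ids (requires pandas >= 1.2).
right = df.rename(columns={'id': 'matched_id', 'fruits': 'matched_fruits'})
pairs = df.merge(right, how='cross')

# Keep each unordered pair once; this also drops the self-matches.
pairs = pairs[pairs['id'] < pairs['matched_id']].copy()

# Score the remaining pairs and keep those above the threshold.
pairs['ratio_score'] = [ratio(a, b)
                        for a, b in zip(pairs['fruits'], pairs['matched_fruits'])]
result = pairs[pairs['ratio_score'] > 80].reset_index(drop=True)
print(result)
```

Going by the scores in the table above, only apple/apples (91) and orange/oranges (92) pass the threshold of 80. Rows without any match, such as mango, are not included here; they could be appended afterwards by selecting the ids that appear in neither `id` nor `matched_id` of `result`.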