Python: returning a similarity matrix from two variable-length arrays of strings (scipy option?)


Suppose I have two arrays:

import numpy as np
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])

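I want to compute the similarity between the strings in arr2 and the strings in arr1.

arr1 is an array of correctly spelled words.

arr2 is an array of words that are not recognized in a dictionary of words.

I want to return a matrix, which I will then convert to a pandas DataFrame.

My current solution ():

Output: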

            faucet   faucets      bath     parts  bathroom   faucett  \
faucet    0.000000  0.923077  0.400000  0.363636  0.285714  0.923077   
faucets   0.923077  0.000000  0.363636  0.500000  0.266667  0.857143   
bath      0.400000  0.363636  0.000000  0.444444  0.666667  0.363636   
parts     0.363636  0.500000  0.444444  0.000000  0.307692  0.333333   
bathroom  0.285714  0.266667  0.666667  0.307692  0.000000  0.266667   
faucett   0.923077  0.857143  0.363636  0.333333  0.266667  0.000000   
faucetd   0.923077  0.857143  0.363636  0.333333  0.266667  0.857143   
bth       0.222222  0.200000  0.857143  0.250000  0.545455  0.200000   
kichen    0.333333  0.307692  0.200000  0.000000  0.142857  0.307692   

           faucetd       bth    kichen  
faucet    0.923077  0.222222  0.333333  
faucets   0.857143  0.200000  0.307692  
bath      0.363636  0.857143  0.200000  
parts     0.333333  0.250000  0.000000  
bathroom  0.266667  0.545455  0.142857  
faucett   0.857143  0.200000  0.307692  
faucetd   0.000000  0.200000  0.307692  
bth       0.200000  0.000000  0.222222  
kichen    0.307692  0.222222  0.000000

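The problem with this solution: I waste time computing pairwise distance ratios between words that I already know are spelled correctly.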
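To put a rough number on that overhead, here is a small sketch that only counts comparisons. It assumes the square matrix came from pooling both arrays and comparing every word with every other word, which is what the 9 x 9 output suggests; the variable names are illustrative only.

```python
from itertools import combinations, product

arr1 = ['faucet', 'faucets', 'bath', 'parts', 'bathroom']  # correctly spelled words
arr2 = ['faucett', 'faucetd', 'bth', 'kichen']             # unrecognized words

# Pooling both arrays means every distinct pair of words gets compared.
all_pairs = list(combinations(arr1 + arr2, 2))

# Only the (unrecognized word, correctly spelled word) pairs are of interest.
needed_pairs = list(product(arr2, arr1))

# Everything else (pairs within arr1, pairs within arr2) is wasted work.
wasted = len(all_pairs) - len(needed_pairs)

print(len(all_pairs), len(needed_pairs), wasted)  # 36 20 16
```

With these two small arrays, 16 of the 36 comparisons are not needed; the gap widens as the list of correctly spelled words grows.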

What I would like to do is hand a function arr1 and arr2 (which may have different lengths!) and get back a matrix of ratios (not necessarily square).

The result would look like this (without the computational overhead):

>>> df.drop(index=arr1, columns=arr2)

           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857

I think you are looking for cdist:

import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from Levenshtein import ratio

arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])

# cdist expects 2-D inputs, so each array is reshaped into a column of shape (n, 1).
# The callable receives one row from each input; x[0] and y[0] are the two strings.
matrix = cdist(arr2.reshape(-1, 1), arr1.reshape(-1, 1), lambda x, y: ratio(x[0], y[0]))
# Rows are the unrecognized words (arr2); columns are the correctly spelled words (arr1).
df = pd.DataFrame(data=matrix, index=arr2, columns=arr1)

Result:

           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857

Use cdist (instead of pdist) and pass the two arrays to it separately.

When I read the documentation on cdist, I could not make sense of it, and I did not know the arrays could have different lengths. Thanks for your help!
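For readers who want to see why the matrix can be rectangular without installing anything, here is a dependency-free sketch. The helper names (`lcs_length`, `similarity`, `similarity_matrix`) are hypothetical, and the formula 2 * LCS / (len(a) + len(b)) is an assumption that happens to reproduce the ratios in the tables above; it is not presented as how the Levenshtein package computes `ratio` internally.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed stand-in for a ratio: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


def similarity_matrix(rows, cols):
    """Return a len(rows) x len(cols) nested list of similarity scores."""
    return [[similarity(r, c) for c in cols] for r in rows]


arr1 = ['faucet', 'faucets', 'bath', 'parts', 'bathroom']
arr2 = ['faucett', 'faucetd', 'bth', 'kichen']

# One row per unrecognized word, one column per correctly spelled word.
matrix = similarity_matrix(arr2, arr1)
for word, scores in zip(arr2, matrix):
    print(word, [round(s, 6) for s in scores])
```

Only the len(arr2) x len(arr1) comparisons are made, so the two inputs never need to have the same length. For large arrays a compiled metric is likely to be much faster than this pure-Python loop.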