Python: creating columns with similarity index values (python, pandas, fuzzywuzzy)

How can I create columns that show the similarity index for each row separately?

This code

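from fuzzywuzzy import fuzz

def func(name):
    # True for every row whose name scores at least 85 against `name`
    matches = try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name) >= 85), axis=1)
    # keep the `word` of each matching row
    return [try_test.word[i] for i, x in enumerate(matches) if x]

try_test.apply(lambda row: func(row['name']), axis=1)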

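returns the `word` entries at the indices that match the condition `>= 85`. However, I would also like to get the scores themselves, by comparing each field with all the others.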

The dataset is

try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'], 
                         'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})

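Many thanks for your help.

Expected output (the values are only an example):

with a value of 100 on the diagonal, because there I am comparing dog with dog, and so on.

I am also open to a different approach if you think it would be better.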

IIUC, you can slightly change your function to get what you want:

def func(name):
    return try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name)), axis=1)

print(try_test.apply(lambda row: func(row['name']), axis=1))
     0    1    2    3    4    5
0  100    0   33  100  100    0
1    0  100  100    0   33   33
2   33  100  100   29   43   14
3  100    0   29  100   71    0
4  100   33   43   71  100    0
5    0   33   14    0    0  100
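As a side note, the same full matrix can be built with a plain nested comprehension and labelled by name, which makes the rows and columns easier to read. This is only a sketch: `similarity_matrix` and `toy_scorer` are hypothetical names, and `toy_scorer` is a crude substring stand-in so the snippet runs without fuzzywuzzy. In the setting above you would pass `fuzz.partial_ratio` as the scorer instead.

```python
import pandas as pd


def similarity_matrix(names, scorer):
    """Score every name against every name; rows and columns are labelled by name."""
    names = list(names)
    data = [[scorer(a, b) for b in names] for a in names]
    return pd.DataFrame(data, index=names, columns=names)


def toy_scorer(a, b):
    # Hypothetical stand-in (NOT fuzz.partial_ratio): 100 if one string
    # contains the other, otherwise 0.
    return 100 if a in b or b in a else 0


names = ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']
sim = similarity_matrix(names, toy_scorer)
print(sim)
```

With the real scorer, `similarity_matrix(try_test['name'], fuzz.partial_ratio)` should give the same numbers as the nested `apply`, just with names instead of positions as labels.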

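That said, more than half of the calculation is unnecessary, because the result is a symmetric matrix with 100 on the diagonal. So if your data is bigger, you can run `partial_ratio` only against the rows before the current one. Adding a `reindex`, and then building the full matrix with the transpose (`.T`) and `np.diag`, you can do: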

def func_pr (row):
    return (try_test.loc[:row.name-1, 'name']
                    .apply(lambda name: fuzz.partial_ratio(name, row['name'])))

#start at index 1 (second row)
pr = (try_test.loc[1:].apply(func_pr, axis=1)
         .reindex(index=try_test.index, 
                  columns=try_test.index)
         .fillna(0)
         .add_prefix('sim_idx')
     )

#complete the result with transpose and diag
pr += pr.to_numpy().T + np.diag(np.ones(pr.shape[0]))*100

# concat
res = pd.concat([try_test, pr.astype(int)], axis=1)

and you get

print(res)
     word      name  sim_idx0  sim_idx1  sim_idx2  sim_idx3  sim_idx4  \
0   apple       dog       100         0        33       100       100   
1  orange       cat         0       100       100         0        33   
2    diet   mad cat        33       100       100        29        43   
3  energy  good dog       100         0        29       100        71   
4    fire   bad dog       100        33        43        71       100   
5    cake   chicken         0        33        14         0         0   

   sim_idx5  
0         0  
1        33  
2        14  
3         0  
4         0  
5       100
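The half-computation idea can also be sketched with `itertools.combinations`, which visits each unordered pair exactly once and fills both triangles directly. This assumes, as the answer does, that the scorer is symmetric and that a name scores 100 against itself. `symmetric_scores` and `toy_scorer` are hypothetical names; `toy_scorer` is a substring stand-in so the sketch runs without fuzzywuzzy, and `fuzz.partial_ratio` would take its place in practice.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def symmetric_scores(names, scorer, self_score=100):
    """Call scorer once per unordered pair and mirror the result."""
    names = list(names)
    n = len(names)
    scores = np.zeros((n, n), dtype=int)
    for i, j in combinations(range(n), 2):
        scores[i, j] = scores[j, i] = scorer(names[i], names[j])
    # Assumption: every name matches itself perfectly.
    np.fill_diagonal(scores, self_score)
    return scores


def toy_scorer(a, b):
    # Hypothetical stand-in (NOT fuzz.partial_ratio).
    return 100 if a in b or b in a else 0


try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'],
                         'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})

scores = symmetric_scores(try_test['name'], toy_scorer)
sim = pd.DataFrame(scores, index=try_test.index).add_prefix('sim_idx')
res = pd.concat([try_test, sim], axis=1)
print(res)
```

For n rows this makes n(n-1)/2 scorer calls instead of n², which is where the saving comes from on larger data.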

What is your expected output? Should it be the values from `fuzz.partial_ratio(row['name'], name)`?

Yes, the values obtained by computing the match index between each term and all the others. I only got a True/False condition (based on >= 85) and, unfortunately, could not get back to the numeric values. I have updated the question.

Thank you very much, Ben.T.
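To tie the comments back to the question: once the numeric matrix exists, the original True/False condition is just a threshold on it, so both views are available at once. The sketch below hardcodes the score matrix printed in the answer rather than recomputing it, so it runs without fuzzywuzzy; `mask` and `matches` are illustrative names.

```python
import numpy as np
import pandas as pd

words = ['apple', 'orange', 'diet', 'energy', 'fire', 'cake']

# Score matrix copied from the answer's printed output.
scores = pd.DataFrame([[100,   0,  33, 100, 100,   0],
                       [  0, 100, 100,   0,  33,  33],
                       [ 33, 100, 100,  29,  43,  14],
                       [100,   0,  29, 100,  71,   0],
                       [100,  33,  43,  71, 100,   0],
                       [  0,  33,  14,   0,   0, 100]])

# True wherever a pair scores at least 85, as in the question's condition.
mask = scores.to_numpy() >= 85

# For each row, the words of all rows whose name matched it.
matches = [[w for w, hit in zip(words, row) if hit] for row in mask]
print(matches[0])  # ['apple', 'energy', 'fire']
```

Each row still matches itself here, because the diagonal is 100; drop the diagonal first if self-matches are not wanted.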