How python-Levenshtein.ratio is computed


According to the python-Levenshtein.ratio source, it is computed as (lensum - ldist) / lensum. This works for

# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance('ab', 'a') # returns 1
Levenshtein.ratio('ab', 'a')    # returns 0.666666

However, it does not seem to hold for

Levenshtein.distance('ab', 'ac')  # returns 1
Levenshtein.ratio('ab', 'ac')     # returns 0.5

I feel I must be missing something very simple, but why isn't the second result 0.75?

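For 'ab' and 'ac', the alignment behind the Levenshtein distance is:

  a c
  a b

Alignment length = 2
Number of mismatches = 1

The Levenshtein distance is 1, because converting ac into ab (or the reverse) takes only one substitution.

Distance ratio = (Levenshtein distance) / (alignment length) = 1/2 = 0.5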
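To make the unit-cost distance concrete, here is a minimal pure-Python sketch of the usual dynamic-programming recurrence. The function name levenshtein_distance is made up for this illustration and is not the library's implementation; the alignment length of 2 is simply taken from the alignment shown above.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Edit distance where insert, delete and substitute all cost 1."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


distance = levenshtein_distance("ab", "ac")
alignment_length = 2  # assumption: read off the alignment shown above
print(distance, distance / alignment_length)
```

With unit costs the substitution of b by c counts once, which is why the distance is 1 here.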

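Edit

You are writing

(lensum - ldist) / lensum = 1 - ldist/lensum = 1 - 0.5 = 0.5

But this is a matching score (not a distance). You may notice it written as a matching percentage:

p = (1 - l/m) × 100

where l is the Levenshtein distance and m is the length of the longer of the two words.
(Note: some authors use the longer of the two lengths; I used the alignment length.)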
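Why do some authors divide by the alignment length and others by the maximum length? Because Levenshtein does not consider gaps. The distance is the number of edits (insertions + deletions + replacements), whereas Needleman–Wunsch, the standard global alignment, does consider gaps. That (gap) handling is the difference between Needleman–Wunsch and Levenshtein, so many papers use the maximum length of the two sequences (but this is my own understanding, and I am not 100% sure).
There is an IEEE Transactions paper on pattern analysis in which the normalized edit distance is defined as follows:
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).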
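As a rough illustration of the simpler normalization mentioned above (dividing by the length of the longer word), the sketch below computes l/m and the matching percentage p = (1 - l/m) × 100. It is not the path-based W(P)/L(P) definition quoted from the paper, and the function names are hypothetical.

```python
from functools import lru_cache


def levenshtein_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, written as a memoized recursion."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j  # insert the first j characters of b
        if j == 0:
            return i  # delete the first i characters of a
        return min(
            d(i - 1, j) + 1,                           # deletion
            d(i, j - 1) + 1,                           # insertion
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # match or substitution
        )

    return d(len(a), len(b))


def normalized_distance(a: str, b: str) -> float:
    """l / m, where m is the length of the longer string."""
    longest = max(len(a), len(b))
    # Assumption: two empty strings are treated as distance 0.
    return levenshtein_distance(a, b) / longest if longest else 0.0


def match_percent(a: str, b: str) -> float:
    """p = (1 - l/m) * 100."""
    return (1 - normalized_distance(a, b)) * 100


print(normalized_distance("ab", "ac"), normalized_distance("ab", "a"))
print(match_percent("ab", "ac"))
```

Under this normalization both pairs from the question come out the same, so it cannot by itself explain why the library returns two different ratios.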
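Although there is no absolute standard, the normalized Levenshtein distance is most commonly defined as ldist / max(len(a), len(b)). That yields 0.5 for both examples.

The max makes sense because it is the lowest upper bound on the Levenshtein distance: to obtain a from b, where len(a) > len(b), you can always substitute the first len(b) elements of b with the corresponding elements of a and then insert the missing part a[len(b):], for a total of len(a) edit operations. This argument extends in the obvious way to the remaining case, len(a) <= len(b).

Looking more carefully at the C code, I found that this apparent contradiction arises because ratio treats the "replace" edit operation differently from the other operations (it gives it a cost of 2), whereas distance treats them all the same, with a cost of 1.

This can be seen in the calls to the internal levenshtein_common function made within the ratio_py function. They eventually cause a different cost parameter to be passed to another internal function, lev_edit_distance, which has the following doc snippet: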
@xcost: If nonzero, the replace operation has weight 2, otherwise all
        edit operations have equal weights of 1.

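The documented header and signature of lev_edit_distance() appear further down this page.

[Answer]
In my example, ratio('ab', 'ac') implies a replacement operation (cost of 2) over the total length of the strings (4), hence (4 - 2)/4 = 0.5.

That explains the "how". I suppose the only remaining aspect is the "why", but for the moment I am satisfied with this understanding.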
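The behaviour described above can be imitated in pure Python. This is only a sketch under stated assumptions: weighted_edit_distance and ratio_sketch are hypothetical names, the substitution_cost parameter stands in for the xcost flag, and returning 1.0 for two empty strings is my own choice rather than something confirmed from the C source.

```python
def weighted_edit_distance(source: str, target: str, substitution_cost: int = 1) -> int:
    """Edit distance with insert/delete cost 1 and a configurable substitution cost."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            diagonal = previous[j - 1] + (0 if s_char == t_char else substitution_cost)
            current.append(min(diagonal, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def ratio_sketch(a: str, b: str) -> float:
    """(lensum - ldist) / lensum, with substitutions weighted 2."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0  # assumption: two empty strings count as identical
    ldist = weighted_edit_distance(a, b, substitution_cost=2)
    return (lensum - ldist) / lensum


print(weighted_edit_distance("ab", "ac", 1), weighted_edit_distance("ab", "ac", 2))
print(ratio_sketch("ab", "ac"), ratio_sketch("ab", "a"))
```

With a substitution cost of 1 the same function gives the plain distance, so the two results in the question differ only in that one weight.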
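(lensum - ldist) / lensum

Here ldist is not the distance; it is the sum of the costs.

Each number in the matrix that is not a match comes from above, from the left, or from the diagonal.
If the number comes from the left, it is an insertion; if it comes from above, it is a deletion; if it comes from the diagonal, it is a replacement.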
>>> import Levenshtein as lev
>>> lev.distance("ab","ac")
1
>>> lev.ratio("ab","ac")
0.5
>>> (4.0-1.0)/4.0    # Wrong: the distance is 1, but a replacement costs 2
0.75
>>> lev.ratio("ab","a")
0.6666666666666666
>>> lev.distance("ab","a")
1
>>> (3.0-1.0)/3.0    # Coincidence: the distance equals the cost, since the single insert/delete operation costs 1
0.6666666666666666
>>> x="ab"
>>> y="ac"
>>> lev.editops(x,y)
[('replace', 1, 1)]
>>> ldist = sum([2 for item in lev.editops(x,y) if item[0] == 'replace'])+ sum([1 for item in lev.editops(x,y) if item[0] != 'replace'])
>>> ldist
2
>>> ln=len(x)+len(y)
>>> ln
4
>>> (4.0-2.0)/4.0
0.5
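The rule about where each matrix entry comes from (left = insertion, above = deletion, diagonal = replacement) can be sketched as a traceback over the distance matrix. This is an illustrative reimplementation and not the library's editops; when several optimal paths exist, the order of the checks below decides which one is reported.

```python
def edit_operations(source: str, target: str) -> list:
    """Build the unit-cost distance matrix, then walk back from the last cell."""
    rows, cols = len(source) + 1, len(target) + 1
    # table[i][j] = distance between source[:i] and target[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            table[i][j] = min(diagonal, table[i - 1][j] + 1, table[i][j - 1] + 1)

    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and table[i][j] == table[i - 1][j - 1]:
            i, j = i - 1, j - 1          # characters match: no operation
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            ops.append("replace")        # came from the diagonal
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            ops.append("delete")         # came from above
            i -= 1
        else:
            ops.append("insert")         # came from the left
            j -= 1
    ops.reverse()
    return ops


print(edit_operations("ab", "ac"))
print(edit_operations("ab", "a"))
```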

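Insertion and deletion each cost 1; a replacement costs 2.
The replacement cost is 2 because a replacement is a deletion plus an insertion.
For ab and ac the cost is 2, because the only operation is a replacement.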
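If a replacement really is priced as one deletion plus one insertion, the total cost should equal the number of characters that fall outside a longest common subsequence of the two strings. That reformulation is offered here only as a cross-check and is not taken from the library's source; lcs_length and indel_cost are hypothetical names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_cost(a: str, b: str) -> int:
    """Cost when only insertions and deletions (cost 1 each) are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


for pair in (("ab", "ac"), ("ab", "a"), ("look-at", "google")):
    print(pair, indel_cost(*pair))
```

The three costs agree with the values discussed in this thread (2, 1 and 9), which is consistent with treating a replacement as a delete followed by an insert.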


Another example:

The cost is 9 (4 replacements => 4*2 = 8, 1 deletion => 1*1 = 1, 8 + 1 = 9)
The distance is 5 (the matrix entry at (7, 6) is 5)
The ratio is (13 - 9)/13 = 0.3076923077
>>> c="look-at"
>>> d="google"
>>> lev.editops(c,d)
[('replace', 0, 0), ('delete', 3, 3), ('replace', 4, 3), ('replace', 5, 4), ('replace', 6, 5)]
>>> lev.ratio(c,d)
0.3076923076923077
>>> lev.distance(c,d)
5
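The arithmetic for this example can be replayed without the library by copying the editops list printed above into a literal and weighting each operation. This is a small sketch; the variable names are made up.

```python
# Edit operations copied from the lev.editops("look-at", "google") output above.
editops = [
    ("replace", 0, 0),
    ("delete", 3, 3),
    ("replace", 4, 3),
    ("replace", 5, 4),
    ("replace", 6, 5),
]

# A replacement is weighted 2, an insertion or deletion 1.
cost = sum(2 if op == "replace" else 1 for op, _, _ in editops)
distance = len(editops)  # every operation counts once for the plain distance
lensum = len("look-at") + len("google")
ratio = (lensum - cost) / lensum

print(cost, distance, lensum, ratio)
```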

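Hi larsmans! You are correct that the most commonly used definition is ldist / max(len(a), len(b)). Considering the gap, I checked the library (at the link you gave), and I am also confused about why it uses sum. Also, (1 - 1/3) = 0.666... matches the code, but then (1 - 1/4) = 0.75, so how is it 0.5? It is not clear even in the documentation... but the actual formula for calculating the Levenshtein distance is in my answer.
Thanks for this answer. It makes sense, but it does not address what really bothers me: the two results (obtained with the same code) seem inconsistent (that is, they suggest two different ways of computing the ratio). How can that be?
@cjauvin Did you read my comment on your question? I checked, and I have the same impression: according to the documentation it should be 0.75, but the two results in your example contradict each other.
Yes, I saw your comment, and that is why, although it is nice and interesting, I cannot accept your answer as the solution: what I am really after is the reason for the contradiction in this code. Maybe I should ask the PL maintainers.
@cjauvin, the changes I made for you... I mean I was looking at the file (u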
/**
 * lev_edit_distance:
 * @len1: The length of @string1.
 * @string1: A sequence of bytes of length @len1, may contain NUL characters.
 * @len2: The length of @string2.
 * @string2: A sequence of bytes of length @len2, may contain NUL characters.
 * @xcost: If nonzero, the replace operation has weight 2, otherwise all
 *         edit operations have equal weights of 1.
 *
 * Computes Levenshtein edit distance of two strings.
 *
 * Returns: The edit distance.
 **/
_LEV_STATIC_PY size_t
lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;


len1 = len("google")   # 6
len2 = len("look-at")  # 7
len1 + len2            # 13
