Algorithm: Most efficient way to calculate Levenshtein distance


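I just implemented a best-match file search algorithm that finds the closest match to a string in a dictionary. After profiling the code, I found that the overwhelming majority of the time is spent calculating the distance between the query and the candidate results. I currently compute the Levenshtein distance with a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing this.

Here is my implementation:

public int calculate(String root, String query)
{
    int arr[][] = new int[root.length() + 2][query.length() + 2];
    for (int i = 2; i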
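The linked entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n times k) instead of O(n squared), basically by giving up as soon as the minimum possible distance becomes > k.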

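Since you are looking for the closest match, you can progressively decrease k to the distance of the best match found so far. This won't affect the worst-case behavior (the matches might arrive in decreasing order of distance, meaning you would never bail out any sooner), but the average case should improve.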
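
The two ideas above, giving up once the bound is exceeded and tightening the bound as better matches turn up, can be sketched in Java as follows. This is a minimal illustration, not the asker's code: the names `ClosestMatch`, `closest`, and `distanceWithin` are assumptions made for the example, and the convention of returning -1 for "more than k" is likewise an assumption. The bounded distance here still fills complete rows, so it only saves time when it exits early.

```java
import java.util.List;

final class ClosestMatch {

    /** Returns the Levenshtein distance if it is at most k, otherwise -1. */
    static int distanceWithin(CharSequence s, CharSequence t, int k) {
        int n = s.length(), m = t.length();
        // The length difference is a lower bound on the distance.
        if (Math.abs(n - m) > k) return -1;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = i;
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // The final cell is never smaller than the minimum of any row,
            // so once a whole row exceeds k the result must exceed k too.
            if (rowMin > k) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= k ? prev[m] : -1;
    }

    /** Returns the first entry with the smallest distance to query, or null for an empty dictionary. */
    static String closest(String query, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            // The bound k is the best distance seen so far, so it only ever shrinks.
            int d = distanceWithin(query, word, bestDistance);
            if (d >= 0 && d < bestDistance) {
                best = word;
                bestDistance = d;
                if (d == 0) break; // an exact match cannot be beaten
            }
        }
        return best;
    }
}
```

For example, `ClosestMatch.closest("helo", Arrays.asList("world", "hello", "help", "yellow"))` returns "hello". Ties keep the earlier entry, which is a design choice of this sketch rather than anything the answer prescribes.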

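I believe that if you need substantially better performance, you may have to accept a strong compromise and compute a more approximate distance (and so get "a reasonably good match" rather than necessarily the best one).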

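The linked article discusses your algorithm and various improvements. However, it seems that, at least in the general case, O(n^2) is the best you can get.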


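There are some improvements, however, if you can restrict your problem. For example, if you are only interested in the distance when it is smaller than d, the complexity is O(dn). This may make sense, because a match whose distance is close to the string length is probably not very interesting. See whether you can exploit the specifics of your problem...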

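According to a comment on this blog, you can use VP-trees and achieve O(n log n). Another comment on the same blog links to a further resource. Please let us know whether this works.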

Apache Commons Lang has a quite fast implementation; see `StringUtils.getLevenshteinDistance`.

Here is my translation of it to Scala:

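// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
/**
 * assert(levenshtein("algorithm", "altruistic") == 6)
 * assert(levenshtein("1638452297", "444488444") == 9)
 * assert(levenshtein("", "") == 0)
 * assert(levenshtein("", "a") == 1)
 * assert(levenshtein("aaapppp", "") == 7)
 * assert(levenshtein("frog", "fog") == 1)
 * assert(levenshtein("fly", "ant") == 3)
 * assert(levenshtein("elephant", "hippo") == 7)
 * assert(levenshtein("hippo", "elephant") == 7)
 * assert(levenshtein("hippo", "zzzzzzzz") == 8)
 * assert(levenshtein("hello", "hallo") == 1)
 */
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
  import scala.annotation.tailrec
  def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
    // Inside impl n <= m
    // ...
  }
  // ...
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}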
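
The Scala listing above is incomplete, so here is a minimal Java sketch of the two-row technique that the Commons Lang implementation relies on: only the previous and the current row of the cost matrix are kept, and the shorter input determines the row length. This is an illustration written for this thread, not the Commons Lang source, and the class name `TwoRowLevenshtein` is made up for the example.

```java
final class TwoRowLevenshtein {

    /** Levenshtein distance using two rows instead of a full matrix. */
    static int distance(CharSequence s, CharSequence t) {
        // Make s the shorter sequence so the rows are as small as possible.
        if (s.length() > t.length()) {
            CharSequence tmp = s; s = t; t = tmp;
        }
        int n = s.length(), m = t.length();
        if (n == 0) return m;

        int[] p = new int[n + 1]; // previous row of costs
        int[] d = new int[n + 1]; // current row of costs
        for (int i = 0; i <= n; i++) p[i] = i;

        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            d[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                // minimum of insertion, deletion, and substitution
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // Swap the rows; the old previous row is overwritten next round.
            int[] tmp = p; p = d; d = tmp;
        }
        return p[n]; // after the last swap, p holds the final row
    }
}
```

The running time is still proportional to the product of the two lengths; what this saves compared with a full 2-D array is memory, which drops to two rows of the shorter length plus one.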

I know this is very late, but it is relevant to the discussion at hand.

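As others have mentioned, if all you want to do is check whether the edit distance between two strings is within some threshold k, you can reduce the time complexity to O(kn). A more precise expression is O((2k+1)n): you take a strip that spans k cells on either side of the diagonal cell (a strip of width 2k+1) and compute only the values of the cells that lie on that strip.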
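
A minimal Java sketch of the strip computation described above. The class name `BandedLevenshtein`, the method name `distanceWithin`, and the convention of returning -1 when the distance exceeds k are assumptions made for this example. Cells just outside the strip are set to a large sentinel, so each row touches at most 2k+1 cells plus two boundary cells.

```java
final class BandedLevenshtein {

    /** Returns the distance if it is at most k, otherwise -1; only cells with |i - j| <= k are computed. */
    static int distanceWithin(CharSequence s, CharSequence t, int k) {
        int n = s.length(), m = t.length();
        // If the lengths differ by more than k, the final cell lies outside the strip.
        if (k < 0 || Math.abs(n - m) > k) return -1;
        // A larger k adds nothing, and clamping keeps i + k from overflowing.
        k = Math.min(k, Math.max(n, m));
        final int INF = Integer.MAX_VALUE / 2; // "more than k"; halved so INF + 1 cannot overflow

        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        int hi0 = Math.min(m, k);
        for (int j = 0; j <= hi0; j++) prev[j] = j;
        if (hi0 < m) prev[hi0 + 1] = INF; // right boundary of row 0

        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k); // first column inside the strip
            int hi = Math.min(m, i + k); // last column inside the strip
            // Left boundary: the true column-0 value, or a sentinel once the strip has left column 0.
            curr[lo - 1] = (lo == 1) ? i : INF;
            for (int j = lo; j <= hi; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            if (hi < m) curr[hi + 1] = INF; // right boundary, read by the next row
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= k ? prev[m] : -1;
    }
}
```

For instance, `distanceWithin("hippo", "elephant", 6)` gives -1, while a threshold of 7 returns the exact distance 7. Values inside the strip are exact whenever the true value is at most k, which is why the final comparison against k is sufficient.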


Interestingly, Li et al. proposed a new method that reduces this further to O((k+1)n).

I modified the Levenshtein distance VBA function found in the linked post to use a one-dimensional array. It runs much faster.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

The following variant additionally copies both strings into character arrays up front, so the inner loop indexes arrays instead of calling Mid$ for every comparison:

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function