C#: How to calculate a distance similarity measure of two given strings?


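I need to calculate the similarity between two strings. What exactly do I mean? Let me explain with an example:

  • The real word:
    hospital
  • The mistaken word:
    haspita
My aim is to determine how many characters of the mistaken word I need to modify to obtain the real word. In this example, I need to modify 2 letters. So what would the percentage be? I always take the length of the real word, so it becomes 2/8 = 25%, which makes the DSM (distance similarity measure) of these two strings 75%.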


How can I achieve this with performance being a key consideration?

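What you are looking for is called edit distance, or Levenshtein distance. The Wikipedia article explains how it is calculated and has a nice piece of pseudocode at the bottom that makes it very easy to code this algorithm in C#.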

Here is an implementation from the first site linked below:

private static int  CalcLevenshteinDistance(string a, string b)
    {
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int  lengthA   = a.Length;
    int  lengthB   = b.Length;
    var  distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
    for (int j = 0;  j <= lengthB;  distances[0, j] = j++);

    for (int i = 1;  i <= lengthA;  i++)
        for (int j = 1;  j <= lengthB;  j++)
            {
            int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
                (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
                );
            }
    return distances[lengthA, lengthB];
    }
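
To turn the distance into the percentage the question asks for, divide it by the length of the real word and subtract the result from 1. The following is a minimal sketch of that step, not the answer author's code: the class name `StringSimilarity`, the method names, and the clamping of negative results to 0 are assumptions made for illustration. It carries its own two-row Levenshtein routine so that it compiles on its own.

```csharp
using System;

public static class StringSimilarity
{
    // Levenshtein distance using two rolling rows instead of a full matrix.
    public static int LevenshteinDistance(string a, string b)
    {
        a = a ?? string.Empty;
        b = b ?? string.Empty;
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            // Swap the rows so "previous" always holds the last completed row.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[b.Length];
    }

    // Similarity in percent, relative to the length of the real word
    // (the convention used in the question): (1 - distance / length) * 100.
    // Assumption: results below 0 are clamped to 0.
    public static double SimilarityPercent(string realWord, string candidate)
    {
        if (string.IsNullOrEmpty(realWord))
            return string.IsNullOrEmpty(candidate) ? 100.0 : 0.0;

        int distance = LevenshteinDistance(realWord, candidate);
        double similarity = 1.0 - (double)distance / realWord.Length;
        return Math.Max(0.0, similarity) * 100.0;
    }
}
```

With the words from the question, `StringSimilarity.SimilarityPercent("hospital", "haspita")` gives 75, since the distance is 2 and the real word has 8 letters.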

There are a large number of string similarity distance algorithms that can be used. Some are listed here (but not exhaustively):

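  • Needleman-Wunsch
  • Smith-Waterman
  • Smith-Waterman-Gotoh
  • Jaro
  • Dice similarity
  • Monge-Elkan
A library that contains implementations of all of these is available in both Java and C#.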

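I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests, my code is about 10x faster than the C# example on Wikipedia, even when no maximum distance is supplied. When a maximum distance is supplied, the performance gain increases to 30x-100x+. Note a couple of key points for performance: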


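  • If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many comparisons (>, …).

Here is an alternative approach: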

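This is too long for a comment.

The typical way of finding similarity is Levenshtein distance, and there is no doubt a library with code available.

Unfortunately, this requires a comparison against every string. You might be able to write a specialized version of the code that short-circuits the calculation when the distance exceeds some threshold, but you would still have to do all of the comparisons.

Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words, or n genomic sequences, or n whatever). Keep a mapping of trigrams to strings and choose the strings with the biggest overlap. A typical choice of n is 3, hence the name.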

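For instance, "English" would have these trigrams:

  • Eng
  • ngl
  • gli
  • lis
  • ish
And "England" would have:

  • Eng
  • ngl
  • gla
  • lan
  • and
So 2 of the 8 distinct trigrams (or 4 of the 10 listed) match. If this works for you, you can index the trigram/string table and get a faster search.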


You can also combine this with Levenshtein to reduce the set of comparisons to those strings that have some minimum number of n-grams in common.
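
The trigram overlap described here can be sketched in a few lines of C#. This is an illustration under assumptions, not code from the answer: the class name `TrigramSimilarity`, its method names, and the choice of Jaccard (shared n-grams divided by distinct n-grams in either string) as the overlap score are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TrigramSimilarity
{
    // Returns the set of distinct character n-grams of a string (n = 3 gives trigrams).
    public static HashSet<string> NGrams(string text, int n = 3)
    {
        var grams = new HashSet<string>();
        if (string.IsNullOrEmpty(text) || n <= 0) return grams;

        for (int i = 0; i + n <= text.Length; i++)
            grams.Add(text.Substring(i, n));
        return grams;
    }

    // Number of distinct n-grams the two strings have in common.
    public static int SharedCount(string a, string b, int n = 3)
    {
        var shared = NGrams(a, n);
        shared.IntersectWith(NGrams(b, n));
        return shared.Count;
    }

    // Jaccard score: shared n-grams divided by distinct n-grams in either string.
    // Assumption: two strings with no n-grams at all score 1.0.
    public static double Jaccard(string a, string b, int n = 3)
    {
        var gramsA = NGrams(a, n);
        var gramsB = NGrams(b, n);
        if (gramsA.Count == 0 && gramsB.Count == 0) return 1.0;

        var union = new HashSet<string>(gramsA);
        union.UnionWith(gramsB);
        gramsA.IntersectWith(gramsB);
        return (double)gramsA.Count / union.Count;
    }
}
```

For "English" and "England", `SharedCount` returns 2 and `Jaccard` returns 0.25. As a pre-filter, only candidates whose shared count reaches some minimum would be passed on to the more expensive Levenshtein comparison.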

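Here is my implementation of Damerau-Levenshtein distance, which returns not only the similarity coefficient but also the locations of the errors in the corrected word (this feature can be used in text editors). My implementation also supports different weights for the mistake types (substitution, deletion, insertion, transposition).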

    /// <summary>
    /// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
    /// integers, where each integer represents the code point of a character in the source string.
    /// Includes an optional threshold which can be used to indicate the maximum allowable distance.
    /// </summary>
    /// <param name="source">An array of the code points of the first string</param>
    /// <param name="target">An array of the code points of the second string</param>
    /// <param name="threshold">Maximum allowable distance</param>
    /// <returns>int.MaxValue if the threshold is exceeded; otherwise the Damerau-Levenshtein distance between the strings</returns>
    public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
    
        int length1 = source.Length;
        int length2 = target.Length;
    
        // Return trivial case - difference in string lengths exceeds threshold
        if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
    
        // Ensure arrays [i] / length1 use shorter length 
        if (length1 > length2) {
            Swap(ref target, ref source);
            Swap(ref length1, ref length2);
        }
    
        int maxi = length1;
        int maxj = length2;
    
        int[] dCurrent = new int[maxi + 1];
        int[] dMinus1 = new int[maxi + 1];
        int[] dMinus2 = new int[maxi + 1];
        int[] dSwap;
    
        for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
    
        int jm1 = 0, im1 = 0, im2 = -1;
    
        for (int j = 1; j <= maxj; j++) {
    
            // Rotate
            dSwap = dMinus2;
            dMinus2 = dMinus1;
            dMinus1 = dCurrent;
            dCurrent = dSwap;
    
            // Initialize
            int minDistance = int.MaxValue;
            dCurrent[0] = j;
            im1 = 0;
            im2 = -1;
    
            for (int i = 1; i <= maxi; i++) {
    
                int cost = source[im1] == target[jm1] ? 0 : 1;
    
                int del = dCurrent[im1] + 1;
                int ins = dMinus1[i] + 1;
                int sub = dMinus1[im1] + cost;
    
                //Fastest execution for min value of 3 integers
                int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
    
                if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                    min = Math.Min(min, dMinus2[im2] + cost);
    
                dCurrent[i] = min;
                if (min < minDistance) { minDistance = min; }
                im1++;
                im2++;
            }
            jm1++;
            if (minDistance > threshold) { return int.MaxValue; }
        }
    
        int result = dCurrent[maxi];
        return (result > threshold) ? int.MaxValue : result;
    }
    
    static void Swap<T>(ref T arg1,ref T arg2) {
        T temp = arg1;
        arg1 = arg2;
        arg2 = temp;
    }
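
The function above takes `int[]` arguments, so each word has to be converted once before it is compared. The helper below is a minimal sketch of that conversion; the class name `CodePointConverter` and the method `ToCodePoints` are hypothetical and not part of the original answer. It assumes that a surrogate pair should become a single code point, which matches the "code point" wording in the doc comment, and that an unpaired surrogate is kept as its raw UTF-16 value.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointConverter
{
    // Converts a string to an array of Unicode code points.
    // A valid surrogate pair becomes one integer; every other char
    // (including an unpaired surrogate) is copied as its UTF-16 value.
    public static int[] ToCodePoints(string text)
    {
        if (string.IsNullOrEmpty(text)) return new int[0];

        var codePoints = new List<int>(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                codePoints.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                codePoints.Add(text[i]);
            }
        }
        return codePoints.ToArray();
    }
}
```

When the same dictionary words are compared repeatedly, convert each word once, cache the resulting arrays, and pass them to `DamerauLevenshteinDistance(source, target, threshold)`.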
    
    public static List<Mistake> OptimalStringAlignmentDistance(
      string word, string correctedWord,
      bool transposition = true,
      int substitutionCost = 1,
      int insertionCost = 1,
      int deletionCost = 1,
      int transpositionCost = 1)
    {
        int w_length = word.Length;
        int cw_length = correctedWord.Length;
        var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
        var result = new List<Mistake>(Math.Max(w_length, cw_length));
    
        if (w_length == 0)
        {
            for (int i = 0; i < cw_length; i++)
                result.Add(new Mistake(i, CharMistakeType.Insertion));
            return result;
        }
    
        for (int i = 0; i <= w_length; i++)
            d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
    
        for (int j = 0; j <= cw_length; j++)
            d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
    
        for (int i = 1; i <= w_length; i++)
        {
            for (int j = 1; j <= cw_length; j++)
            {
                bool equal = correctedWord[j - 1] == word[i - 1];
                int delCost = d[i - 1, j].Key + deletionCost;
                int insCost = d[i, j - 1].Key + insertionCost;
                int subCost = d[i - 1, j - 1].Key;
                if (!equal)
                    subCost += substitutionCost;
                int transCost = int.MaxValue;
                if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
                {
                    transCost = d[i - 2, j - 2].Key;
                    if (!equal)
                        transCost += transpositionCost;
                }
    
                int min = delCost;
                CharMistakeType mistakeType = CharMistakeType.Deletion;
                if (insCost < min)
                {
                    min = insCost;
                    mistakeType = CharMistakeType.Insertion;
                }
                if (subCost < min)
                {
                    min = subCost;
                    mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
                }
                if (transCost < min)
                {
                    min = transCost;
                    mistakeType = CharMistakeType.Transposition;
                }
    
                d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
            }
        }
    
        int w_ind = w_length;
        int cw_ind = cw_length;
        while (w_ind >= 0 && cw_ind >= 0)
        {
            switch (d[w_ind, cw_ind].Value)
            {
                case CharMistakeType.None:
                    w_ind--;
                    cw_ind--;
                    break;
                case CharMistakeType.Substitution:
                    result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
                    w_ind--;
                    cw_ind--;
                    break;
                case CharMistakeType.Deletion:
                    result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
                    w_ind--;
                    break;
                case CharMistakeType.Insertion:
                    result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
                    cw_ind--;
                    break;
                case CharMistakeType.Transposition:
                    result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
                    w_ind -= 2;
                    cw_ind -= 2;
                    break;
            }
        }
        if (d[w_length, cw_length].Key > result.Count)
        {
            int delMistakesCount = d[w_length, cw_length].Key - result.Count;
            for (int i = 0; i < delMistakesCount; i++)
                result.Add(new Mistake(0, CharMistakeType.Deletion));
        }
    
        result.Reverse();
    
        return result;
    }
    
    public struct Mistake
    {
        public int Position;
        public CharMistakeType Type;
    
        public Mistake(int position, CharMistakeType type)
        {
            Position = position;
            Type = type;
        }
    
        public override string ToString()
        {
            return Position + ", " + Type;
        }
    }
    
    public enum CharMistakeType
    {
        None,
        Substitution,
        Insertion,
        Deletion,
        Transposition
    }
    
    Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
        Dim cost(v1.Length, v2.Length) As Integer
        If v1.Length = 0 Then
            Return v2.Length                'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
        ElseIf v2.Length = 0 Then
            Return v1.Length                'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
        Else
            'setup the base costs for inserting the correct characters
            For v1Count As Integer = 0 To v1.Length
                cost(v1Count, 0) = v1Count
            Next v1Count
            For v2Count As Integer = 0 To v2.Length
                cost(0, v2Count) = v2Count
            Next v2Count
            'now work out the cheapest route to having the correct characters
            For v1Count As Integer = 1 To v1.Length
                For v2Count As Integer = 1 To v2.Length
                    'the first min term is the cost of editing the character in place (the cost-to-date, or the cost-to-date + 1 if a change is required),
                    'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1), and
                    'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1)
                    cost(v1Count, v2Count) = Math.Min(
                        cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
                        Math.Min(
                            cost(v1Count - 1, v2Count) + 1,
                            cost(v1Count, v2Count - 1) + 1
                        )
                    )
                Next v2Count
            Next v1Count
    
            'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
            'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
            Return cost(v1.Length, v2.Length)
        End If
    End Function