C# Dice-Sørensen distance calculation error without using the Intersect method


c# .net string


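I have been writing an object that calculates the Dice-Sørensen distance between two strings. The logic is not that difficult: count the two-letter pairs (bigrams) in one string, compare them with those of the second string, then evaluate 2|X ∩ Y| / (|X| + |Y|)

where |X| and |Y| are the numbers of bigram elements in X and Y. For further clarification, see the Wikipedia article on the Sørensen-Dice coefficient.

I looked in various places online for how to implement this, but every approach I found uses the Intersect method between the two lists. As far as I can tell that will not work, because if a bigram already exists in the result it will not be added a second time. For example, for the string "aaaa" I would expect three "aa" bigrams, but Intersect yields only one. If I am wrong about this, please tell me, because I would like to know why so many people use Intersect. My assumption is based on the MSDN documentation.

Here is the code I wrote: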

public static double SorensenDiceDistance(this string source, string target)
{
    // formula 2|X intersection Y|
    //         --------------------
    //          |X|     +     |Y|

    //create variables needed
    List<string> bigrams_source = new List<string>();
    List<string> bigrams_target = new List<string>();

    int source_length;
    int target_length;
    double intersect_count = 0;
    double result = 0;

    Console.WriteLine("DEBUG: string length source is " + source.Length);

    //base case
    if (source.Length == 0 || target.Length == 0)
    {
        return 0;
    }

    //extract bigrams from string 1
    bigrams_source = source.ListBiGrams();
    //extract bigrams from string 2
    bigrams_target = target.ListBiGrams();

    source_length = bigrams_source.Count();
    target_length = bigrams_target.Count();
    Console.WriteLine("DEBUG: bigram counts are source: " + source_length + " . target length : " + target_length);
    //now we have two sets of bigrams compare them in a non distinct loop

    for (int i = 0; i < bigrams_source.Count(); i++)
    {
        for (int y = 0; y < bigrams_target.Count(); y++)
        {
            if (bigrams_source.ElementAt(i) == bigrams_target.ElementAt(y))
            {
                intersect_count++;
                //Console.WriteLine("intersect count is :" + intersect_count);
            }
        }
    }
    Console.WriteLine("intersect line value : " + intersect_count);

    result = (2 * intersect_count) / (source_length + target_length);

    if (result < 0)
    {
        result = Math.Abs(result);
    }

    return result;
}

In the code you can see that I call a method named ListBiGrams. This is what it looks like:
public static List<string> ListBiGrams(this string source)
{
    return ListNGrams(source, 2);
}

public static List<string> ListTriGrams(this string source)
{
    return ListNGrams(source, 3);
}

public static List<string> ListNGrams(this string source, int n)
{
    List<string> nGrams = new List<string>();

    if (n > source.Length)
    {
        return null;
    }
    else if (n == source.Length)
    {
        nGrams.Add(source);
        return nGrams;
    }
    else
    {
        for (int i = 0; i < source.Length - n; i++)
        {
            nGrams.Add(source.Substring(i, n));
        }

        return nGrams;
    }
}
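
A sliding window over a string of length L produces L - n + 1 overlapping n-grams, so "aaaa" should give three bigrams. The loop condition `i < source.Length - n` above stops one window early, and the `n > source.Length` branch returns null, which makes a later `Count()` call throw. The following is a minimal sketch of the extraction under the assumption that every overlapping n-gram is wanted; `NGramSketch` and `SlidingNGrams` are hypothetical names, not part of the code in the question.

```csharp
using System;
using System.Collections.Generic;

public static class NGramSketch
{
    // Hypothetical helper: returns every overlapping n-gram, duplicates included.
    public static List<string> SlidingNGrams(string source, int n)
    {
        var nGrams = new List<string>();
        if (source == null || n <= 0 || n > source.Length)
        {
            // Return an empty list rather than null so callers can still read Count.
            return nGrams;
        }

        // Inclusive upper bound: the last window starts at index Length - n.
        for (int i = 0; i <= source.Length - n; i++)
        {
            nGrams.Add(source.Substring(i, n));
        }
        return nGrams;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SlidingNGrams("aaaa", 2))); // aa,aa,aa
        Console.WriteLine(SlidingNGrams("night", 2).Count);            // 4
    }
}
```

With the inclusive bound, the separate `n == source.Length` case is no longer needed, because the loop then runs exactly once.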

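So my step-by-step understanding of the code is:
1) The strings are passed in
2) Zero-length check
3) Create the lists and fill them with bigrams
4) Get the length of each bigram list
5) Nested loop that checks the bigram at source position [i] against every bigram in the target, then increments i until there are no more source bigrams to check
6) Evaluate the equation above, taken from Wikipedia
7) If the result is negative, Math.Abs it to return a positive result (I know the result should be between 0 and 1, which is what tells me I am doing something wrong)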
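
One thing worth checking in step 5: the nested loop counts every matching pair (i, y), not the number of shared bigrams, so a repeated bigram is counted once per combination. The sketch below illustrates this with three copies of "aa" on each side; `PairCountSketch` and `CountMatchingPairs` are hypothetical names used only for this illustration. It shows how the score can leave the 0 to 1 range, but it does not by itself account for a negative value.

```csharp
using System;
using System.Collections.Generic;

public static class PairCountSketch
{
    // Counts every pair of equal bigrams, the way the nested loop in step 5 does.
    public static int CountMatchingPairs(IList<string> a, IList<string> b)
    {
        int pairs = 0;
        foreach (string x in a)
        {
            foreach (string y in b)
            {
                if (x == y)
                {
                    pairs++;
                }
            }
        }
        return pairs;
    }

    public static void Main()
    {
        var bigrams = new List<string> { "aa", "aa", "aa" };
        int pairs = CountMatchingPairs(bigrams, bigrams);
        double score = (2.0 * pairs) / (bigrams.Count + bigrams.Count);

        Console.WriteLine(pairs); // 9: each of the 3 source bigrams matches all 3 targets
        Console.WriteLine(score); // 3: well above the expected maximum of 1
    }
}
```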
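The source string I am using is source = "this is not a correct string" and the target string is target = "this is a correct string".
The result I get is -0.0908.
I am 99% sure that I am missing something small, such as a miscalculated length or a miscount somewhere. If anyone can point out what I am doing wrong, I would really appreciate it. Thanks for your time.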
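This looks like homework, but string similarity measures are new to me, so I took a look.

As you may have noticed, the C# version uses a hash set and takes advantage of its intersection method.
A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
That resolves your "aaaa" puzzle: there is only one bigram there.

If you like LINQ, then I suggest Enumerable.Distinct, Enumerable.Union and Enumerable.Intersect. These should mimic the duplicate-removal behaviour of HashSet quite well.
I also found this nice article written in Scala.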
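
The set-based reading described in this answer could be sketched roughly as follows. This is only an illustration under the assumption that bigrams are treated as a set of distinct values; `SetDiceSketch`, `BigramSet` and `Coefficient` are hypothetical names, and non-null inputs are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SetDiceSketch
{
    // Distinct bigrams of a string; HashSet silently drops duplicates.
    private static HashSet<string> BigramSet(string s)
    {
        var set = new HashSet<string>();
        for (int i = 0; i + 2 <= s.Length; i++)
        {
            set.Add(s.Substring(i, 2));
        }
        return set;
    }

    // 2|X ∩ Y| / (|X| + |Y|) over sets of distinct bigrams.
    public static double Coefficient(string source, string target)
    {
        HashSet<string> x = BigramSet(source);
        HashSet<string> y = BigramSet(target);
        if (x.Count + y.Count == 0)
        {
            return 0.0; // neither string is long enough to contain a bigram
        }

        int shared = x.Intersect(y).Count();
        return (2.0 * shared) / (x.Count + y.Count);
    }

    public static void Main()
    {
        Console.WriteLine(Coefficient("night", "nacht")); // 0.25 (only "ht" is shared)
        Console.WriteLine(Coefficient("aaaa", "aaaa"));   // 1
    }
}
```

Because both the numerator and the denominator are computed over distinct bigrams, the result stays between 0 and 1.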
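Hi Andrei, thanks for looking into this, but the Intersect method is exactly what I am trying to avoid. Intersect only adds one 'aa' bigram from the string being tested, whereas in my setting I need every bigram that occurs in the string, even if it has already appeared. So the string "aaaa" should produce the bigrams [aa], [aa], [aa]. Also, it is not homework.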
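
If duplicates have to count, one possible approach is a multiset (bag) intersection: count how often each bigram occurs in each string and, for every shared bigram, add the smaller of the two counts. The sketch below swaps this technique in for Intersect; it is an assumption about what is wanted here, not something confirmed in the thread, and `MultisetDiceSketch`, `BigramCounts` and `Coefficient` are hypothetical names. Non-null inputs are assumed.

```csharp
using System;
using System.Collections.Generic;

public static class MultisetDiceSketch
{
    // Maps each bigram to the number of times it occurs in s.
    private static Dictionary<string, int> BigramCounts(string s)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + 2 <= s.Length; i++)
        {
            string gram = s.Substring(i, 2);
            counts.TryGetValue(gram, out int seen); // seen is 0 when the key is absent
            counts[gram] = seen + 1;
        }
        return counts;
    }

    // 2|X ∩ Y| / (|X| + |Y|), where X and Y are multisets of bigrams.
    public static double Coefficient(string source, string target)
    {
        int sourceTotal = Math.Max(source.Length - 1, 0); // bigrams incl. duplicates
        int targetTotal = Math.Max(target.Length - 1, 0);
        if (sourceTotal + targetTotal == 0)
        {
            return 0.0;
        }

        Dictionary<string, int> x = BigramCounts(source);
        Dictionary<string, int> y = BigramCounts(target);

        int shared = 0;
        foreach (KeyValuePair<string, int> entry in x)
        {
            if (y.TryGetValue(entry.Key, out int other))
            {
                // A bigram is shared as many times as it occurs in both strings.
                shared += Math.Min(entry.Value, other);
            }
        }
        return (2.0 * shared) / (sourceTotal + targetTotal);
    }

    public static void Main()
    {
        Console.WriteLine(Coefficient("aaaa", "aaaa")); // 1
        Console.WriteLine(Coefficient("aaaa", "aa"));   // 0.5
    }
}
```

Taking the minimum count keeps the shared total at or below the size of the smaller multiset, so the score cannot exceed 1.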