C#: Trying to optimise fuzzy matching

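I have 2,500,000 product names and I want to try to group them together, i.e. find products that have similar names. For example, I could have three products:

Heinz Baked Beans 400g
Hz Bkd Beans 400g
Heinz Beans 400g

that are actually the same product and can be merged together.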

My plan is to use an implementation of Jaro distance to find matches. The process works as follows:

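Make a big list of all the product names in memory
Pick the first product in the list
Compare it to every product that comes after it in the list and calculate a "Jaro score"
Report any products with a high match (say 0.95f or higher)
Move to the next product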

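So this has some optimisation in that it only matches each pair of products one way round, saving half the processing time.

I coded this up and tested it. It works perfectly and found dozens of matches to investigate.

It takes roughly 20 seconds to compare one product to 2,500,000 other products and calculate the "Jaro score". Assuming my calculations are correct, this means it will take the best part of a year to complete the processing.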
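As a rough check on that estimate, taking the 20 seconds per product and the halving from one-way matching at face value (the halving is an average: early products are compared against nearly the whole list, late ones against almost nothing):

```latex
\[
T \approx \frac{2\,500\,000 \times 20\ \text{s}}{2}
  = 25\,000\,000\ \text{s}
  \approx 289\ \text{days}
\]
```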

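Obviously this isn't practical.

I have had colleagues go over the code and they managed to make the Jaro score calculation about 20% faster. They also made the process multithreaded, which made it a little faster still. We also removed some of the information being stored, reducing it to just a list of product names and unique identifiers; this didn't seem to make any difference to the processing time.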
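For readers wondering what the multithreaded version might look like, here is a minimal sketch, not the code my colleagues wrote. `ParallelMatchSketch`, `FindMatches` and the `score` delegate are hypothetical names; `score` stands in for `Jaro.GetJaro`. Results are collected in a `ConcurrentBag` rather than added to the grid inside the loop, on the assumption that the grid is a UI control and should not be touched from worker threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelMatchSketch
{
    // Hypothetical helper: compares every name with the names that follow it,
    // running the outer loop in parallel. 'score' stands in for Jaro.GetJaro.
    public static List<(int First, int Second, float Score)> FindMatches(
        IReadOnlyList<string> names, Func<string, string, float> score, float target)
    {
        // Thread-safe collection for hits found by the worker threads
        var hits = new ConcurrentBag<(int, int, float)>();

        Parallel.For(0, names.Count, i =>
        {
            // One-way matching: only look at products after position i
            for (int j = i + 1; j < names.Count; j++)
            {
                float s = score(names[i], names[j]);
                if (s > target)
                    hits.Add((i, j, s));
            }
        });

        // Copy to a list and sort so the output order is deterministic
        var result = new List<(int First, int Second, float Score)>(hits);
        result.Sort();
        return result;
    }
}
```

Parallelising divides the running time by, at best, the number of cores; it does not change the number of comparisons, which is why it only helps "a little" at this scale.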

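Even with these improvements we still think this will take a number of months to process, and we need it to take hours (or a few days at most).

I don't want to go into too much detail as I don't think it is entirely relevant, but I load the product details into a list: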

private class Product
{
    public int MemberId;
    public string MemberName;
    public int ProductId;
    public string ProductCode;
    public string ProductName;
}
private class ProductList : List<Product> { }
private readonly ProductList _pl = new ProductList();

Then I use the following to process each product:

//Outer loop over matchCount...
var match = _pl[matchCount];

for (int count = 1; count < _pl.Count; count++)
{
    var search = _pl[count];
    //Don't match products with themselves (redundant in a one-tailed match)
    if (search.MemberId == match.MemberId && search.ProductId == match.ProductId)
        continue;
    float jaro = Jaro.GetJaro(search.ProductName, match.ProductName);

    //We only log matches that pass the criteria
    if (jaro > target)
    {
        //Load the details into the grid
        var row = new string[7];
        row[0] = search.MemberName;
        row[1] = search.ProductCode;
        row[2] = search.ProductName;
        row[3] = match.MemberName;
        row[4] = match.ProductCode;
        row[5] = match.ProductName;
        row[6] = (jaro*100).ToString("#,##0.0000");
        JaroGrid.Rows.Add(row);
    }
}

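I think for the purposes of this question we can assume that the Jaro.GetJaro method is a "black box", i.e. it doesn't really matter how it works, as this part of the code has been optimised as much as possible and I can't see how it could be improved.

Any ideas for a better way to fuzzy match this list of products?

I am wondering whether there is a "clever" way to pre-process the list so that most of the matches come out at the start of the matching process. For example, if it takes 3 months to compare all products but only 3 days to compare "likely" products, then we could live with that.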
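To make the idea of "likely" products concrete, here is one possible reading, sketched under assumptions: treat two products as likely matches only if their names share at least one word, and run the Jaro comparison on those pairs alone. `BlockingSketch` and `CandidatePairs` are hypothetical names, and splitting on spaces is an assumption about how the names are formatted. This would miss pairs that share no whole word, and very common words (such as a pack size) would still produce large groups.

```csharp
using System;
using System.Collections.Generic;

public static class BlockingSketch
{
    // Hypothetical pre-processing step: index each name under its upper-cased
    // words, then yield only the index pairs (i < j) that share a word.
    public static IEnumerable<(int First, int Second)> CandidatePairs(IReadOnlyList<string> names)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < names.Count; i++)
        {
            var words = names[i].ToUpperInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            // HashSet so a word repeated within one name is indexed once
            foreach (var word in new HashSet<string>(words))
            {
                if (!index.TryGetValue(word, out var bucket))
                    index[word] = bucket = new List<int>();
                bucket.Add(i);
            }
        }

        // A pair sharing several words appears in several buckets; report it once.
        // Note: this set can grow large on a real data set.
        var seen = new HashSet<(int, int)>();
        foreach (var bucket in index.Values)
            for (int a = 0; a < bucket.Count; a++)
                for (int b = a + 1; b < bucket.Count; b++)
                    if (seen.Add((bucket[a], bucket[b])))
                        yield return (bucket[a], bucket[b]);
    }
}
```

Each pair returned would then be scored with Jaro.GetJaro as before; whether the reduction is enough depends entirely on how the words are distributed across the real product names.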

OK, two points keep coming up. Firstly, yes, I do take advantage of a one-tailed matching process. The real code is:

for (int count = matchCount + 1; count < _pl.Count; count++)

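I regret posting the amended version; I was trying to simplify things a little (bad idea).

Secondly, a lot of people want to see the Jaro code, so here it is (it is rather long and wasn't originally mine; I may even have found it on here somewhere). By the way, I love the idea of exiting before completion as soon as a bad match is indicated. I will start looking at that now.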

using System;
using System.Text;

namespace EPICFuzzyMatching
{
    public static class Jaro
    {
        private static string CleanString(string clean)
        {
            clean = clean.ToUpper();
            return clean;
        }

        //Gets the similarity of the two strings using Jaro distance
        //param string1 the first input string
        //param string2 the second input string
        //return a value between 0-1 of the similarity
        public static float GetJaro(String string1, String string2)
        {
            //Clean the strings, we do some tricks here to help matching
            string1 = CleanString(string1);
            string2 = CleanString(string2);

            //Get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
            int halflen = ((Math.Min(string1.Length, string2.Length)) / 2) + ((Math.Min(string1.Length, string2.Length)) % 2);

            //Get common characters
            String common1 = GetCommonCharacters(string1, string2, halflen);
            String common2 = GetCommonCharacters(string2, string1, halflen);

            //Check for zero in common
            if (common1.Length == 0 || common2.Length == 0)
                return 0.0f;

            //If the two common strings differ in length, treat the pair as not matching and return 0.0f
            if (common1.Length != common2.Length)
                return 0.0f;

            //Get the number of transpositions
            int transpositions = 0;
            int n = common1.Length;
            for (int i = 0; i < n; i++)
            {
                if (common1[i] != common2[i])
                    transpositions++;
            }
            transpositions /= 2;

            //Calculate jaro metric
            return (common1.Length / ((float)string1.Length) + common2.Length / ((float)string2.Length) + (common1.Length - transpositions) / ((float)common1.Length)) / 3.0f;
        }

        //Returns the characters of string1 that also occur in string2 within a window of
        //distanceSep positions around the same index.
        //param string1 the string whose characters are looked up
        //param string2 the string that is searched
        //param distanceSep the distance searched to either side of the current index
        //return the matched characters, in the order they appear in string1
        private static String GetCommonCharacters(String string1, String string2, int distanceSep)
        {
            //Create a return buffer of characters
            var returnCommons = new StringBuilder(string1.Length);

            //Create a copy of string2 for processing
            var copy = new StringBuilder(string2);

            //Iterate over string1
            int n = string1.Length;
            int m = string2.Length;
            for (int i = 0; i < n; i++)
            {
                char ch = string1[i];

                //Set boolean for quick loop exit if found
                bool foundIt = false;

                //Compare char with range of characters to either side
                for (int j = Math.Max(0, i - distanceSep); !foundIt && j < Math.Min(i + distanceSep, m); j++)
                {
                    //Check if found
                    if (copy[j] == ch)
                    {
                        foundIt = true;
                        //Append character found
                        returnCommons.Append(ch);
                        //Alter copied string2 for processing
                        copy[j] = (char)0;
                    }
                }
            }
            return returnCommons.ToString();
        }
    }
}
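
On the early-exit idea mentioned above, one cheap check that follows from the formula in GetJaro is a bound from the string lengths alone: the number of common characters cannot exceed the shorter length, so the score cannot exceed (2 + shorter/longer) / 3. A minimal sketch, with `JaroBoundSketch`, `MaxPossibleJaro` and `CanSkip` as hypothetical names that are not part of the code above:

```csharp
using System;

public static class JaroBoundSketch
{
    // Upper bound on the Jaro score from the two lengths alone:
    // with m <= min(len1, len2) common characters and zero transpositions,
    // score <= (m/len1 + m/len2 + 1) / 3 <= (2 + min/max) / 3.
    public static float MaxPossibleJaro(int len1, int len2)
    {
        if (len1 == 0 || len2 == 0)
            return 0.0f;
        float shorter = Math.Min(len1, len2);
        float longer = Math.Max(len1, len2);
        return (2.0f + shorter / longer) / 3.0f;
    }

    // True when the pair cannot reach the target, so GetJaro need not be called.
    public static bool CanSkip(string a, string b, float target)
    {
        return MaxPossibleJaro(a.Length, b.Length) < target;
    }
}
```

For a target of 0.95 this means the shorter name must be at least 85% of the length of the longer one. If the list were sorted by name length, the inner loop could presumably stop as soon as the lengths drift too far apart, though I have not measured how much that would save on the real data.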
