String similarity for name matching

Tags: string, algorithm, matching, string-matching

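I'm trying to determine whether two different restaurant names are similar, so that I can match them. The names might be misspelled, or the parts of the title might be in the wrong order.

In some cases it is a simple match: "Angry Diner" with "Angry Diner Restaurant", or "Burger King" with "Burgor King".

A harder case I have found is: "Mathias Dahlgren Matbaren" and "Restaurant Mathias Dahlgren".

I have looked at several fuzzy string difference algorithms but have not found one that suits this use case.

Does anyone know of an algorithm and/or library I could use?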

You could try a diff algorithm. It creates all possible strings and finds the longest common subsequence.

Well, as mentioned above, the speed of similar_text is O(N^3). I've written a longest common subsequence version that is O(m*n), where m and n are the lengths of str1 and str2. The result is a percentage, and it seems to be exactly the same as the similar_text percentage but with better performance. Here are the 3 functions I'm using:

<?php 
function LCS_Length($s1, $s2) 
{ 
  $m = strlen($s1); 
  $n = strlen($s2); 

  // DP table of size (m+1) x (n+1), initialised to 0; row 0 and column 0
  // are the base case (the LCS with an empty prefix has length 0)
  $LCS_Length_Table = array_fill(0, $m + 1, array_fill(0, $n + 1, 0));

  for ($i=1; $i <= $m; $i++) { 
    for ($j=1; $j <= $n; $j++) { 
      if ($s1[$i-1]==$s2[$j-1]) 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j-1] + 1; 
      else if ($LCS_Length_Table[$i-1][$j] >= $LCS_Length_Table[$i][$j-1]) 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j]; 
      else 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i][$j-1]; 
    } 
  } 
  return $LCS_Length_Table[$m][$n]; 
} 

function str_lcsfix($s) 
{ 
  $s = str_replace(" ","",$s); 
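  // preg_replace with the /u flag handles UTF-8 input (ereg_replace no longer exists as of PHP 7).
  // Assumed character sets for each base letter; extend them as needed.
  $s = preg_replace('/[èéêëÈÉÊË]/u', 'e', $s);
  $s = preg_replace('/[àáâãäåÀÁÂÃÄÅ]/u', 'a', $s);
  $s = preg_replace('/[ìíîïÌÍÎÏ]/u', 'i', $s);
  $s = preg_replace('/[òóôõöÒÓÔÕÖ]/u', 'o', $s);
  $s = preg_replace('/[ùúûüÙÚÛÜ]/u', 'u', $s);
  $s = preg_replace('/[çÇ]/u', 'c', $s);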
  return $s; 
} 

function get_lcs($s1, $s2) 
{ 
  // strip spaces and fold accents (str_lcsfix), then lowercase
  $s1 = strtolower(str_lcsfix($s1)); 
  $s2 = strtolower(str_lcsfix($s2)); 

  $lcs = LCS_Length($s1,$s2); //longest common sub sequence 

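  $ms = (strlen($s1) + strlen($s2)) / 2; // mean length of the two strings
  if ($ms == 0) return 0; // both strings empty after normalisation: avoid division by zero

  return (($lcs*100)/$ms); // LCS length as a percentage of the mean length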
} 
?> 

You can skip calling str_lcsfix if you are not concerned about accented characters and the like, or you can extend or modify it for better performance; I suspect ereg is not the fastest approach.
Hope this helps.
Georges

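I think the best-fitting algorithm would be an optimal local alignment algorithm.

It is a variant of the Levenshtein algorithm, with the difference that inserting or deleting characters at the beginning or end is not penalized.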
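To make the local alignment idea concrete, here is a minimal Python sketch of a Smith-Waterman style score with a linear gap penalty. The function names and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration, not something given in the thread. Unmatched text before or after the best-matching region costs nothing, because the score is never allowed to drop below zero.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for ca in a:
        cur = [0]               # column 0: aligning against an empty prefix
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Clamp at 0 so an alignment may start anywhere for free.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def local_similarity(a, b, match=2):
    """Score normalised by the best possible score for the shorter string."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return local_alignment_score(a, b, match=match) / (match * shortest)


print(local_similarity("Angry Diner", "Angry Diner Restaurant"))
print(local_similarity("Burger King", "Burgor King"))
print(local_similarity("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren"))
```

Normalising by the shorter string is one possible design choice: it lets a name that is fully contained in a longer one score 1.0, at the price of being lenient toward very short names.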

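First of all: you will get better results if you can match on more than just the name, for example the address. You can then use a record linkage engine to consider the evidence from all the properties. In most cases, using the name alone gives poor precision.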

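The first thing to consider is whether you are likely to see substrings reordered, for example "Angry Restaurant" versus "Restaurant Angry". In that case, q-grams, longest common substring, and longest common subsequence are all good candidates. With q-grams you can choose between various sub-formulas and matchings.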
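As a rough sketch of the q-gram approach, the Python below compares the sets of character bigrams of two names using the Dice coefficient. The function names, the choice of q = 2, and the Dice formula are assumptions made for illustration; other set measures such as Jaccard would fit the same structure.

```python
def qgrams(s, q=2):
    """Set of all length-q substrings of the lowercased string."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram sets: 2*|A & B| / (|A| + |B|)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # neither string has any q-gram; treat them as equal
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(qgram_similarity("Angry Restaurant", "Restaurant Angry"))
print(qgram_similarity("Burger King", "Burgor King"))
```

Because the q-grams are collected into a set, word order only affects the few q-grams that span a word boundary, which is why reordered names still score highly.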

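If you want order to matter, affine gap may suit this particular case. It is similar to Smith-Waterman, but does not penalize deletions as heavily. Basically, the first deletion is expensive, but subsequent deletions at the same position are cheaper.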
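The affine gap cost model can be sketched in Python as follows. This is a global (Gotoh-style) alignment cost, not the exact variant any particular library implements, and the function name and cost values (mismatch 1, gap open 3, gap extend 1) are assumptions for illustration. A gap of length k costs gap_open + (k - 1) * gap_extend, so one long run of missing characters is much cheaper than many scattered ones.

```python
import math


def affine_gap_cost(a, b, mismatch=1, gap_open=3, gap_extend=1):
    """Minimum alignment cost of a and b under an affine gap penalty."""
    m, n = len(a), len(b)
    inf = math.inf
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column deletes a[i-1] (gap in b)
    # Y: last column inserts b[j-1] (gap in a)
    M = [[inf] * (n + 1) for _ in range(m + 1)]
    X = [[inf] * (n + 1) for _ in range(m + 1)]
    Y = [[inf] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a new gap is expensive; extending the current one is cheap.
            X[i][j] = min(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open, X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return min(M[m][n], X[m][n], Y[m][n])


print(affine_gap_cost("Burger King", "Burgor King"))
print(affine_gap_cost("Angry Diner", "Angry Diner Restaurant"))
```

With these example costs, the 11 characters of " Restaurant" appended to "Angry Diner" cost 3 + 10 * 1 = 13 as a single gap, whereas charging gap_open for every character would cost 33.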

As others have suggested, removing common words such as "Restaurant", "Matbaren", and so on before matching will probably improve accuracy.
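A minimal Python sketch of this preprocessing step is shown below. The stop-word list is hypothetical (it uses only the words mentioned in this thread and would need extending for real data), and sorting the remaining tokens is an added assumption meant to neutralise word-order differences before any fuzzy comparison is applied.

```python
# Hypothetical stop-word list built from the words mentioned in the thread;
# extend it for your own data set.
STOP_WORDS = {"restaurant", "matbaren", "place", "palace"}


def normalize_name(name):
    """Lowercase, drop stop words, and sort the remaining tokens."""
    tokens = [t for t in name.lower().split() if t not in STOP_WORDS]
    return " ".join(sorted(tokens))


print(normalize_name("Mathias Dahlgren Matbaren"))    # dahlgren mathias
print(normalize_name("Restaurant Mathias Dahlgren"))  # dahlgren mathias
```

After this step the harder example reduces to two identical strings, and a character-level measure only has to deal with the remaining misspellings.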

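There are piles of libraries, but since no programming language was specified it is hard to recommend one. What use is a Java library if you are using PHP, or vice versa?

But take careful note of what I wrote above: names alone will not work very well. Even when the names are identical, they may be two entirely different restaurants.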

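Depending on what exactly you have in mind, your question might be seen as a duplicate of the following question.

I have looked at and tried Levenshtein distance, but it does not work when the words are reordered.

Do you mean, for example, that the distance between "Burgor King" and "King Burger" should be smaller than the Levenshtein distance?

Could you remove words such as {"Restaurant", "Place", "Palace", ...} from both names before applying the fuzzy string difference algorithm?

@Codor For that example, Levenshtein distance is probably the best choice. However, it does not perform well on the harder one.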

penalty("Angry Diner","Angry Diner Restaurant") = 0
penalty("Burger King", "Burgor King") = 1
penalty("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren") = 0