PHP: Levenshtein distance with accented characters
In PHP, I use the levenshtein() function to compute the Levenshtein distance. It works as expected for simple characters, but not for accented characters, as in this example:
echo levenshtein('à', 'a');
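It returns 2. Only one substitution is needed in this case, so I expected it to return 1.
Am I missing something?
Like many PHP functions, the default PHP levenshtein() is not multibyte-aware. When it processes a string containing Unicode characters, it treats each byte separately. In UTF-8, 'à' is encoded as two bytes, so the function ends up changing two bytes.
There is no multibyte version (i.e. no mb_levenshtein()), so you have two options:
1) Reimplement the function yourself using the mb_* functions: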
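The code that originally followed this option is not reproduced here. As a rough illustration only, a character-based reimplementation might look like the sketch below. The name levenshtein_mb_sketch is hypothetical (it is not a PHP built-in), and the sketch assumes valid UTF-8 input, PHP 7.4 or later, and the mbstring extension for mb_str_split().

```php
<?php
// Hypothetical helper, not part of PHP: Levenshtein distance counted in
// UTF-8 characters (code points) instead of bytes.
function levenshtein_mb_sketch(string $s1, string $s2): int
{
    // Split each string into an array of characters. Empty strings are
    // handled explicitly so the result is always an empty array.
    $a = $s1 === '' ? [] : mb_str_split($s1, 1, 'UTF-8');
    $b = $s2 === '' ? [] : mb_str_split($s2, 1, 'UTF-8');
    $lenA = count($a);
    $lenB = count($b);

    // If one side is empty, the distance is the length of the other side.
    if ($lenA === 0) {
        return $lenB;
    }
    if ($lenB === 0) {
        return $lenA;
    }

    // Two-row dynamic programming: $prev holds the distances for the
    // previous character of $a, $curr the distances being computed.
    $prev = range(0, $lenB);
    for ($i = 1; $i <= $lenA; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $lenB; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution (or match)
            );
        }
        $prev = $curr;
    }

    return $prev[$lenB];
}

// "\u{00E0}" is 'à', written as an escape so the file encoding does not matter.
echo levenshtein_mb_sketch("\u{00E0}", 'a'), "\n";
```

This counts code points, so a character written as a base letter plus a combining accent would still count as two units. Being pure PHP, it is also expected to be much slower than the built-in function.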
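I thought posting an answer to this question might be useful, so here it is:
The levenshtein function processes each byte of the input string individually. For multibyte encodings such as UTF-8, it may therefore give misleading results.
Example with French accented words:
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh?!)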
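The byte-level cause of the second result can be checked directly. The snippet below is a small illustration, assuming PHP 7.0 or later (for the "\u{...}" escape) and the mbstring extension. The accented word is written with an escape so that the result does not depend on the encoding of the source file.

```php
<?php
$plain  = 'notre';
$accent = "n\u{00F4}tre"; // 'nôtre', with 'ô' encoded as UTF-8

// strlen() counts bytes, mb_strlen() counts characters.
$bytes = strlen($accent);
$chars = mb_strlen($accent, 'UTF-8');

// 'ô' occupies two bytes in UTF-8, which is what levenshtein() sees.
$hex = bin2hex("\u{00F4}");

// Byte-wise comparison: one substitution plus one insertion.
$distance = levenshtein($plain, $accent);

printf("bytes=%d chars=%d hex=%s distance=%d\n", $bytes, $chars, $hex, $distance);
```

The word has five characters but six bytes, so turning 'notre' into 'nôtre' costs one substituted byte and one inserted byte.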
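You can easily find multibyte-compatible PHP implementations of the levenshtein function, but they are of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can be fed to the fast core levenshtein function.
Here is the conversion function I use in a search engine that stores UTF-8 strings, along with a quick benchmark. I hope this helps.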
<?php
// Convert a UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);
    return levenshtein($s1, $s2);
}
?>
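The comment above suggests pre-encoding the input only once when it is matched against many strings. One possible way to do that is sketched below. The helper closest_match is hypothetical and not part of the original answer. The conversion function is repeated (guarded by function_exists) only so that the block runs on its own.

```php
<?php
// Repeated from the answer above so this block is self-contained.
if (!function_exists('utf8_to_extended_ascii')) {
    function utf8_to_extended_ascii($str, &$map)
    {
        $matches = array();
        if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches)) {
            return $str; // plain ascii string
        }
        foreach ($matches[0] as $mbc) {
            if (!isset($map[$mbc])) {
                $map[$mbc] = chr(128 + count($map));
            }
        }
        return strtr($str, $map);
    }
}

// Hypothetical helper: find the candidate closest to $input.
// Returns array(best candidate, distance).
function closest_match($input, array $candidates)
{
    $map = array();                                  // shared by all strings
    $needle = utf8_to_extended_ascii($input, $map);  // encoded only once

    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        // New characters extend the shared map; existing entries never
        // change, so the already encoded $needle stays valid.
        $encoded = utf8_to_extended_ascii($candidate, $map);
        $distance = levenshtein($needle, $encoded);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    return array($best, $bestDistance);
}

// Usage: the accented words are written with escapes for 'ô'.
list($best, $distance) = closest_match(
    'notre',
    array("v\u{00F4}tre", "n\u{00F4}tre", 'autre')
);
```

Because the map is shared across the whole candidate set, the limit of 128 distinct non-ASCII characters stated in the function's comments applies to all strings together, so a large or very varied data set may need a fresh map per batch.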
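Results (about 6000 calls):
- reference time, core C function (single-byte): 30 ms
- utf8 to ext-ascii conversion + core function: 90 ms
- full PHP implementation: 3000 ms
You probably need a multibyte-compatible Levenshtein implementation; one (the first result on Google) explains it well.
I think option 1 suits my case better, because I also have to compute the Levenshtein distance for other Unicode characters. But I made a couple of corrections to it:
1. if ($length1 == 0) return $length2; was changed to if ($length2 == 0) return $length1; (because at this point $length1 is always greater than or equal to $length2).
2. if ($str1 == $str2) return 0; was moved to the start of the function, because it gives the result immediately.
Wouldn't if ($length1 == 0) return $length2; be better replaced by if ($length1 == 0) return 0;, since if $length1 is 0 at this point both strings are empty? In fact, if you moved if ($str1 == $str2) return 0; to the start, that case is already handled, so after the second change the first one is moot and the whole if ($length1 == 0) return 0; can be thrown out.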