Warning: file_get_contents(/data/phpspider/zhask/data//catemap/1/php/246.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Php 变音字符的Levenshtein距离_Php_Levenshtein Distance - Fatal编程技术网

Php 变音字符的Levenshtein距离

Php 变音字符的Levenshtein距离,php,levenshtein-distance,Php,Levenshtein Distance,在PHP中,我使用函数Levenshtein()计算Levenshtein距离。对于简单字符,它的工作原理与预期相同,但对于变音字符,如示例中所示 echo levenshtein('à', 'a'); 它返回“2”。在这种情况下,只需进行一次替换,因此我希望它返回“1” 我遗漏了什么吗?与许多PHP函数一样,默认的PHPlevenshtein(),不支持多字节。因此,当处理带有Unicode字符的字符串时,它分别处理每个字节并更改两个字节 没有多字节版本(即mb_levenshtein())

在PHP中,我使用函数Levenshtein()计算Levenshtein距离。对于简单字符,它的工作原理与预期相同,但对于变音字符,如示例中所示

echo levenshtein('à', 'a');
它返回“2”。在这种情况下,只需进行一次替换,因此我希望它返回“1”


我遗漏了什么吗?

与许多PHP函数一样,默认的PHP
levenshtein()
,不支持多字节。因此,当处理带有Unicode字符的字符串时,它分别处理每个字节并更改两个字节

没有多字节版本(即
mb_levenshtein()
),因此您有两个选项:

1) 使用
mb
函数自己重新实现该函数:


我认为发布此问题的答案可能会有用,因此如下所示:-

levenshtein函数分别处理输入字符串的每个字节。然后对于多字节编码,如UTF-8,它可能会给出误导性的结果

带有法语重音单词的示例: -levenshtein('notre','votre')=1 -levenshtein('notre','nôtre')=2(啊?!)

您可以很容易地找到levenshtein函数的多字节兼容PHP实现,但它当然比C实现慢得多

另一个选项是将字符串转换为单字节(无损)编码,以便它们可以为fast core levenshtein函数提供数据

下面是我在搜索引擎中使用的转换函数,该搜索引擎存储UTF-8字符串和一个快速基准测试。我希望这会有帮助

<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>

结果(约6000个电话) -参考时间核心C功能(单字节):30毫秒 -utf8到ext ascii转换+核心功能:90毫秒
完整的PHP实现:3000 ms

您可能需要一个多字节兼容的LevsTein实现,(第一个在谷歌上)解释得很好。我认为1个选项更适合我的情况,因为我还必须计算其他Unicode字符的LevsHeTin距离。但我对它做了一些修正:1<代码>如果($length1==0)返回$length2更改为
如果($length2==0)返回$length1(因为此时$length1将始终大于或等于$length2)。2.移动的
if($str1==$str2)返回0
来启动函数,因为它会立即给出结果。如果($length1==0)返回$length2,则不会
替换为
,因为此时如果
$length1
为0,则两个字符串都为空。事实上,如果您移动了
if($str1==$str2),则返回0
一开始,这个案例已经处理好了,所以毕竟第二次更改后,第一次是无关紧要的,整个
if($length1==0)返回0可以被抛出。
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>