Php 变音字符的Levenshtein距离_Php_Levenshtein Distance

Php 变音字符的Levenshtein距离

php

Php 变音字符的Levenshtein距离,php,levenshtein-distance,Php,Levenshtein Distance,在PHP中，我使用函数Levenshtein（）计算Levenshtein距离。对于简单字符，它的工作原理与预期相同，但对于变音字符，如示例中所示 echo levenshtein('à', 'a'); 它返回“2”。在这种情况下，只需进行一次替换，因此我希望它返回“1” 我遗漏了什么吗？与许多PHP函数一样，默认的PHPlevenshtein（），不支持多字节。因此，当处理带有Unicode字符的字符串时，它分别处理每个字节并更改两个字节没有多字节版本（即mb_levenshtein（））

在PHP中，我使用函数Levenshtein（）计算Levenshtein距离。对于简单字符，它的工作原理与预期相同，但对于变音字符，如示例中所示

echo levenshtein('à', 'a');

它返回“2”。在这种情况下，只需进行一次替换，因此我希望它返回“1”

我遗漏了什么吗？

与许多PHP函数一样，默认的PHP

levenshtein（）

，不支持多字节。因此，当处理带有Unicode字符的字符串时，它分别处理每个字节并更改两个字节

没有多字节版本（即

mb_levenshtein（）

），因此您有两个选项：

1）使用

mb

函数自己重新实现该函数：

我认为发布此问题的答案可能会有用，因此如下所示：-
levenshtein函数分别处理输入字符串的每个字节。然后对于多字节编码，如UTF-8，它可能会给出误导性的结果
带有法语重音单词的示例：
-levenshtein（'notre'，'votre'）=1
-levenshtein（'notre'，'nôtre'）=2（啊？！）
您可以很容易地找到levenshtein函数的多字节兼容PHP实现，但它当然比C实现慢得多
另一个选项是将字符串转换为单字节（无损）编码，以便它们可以为fast core levenshtein函数提供数据
下面是我在搜索引擎中使用的转换函数，该搜索引擎存储UTF-8字符串和一个快速基准测试。我希望这会有帮助
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>



结果（约6000个电话）
-参考时间核心C功能（单字节）：30毫秒
-utf8到ext ascii转换+核心功能：90毫秒
完整的PHP实现：3000 ms 
您可能需要一个多字节兼容的LevsTein实现，（第一个在谷歌上）解释得很好。我认为1个选项更适合我的情况，因为我还必须计算其他Unicode字符的LevsHeTin距离。但我对它做了一些修正：1<代码>如果（$length1==0）返回$length2更改为如果（$length2==0）返回$length1（因为此时$length1将始终大于或等于$length2）。2.移动的if（$str1==$str2）返回0
来启动函数，因为它会立即给出结果。如果（$length1==0）返回$length2，则不会替换为
，因为此时如果$length1
为0，则两个字符串都为空。事实上，如果您移动了if（$str1==$str2），则返回0
一开始，这个案例已经处理好了，所以毕竟第二次更改后，第一次是无关紧要的，整个if（$length1==0）返回0可以被抛出。
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>