PHP: removing compound words


php regex


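For example, I have a list of words, some of which are compound words:

palanca
plato
platopalanca

I need to remove "plato" and "palanca" and keep only "platopalanca". I used array_unique to remove duplicates, but these compound words are trickier.

Should I sort the list by word length and compare the words one by one? Or is a regular expression the answer?

Update: the real word list is much bigger and mixed; it does not contain only related words.

Update 2: I can safely implode the array into a string.

Update 3: I'm trying to avoid doing this as if it were a bubble sort. There must be a more efficient way to do this.

I think a bubble-sort-like approach is the only workable one :-( I don't like it, but it's what I have... Is there a better way?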

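// Despite its name, this comparator sorts shortest first,
// which is the order the loop below relies on.
function sortByLengthDesc($a,$b){
    return strlen($a)-strlen($b);
}

usort($words,'sortByLengthDesc');
$count = count($words);
$delete = array();
for($i=0;$i<$count;$i++) {
    for($j=$i+1;$j<$count;$j++) {
        // strpos() !== false: $words[$i] occurs somewhere inside $words[$j]
        if(strpos($words[$j], $words[$i]) !== false){
            $delete[]=$i;
            break; // one match is enough to delete $words[$i]
        }
    }
}
foreach($delete as $i) {
    unset($words[$i]);
}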
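
Since Update 2 says the array can safely be imploded into a string, one possible variation is to join the list once and count how often each word occurs in the joined string. The following is only a sketch under assumptions: the function name `removeContainedWords` is made up for illustration, and the newline separator is assumed never to occur inside a word. It does the same amount of comparison work in principle, but the inner scan runs inside `substr_count()` rather than in a PHP loop.

```php
<?php
// Hypothetical helper: keep only the words that are not contained in any other word.
function removeContainedWords(array $words): array
{
    $words = array_values(array_unique($words));
    // Assumption: "\n" never appears inside a word, so a match cannot span two words.
    $haystack = implode("\n", $words);

    $kept = array();
    foreach ($words as $word) {
        // Every word matches itself once; a second match means it is
        // also part of some longer word in the list.
        if ($word !== '' && substr_count($haystack, $word) === 1) {
            $kept[] = $word;
        }
    }
    return $kept;
}
```

Called with `array('palanca', 'plato', 'platopalanca')`, this keeps only `platopalanca`.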

You can take each word and check whether any other word in the array starts or ends with it. If so, that word should be removed (unset()).
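
A minimal sketch of that starts-with / ends-with check might look like the following. The function name `dropPrefixAndSuffixWords` is hypothetical, and the sketch builds a new array instead of calling unset() so that the loop never reads an element it has already removed.

```php
<?php
// Hypothetical helper: drop every word that some longer word starts or ends with.
function dropPrefixAndSuffixWords(array $words): array
{
    $words = array_values(array_unique($words));
    $kept = array();
    foreach ($words as $candidate) {
        $len = strlen($candidate);
        $isPart = false;
        foreach ($words as $other) {
            // Only strictly longer words can contain the candidate as a proper part.
            if ($len === 0 || strlen($other) <= $len) {
                continue;
            }
            $startsWith = strncmp($other, $candidate, $len) === 0;
            $endsWith = substr($other, -$len) === $candidate;
            if ($startsWith || $endsWith) {
                $isPart = true;
                break;
            }
        }
        if (!$isPart) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}
```

This is still a nested loop over the list, so it illustrates the rule rather than improving on the cost of the approach in the question.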
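A regular expression could work. You can anchor the start and the end of the string within the regex:
^ marks the start
$ marks the end
Something like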
foreach($array as $value)
{
    //$term is the value that you want to remove
    //preg_quote() escapes any regex metacharacters that $term may contain
    if(preg_match('/^' . preg_quote($term, '/') . '$/', $value))
    {
        //Here you can be confident that $term is $value, and then either remove it from
        //$array, or you can add all not-matched values to a new result array
    }
}

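would avoid your problem.
But if you are only checking whether two values are equal, == will work just as well as preg_match (and is probably faster).
If the lists of $terms and $values are huge, this won't be the most efficient strategy, but it is a simple solution.
If performance is an issue, sorting the lists (note the sort function provided) and then walking down the two lists side by side could be more useful. I'm going to actually test that idea before I post code here.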
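
The sort-then-walk idea can be sketched roughly as follows. This is an illustration only, not the tested code the answer promises: the function name `removeTermsSorted` is hypothetical, and both arrays are assumed to contain only strings. It handles exact matches, which is the case where == and the anchored regex agree.

```php
<?php
// Hypothetical helper: remove from $values every string that also appears in $terms,
// by sorting both lists and advancing through them side by side.
function removeTermsSorted(array $values, array $terms): array
{
    sort($values, SORT_STRING);
    sort($terms, SORT_STRING);

    $result = array();
    $t = 0;
    $termCount = count($terms);
    foreach ($values as $value) {
        // Skip the terms that sort before the current value.
        while ($t < $termCount && strcmp($terms[$t], $value) < 0) {
            $t++;
        }
        // An exact match means the value should be removed.
        if ($t < $termCount && $terms[$t] === $value) {
            continue;
        }
        $result[] = $value;
    }
    return $result;
}
```

After the two sorts, each list is traversed only once, so the comparison step is linear in the combined length of the lists. The result comes back in sorted order rather than in the original order.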
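You could put the words into an array, sort the array alphabetically, and then loop through it, checking whether the next word starts with the word at the current index and is therefore a compound word. If it does, you can remove the word at the current index and also the latter part of the next word.
Something like this: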
$array = array('palanca', 'plato', 'platopalanca');
// ok, the example array is already sorted alphabetically, but anyway...
sort($array);

// another array for words to be removed
$removearray = array();
$count = count($array);

// loop through the array, the last index won't have to be checked
for ($i = 0; $i < $count - 1; $i++) {

  $current = $array[$i];

  // use another loop in case more than one combined word starts with $current
  // if the words are case sensitive, use strpos() instead to compare
  for ($j = $i + 1; $j < $count && stripos($array[$j], $current) === 0; $j++) {
    // the next word starts with the current one, so remove current
    $removearray[] = $current;
    // get the other word to remove (the part after $current)
    $removearray[] = substr($array[$j], strlen($current));
  }

}

// now just get rid of the words to be removed
// array_diff() keeps only the words that are not in $removearray
$result = array_values(array_diff($array, $removearray));

I think you need to define the problem further so that we can give a solid answer. Here are some pathological lists. Which items should be removed?

hotdog, hotdogstand
hotdog, hotdogstand, hotdogstand
hotdog, stand, hotdogstand
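
To make the question concrete, here is one possible strict definition, offered as an assumption rather than as what the asker intends: a word counts as a compound only if it can be split completely into two or more other words from the list. The function name `isCompound` and the `$lookup` parameter (the word list flipped with `array_flip()` so that words are keys) are hypothetical.

```php
<?php
// Hypothetical helper: true when $word is a concatenation of two or more list words.
// $lookup is the word list flipped with array_flip(), so words are array keys.
function isCompound(string $word, array $lookup): bool
{
    $n = strlen($word);
    // $reachable[$k] is true when the first $k bytes split cleanly into list words.
    $reachable = array_fill(0, $n + 1, false);
    $reachable[0] = true;

    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            $part = substr($word, $start, $end - $start);
            // The whole word does not count as one of its own parts.
            if ($reachable[$start] && $part !== $word && isset($lookup[$part])) {
                $reachable[$end] = true;
                break;
            }
        }
    }
    return $n > 0 && $reachable[$n];
}
```

Under this definition, `hotdogstand` is a compound in the list hot, dog, stand, hotdogstand, but not in the list hot, dog, hotdogstand, because "stand" is missing. The same rule would also flag `carpet` in a list containing car and pet, so the definition has to be chosen with the actual data in mind.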

Some code
This code should be more efficient than what you have:
$words = array('hatstand','hat','stand','hot','dog','cat','hotdogstand','catbasket');

$count = count($words);

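for ($i=0; $i<$count; $i++) {
    if (isset($words[$i])) {
        $len_i = strlen($words[$i]);
        for ($j=$i+1; $j<$count; $j++) {
            if (isset($words[$j])) {
                $len_j = strlen($words[$j]);

                if ($len_i<=$len_j) {
                    // $words[$i] is a prefix of $words[$j]: drop it and stop,
                    // because $words[$i] no longer exists for further comparisons
                    if (substr($words[$j],0,$len_i)===$words[$i]) {
                        unset($words[$i]);
                        break;
                    }
                } else {
                    // $words[$j] is a prefix of the longer $words[$i]
                    if (substr($words[$i],0,$len_j)===$words[$j]) {
                        unset($words[$j]);
                    }
                }
            }
        }
    }
}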

foreach ($words as $word) {
    echo "$word<br>";
}

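So a word is a compound word if its component parts are also in the list. What about car, pet, and carpet?
Yes. These are names of parts, so I don't have the "carpet" problem :-)
Could the array contain an entry such as "other", and if so, what action would you take?
If "other" isn't in the list, then there's no problem...
That's clearer. What happens with gran, grand, grandstand?
I have already accounted for plural forms. I'm updating my question. You made me realize I was taking the wrong approach. +1.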