Finding n-grams in an array in PHP


php arrays


I have a PHP array:

$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);

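I want to find all the n-grams in these lines, to get a result like this:

cheap red apples: 3
red apples: 5
apples: 6

I tried imploding the array and then parsing the result, but that does not work: concatenating unrelated strings produces new n-grams that span the boundary between two lines and do not occur in any single line.

How should I proceed?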
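The boundary problem described in the question can be reproduced with a small sketch. It uses only two of the excerpts, and the probe phrase "apples cheap" is chosen here for illustration; it is not part of the original question.

```php
<?php
// Two of the excerpts from the question, joined the naive way.
$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
);

$joined = strtolower(implode(' ', $excerpts));

// "apples cheap" never occurs inside a single excerpt...
$inAnyExcerpt = false;
foreach ($excerpts as $excerpt) {
    if (strpos(strtolower($excerpt), 'apples cheap') !== false) {
        $inAnyExcerpt = true;
    }
}

// ...but it does occur in the joined string, across the line boundary.
$inJoined = strpos($joined, 'apples cheap') !== false;

var_dump($inAnyExcerpt);
var_dump($inJoined);
```

Any approach that joins the lines therefore needs a separator that is later treated as a hard boundary, or it has to process each line on its own.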

Assuming you only want to count occurrences of known strings:

$cheapRedAppleCount = 0;
$redAppleCount = 0;
$appleCount = 0;
for($i = 0; $i < count($excerpts); $i++)
{
    // PCRE patterns need delimiters; the i modifier also matches "Cheap".
    $cheapRedAppleCount += preg_match_all('/cheap red apples/i', $excerpts[$i]);
    $redAppleCount += preg_match_all('/red apples/i', $excerpts[$i]);
    $appleCount += preg_match_all('/apples/i', $excerpts[$i]);
}



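preg_match_all returns the number of matches in the given string, so you can add that number to the counter.
More information is available in the documentation for preg_match_all.
Apologies if I have misunderstood.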
Try this (using implode, since you mentioned that you had already attempted it):
mb_internal_encoding('UTF-8');

$joinedExcerpts = implode(".\n", $excerpts);
$sentences = preg_split('/[^\s|\pL]/umi', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

$wordsSequencesCount = array();
foreach($sentences as $sentence) {
    $words = array_map('mb_strtolower',
                       preg_split('/[^\pL+]/umi', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    foreach($words as $index => $word) {
        $wordsSequence = '';
        foreach(array_slice($words, $index) as $nextWord) {
            $wordsSequence .= $wordsSequence ? (' ' . $nextWord) : $nextWord;
            if( !isset($wordsSequencesCount[$wordsSequence]) ) {
                $wordsSequencesCount[$wordsSequence] = 0;
            }
            ++$wordsSequencesCount[$wordsSequence];
        }
    }
}

$ngramsCount = array_filter($wordsSequencesCount,
                            function($count) { return $count > 1; });

I assume you only want groups of words that are repeated.
The output of var_dump($ngramsCount) is:
array (size=11)
  'i' => int 3
  'i love' => int 2
  'love' => int 2
  'cheap' => int 3
  'cheap red' => int 3
  'cheap red apples' => int 3
  'red' => int 5
  'red apples' => int 5
  'apples' => int 6
  'are' => int 2
  'my' => int 2
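Because the loop above counts every contiguous word sequence, the number of array keys grows roughly quadratically with sentence length. One possible way to reduce memory is to cap the n-gram length. The following is a sketch under assumptions and is not part of the original answer: `countNgrams()` and `$maxN` are hypothetical names, and each array element is treated as one sentence instead of being split on punctuation.

```php
<?php
/**
 * Count contiguous word sequences of at most $maxN words per line.
 * countNgrams() and $maxN are illustrative names, not from the answer above.
 */
function countNgrams(array $lines, $maxN = 3)
{
    $counts = array();
    foreach ($lines as $line) {
        // Lower-case the line, then extract runs of letters as words.
        preg_match_all('/\pL+/u', mb_strtolower($line, 'UTF-8'), $matches);
        $words = $matches[0];
        $total = count($words);
        for ($start = 0; $start < $total; $start++) {
            $sequence = '';
            // Stop after $maxN words or at the end of the line.
            for ($len = 1; $len <= $maxN && $start + $len <= $total; $len++) {
                $word = $words[$start + $len - 1];
                $sequence = ($sequence === '') ? $word : $sequence . ' ' . $word;
                $counts[$sequence] = isset($counts[$sequence]) ? $counts[$sequence] + 1 : 1;
            }
        }
    }
    // Keep only sequences seen more than once.
    return array_filter($counts, function ($count) { return $count > 1; });
}

$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);

$ngrams = countNgrams($excerpts, 3);
print_r($ngrams);
```

With a cap of three words the repeated sequences for the sample data are unchanged, since no longer sequence repeats; a smaller cap trades completeness for fewer keys.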

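The code could be optimized, for example to use less memory.

The above is excellent. Since I am using it for French, I modified the regular expression as follows: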
$sentences = preg_split('/[^\s|\pL\'’-]/umi', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

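This way, we can analyse words that contain hyphens and apostrophes ("est-ce-que", "j'ai", etc.).

To proceed, I would look up n-gram algorithms and then decide which one is appropriate to implement for this data set.

Thanks for the suggestion. That is what I did, but I need a solution, or at least a concrete example, that produces the final output I gave.

I guess the OP wants to find all n-grams in any set of strings, not just these three n-grams in these particular strings.

I want to find groups of words without knowing them in advance, so unfortunately this does not meet my requirement. Thanks for your help anyway.

The point is that I want to find the groups of words without knowing them beforehand, whereas with your function I need to supply them up front. Thanks for your help.

Sorry, I misunderstood the question.

Should groups of words such as "i", "i love" and "are" be considered n-grams, and should groups that are not repeated ("do", "do you", etc.) be ignored?

This is perfect, exactly what I asked for. Thanks!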
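For the French variant mentioned in the comments above, the word-level split may also need to keep hyphens and apostrophes, otherwise "j'ai" is still broken into two words. A minimal sketch, assuming words are extracted by matching instead of splitting; the sample sentence and the pattern are illustrative and are not the commenter's code:

```php
<?php
// Hypothetical French sample sentence, used only for illustration.
$sentence = "Est-ce que j'ai raison ? Qu’est-ce que c’est ?";

// A word is a run of letters, optionally joined by hyphens or apostrophes
// (both the straight ' and the typographic ’). The hyphen is placed last
// in the character class so that it is read as a literal character.
preg_match_all(
    '/\pL+(?:[\'’-]\pL+)*/u',
    mb_strtolower($sentence, 'UTF-8'),
    $matches
);

print_r($matches[0]);
```

Matching whole words avoids having to list every separator character, and a hyphen or apostrophe is only kept when letters appear on both sides of it.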