Perl paragraph n-gram

Suppose I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

I want to turn this sentence into a hash of "keywords", but I want to allow multi-word keywords. I have the following to get single-word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

Once that has run, I have a hash that looks like this:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

Taking it a step further, I can get two-word keywords like this:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every "other" pair, because matches found with /g do not overlap. The result looks like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that start one word later:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

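Use the pos operator.

pos SCALAR returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

There is also the special array @- (@LAST_MATCH_START): $-[0] is the offset of the start of the last successful match, and $-[n] is the offset of the start of the substring matched by the n-th subpattern, or undef if the subpattern did not match.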

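For example, the program below captures the second word of each pair in its own group and rewinds the match position, so that the second word becomes the first word of the next pair: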
#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:
'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
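
To make the roles of pos and @- more concrete, here is a minimal sketch that runs a single match and prints the offsets involved. The shortened sample string and the variable names are illustrative assumptions, not part of the answer's program.

```perl
use strict;
use warnings;

my $body = 'the quick brown fox';

# First /g match in scalar context: $1 is the whole pair, $2 its second word.
$body =~ /(\w+ (\w+))/g or die "no match";
my $end_of_match = pos($body);   # offset just past "the quick"
my $match_start  = $-[0];        # where the whole match began
my $group2_start = $-[2];        # where "quick" began

# Rewind so the next /g match starts at the second word of the pair.
pos($body) = $group2_start;
$body =~ /(\w+ \w+)/g or die "no second match";
my $next_pair = $1;

print "end=$end_of_match start=$match_start group2=$group2_start next='$next_pair'\n";
# prints: end=9 start=0 group2=4 next='quick brown'
```

Without the assignment to pos, the second match would resume at offset 9 and return 'brown fox' instead, which is the "every other pair" behaviour described in the question.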

I would use a look-ahead to collect everything but the first word. That way, the position advances correctly on its own:
my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

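If you want to use a single space instead of \s+ (don't forget to remove the /x modifier if you do), you can simplify this a bit, because you can collect any number of words in $2 instead of using one group per word.

Although coding the described task by hand might be interesting, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to the other n-gram module) can handle word-based n-gram analysis.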
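
Returning to the single-space simplification from the look-ahead answer: a minimal sketch of what it might look like is shown here, assuming the words are separated by exactly one space. The trigram line collects both trailing words in $2 rather than using a third group.

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %words;

# Single words.
++$words{$1}      while $body =~ m/(\w+)/g;
# Pairs: consume the first word and the space, look ahead for one more word.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+))/g;
# Triples: same shape, but $2 now holds the two following words.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+ \w+))/g;

print "'$_' => $words{$_}\n" for sort keys %words;
```

Because only the first word and its trailing space are consumed, every match ends at the start of the next word, so no phrase is skipped.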
You can do something a little funky with a look-ahead.
If I do this:
$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

This says: look ahead for two words (and capture them), but consume only one.
I get:
%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

It seems I can generalize this by passing in a variable for the count:
my $n    = 4;   # number of words after the first, so each phrase has $n + 1 words
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
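
One way to make the count less surprising is to wrap the pattern in a small function that takes the phrase length directly. The sketch below is only an illustration: the helper name ngram_counts and its interface are assumptions, and it expects words separated by single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of exactly $n space-separated words.
sub ngram_counts {
    my ($text, $n) = @_;
    my $extra = $n - 1;    # words that follow the first one
    my %counts;
    # Look ahead to capture $n words, but consume only the first of them.
    $counts{$_}++ for $text =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
    return \%counts;
}

my $body     = 'the quick brown fox jumps over the lazy dog';
my $trigrams = ngram_counts($body, 3);
print "'$_' => $trigrams->{$_}\n" for sort keys %$trigrams;
```

Calling the helper once per length (1, 2, 3, ...) and merging the returned hashes would give the combined keyword hash the question asks for.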

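Is there a particular reason to do this with regular expressions alone? To me, the obvious approach is to split the text into an array and then use a pair of nested loops to pull the counts out of it. Something along the lines of: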
#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:
$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Following cjm's comment, fixed the $phrase_len loop to prevent negative indexes from producing wrong results.
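
To illustrate why negative indexes matter here: Perl array slices treat a negative index as counting from the end of the array, so a range that dips below zero wraps around instead of stopping at the first word. The sketch below assumes the pre-edit loop produced such a range near the start of the text; the earlier version of the code is not shown in this thread, so that is a guess.

```perl
use strict;
use warnings;

my @words = split / /, 'the quick brown fox jumps over the lazy dog';

# At position 0 an unclamped phrase length of 2 asks for indexes -2 .. 0.
# Negative indexes count from the end, so the slice wraps around the array.
my $wrapped = join ' ', @words[-2 .. 0];

# Clamping the start index at 0 keeps the slice inside the text.
my $pos     = 0;
my $start   = $pos - 2 < 0 ? 0 : $pos - 2;
my $clamped = join ' ', @words[$start .. $pos];

print "wrapped: '$wrapped'\n";   # wrapped: 'lazy dog the'
print "clamped: '$clamped'\n";   # clamped: 'the'
```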
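
+0.4999... The other 0.5 would be for a reference to the relevant documentation explaining how it works. :)

@Ether, he did link to the documentation. Stack Overflow just doesn't display links inside code text very noticeably.

This does not handle the edges of the array correctly. Note that your output includes phrases such as "dog the" and "lazy dog the", which do not actually appear in the text.

@cjm: Ack! I obviously didn't check the output carefully enough. Still, not bad for a two-minute proof of concept. I have corrected the $phrase_len loop to fix that.

Text::Ngrams did the trick perfectly. Being able to get n-grams of any size with minimal effort is really helpful.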