Regex Perl: matching multiple capitalized words

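I'm writing a Perl program (script?) that reads through a text file, identifies all names, and categorizes them as person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp., where there are multiple capitalized words in a row. I've been using: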

/([A-Z][a-z]+)+/

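to capture as many capitalized words in a row as there are on a given line. From what I understand, the + will match one or more instances of such a pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeat the [A-Z][a-z]+ twice, but then it doesn't find patterns with more than two capitalized words in a row. What am I doing wrong?

PS Sorry if my use of vocabulary is off, I'm always so bad with that.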

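You're just missing the spacing between the words.

The following matches optional whitespace before each word (except the first), and therefore covers the cases you described: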

use strict;
use warnings;

while (<DATA>) {
    while (/(?=\w)((?:\s*[A-Z][a-z]+)+)/g) {
        print "$1\n";
    }
}

__DATA__
I'm doing a perl program (script?) that reads through a text file and identifies all names and categorizes them as either person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp. where there are multiple capitalized words in a row. I've been using:

to capture as many capitalized words in a row as there are on a given line. From what I understand the + will match 1 or more instances of such pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeate the [A-Z][a-z]+ twice but it doesn't find patterns with more than 2 capitalized words in a row. What am I doing wrong?

PS Sorry if my use of vocabulary is off I'm always so bad with that.
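
To see what the `(?=\w)` lookahead contributes, here is a small sketch that runs the pattern with and without it. The sub name `cap_runs` and the sample sentence are made up for illustration. Without the lookahead, `\s*` is free to swallow the whitespace in front of the first word, so the captured text starts with a space.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of capitalized words found by $re.
sub cap_runs {
    my ($re, $text) = @_;
    my @runs;
    while ($text =~ /$re/g) {
        push @runs, $1;
    }
    return @runs;
}

my $with    = qr/(?=\w)((?:\s*[A-Z][a-z]+)+)/;  # pattern from the answer
my $without = qr/((?:\s*[A-Z][a-z]+)+)/;        # same pattern, lookahead removed

my $text = 'flights to New York leave daily';

print "[$_]\n" for cap_runs($with, $text);     # [New York]
print "[$_]\n" for cap_runs($without, $text);  # [ New York]
```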

There is a module on CPAN that seems to do what you want. It might be worth a quick look.

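How

The pattern you provided in your question,

/([A-Z][a-z]+)+/

matches one or more capitalised words written back to back with nothing between them, like this: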

This
ThisAndThat

But it doesn't match this as a single unit:

Not This

It actually matches each of the words separately:

Not
This
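
A quick way to confirm this behaviour is to collect every match of the question's pattern. This is only a sketch; the sub name `all_matches` is illustrative.

```perl
use strict;
use warnings;

# Illustrative helper: every substring the question's pattern matches.
sub all_matches {
    my ($text) = @_;
    my @found;
    # $& holds the whole match; the capture group keeps only its last repetition.
    push @found, $& while $text =~ /([A-Z][a-z]+)+/g;
    return @found;
}

print join('|', all_matches('ThisAndThat')), "\n";  # ThisAndThat
print join('|', all_matches('Not This')), "\n";     # Not|This
```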

So let's amend the regex to

/(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/

Now, that's a bit of a mouthful, so let's break it down a piece at a time:

(?: ... )      Groups like this don't capture which is more efficient
[A-Z][a-z]+    Matches a capitalised word
\s*[A-Z][a-z]+ Matches a subsequent capitalised word, optionally starting with
               whitespace
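
The same breakdown can be kept next to the pattern itself with Perl's /x modifier, which ignores literal whitespace in the pattern and allows comments. A minimal sketch, where the variable name `$caps_run` and the sample text are illustrative:

```perl
use strict;
use warnings;

my $caps_run = qr/
    (?: [A-Z][a-z]+ )         # first capitalised word
    (?: \s* [A-Z][a-z]+ )*    # any further words, optional whitespace before each
/x;

my $text = 'shares of Pacific First Financial rose';
print "$&\n" if $text =~ $caps_run;   # Pacific First Financial
```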

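What (TL;DR)

Putting this all together, we now have a regex that matches a capitalised word followed by any subsequent capitalised words, with or without whitespace separating them. So it matches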

This
ThisAndThat
Not This

We can now abstract this regex a little to avoid repetition, and use it in code:

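use strict;
use warnings;
use feature 'say';

my $CAPS_WORD = qr/[A-Z][a-z]+/;
my $FULL_RE   = qr/(?:$CAPS_WORD)(?:\s*$CAPS_WORD)*/;

my $string = 'I moved to New York';    # sample input
say $& if $string =~ /$FULL_RE/;       # prints the first run: New York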
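
The snippet above prints only the first match. To collect every run on a line, as the question asks, one option is a list-context match with /g. This is a sketch; the sub name `find_names` and the sample line are assumptions.

```perl
use strict;
use warnings;

my $CAPS_WORD = qr/[A-Z][a-z]+/;
my $FULL_RE   = qr/(?:$CAPS_WORD)(?:\s*$CAPS_WORD)*/;

# Illustrative helper: all runs of capitalised words in a line.
sub find_names {
    my ($line) = @_;
    # $FULL_RE has no capture groups, so /g in list context returns whole matches.
    return $line =~ /$FULL_RE/g;
}

# Prints "New York" and "Pacific First Financial Corp" on separate lines.
print "$_\n" for find_names('offices in New York for Pacific First Financial Corp.');
```

Note that the trailing period of "Corp." is not part of the match, since the pattern only allows letters and whitespace.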

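Why

This answer provides an alternative to the already good answer given by @Miller. Both work fine, but this solution is considerably faster because it doesn't use a lookahead: roughly 7 times faster in the timings below.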

$ time ./bench-simple.pl
Running 100000 runs
800000 matches

real    0m2.869s
user    0m2.860s
sys     0m0.008s

$ time ./bench-lookahead.pl
Running 100000 runs
800000 matches

real    0m19.845s
user    0m19.831s
sys     0m0.012s
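
The two benchmark scripts are not shown in the answer. A comparable measurement could be sketched with the core Benchmark module; the sample text, the sub name `count_matches`, and the one-second run time are all assumptions, so the numbers it prints will differ from the timings above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample input: one sentence repeated to give the regexes some work.
my $text = join ' ',
    ('He flew from New York to Los Angeles for Pacific First Financial') x 50;

my $simple    = qr/(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/;
my $lookahead = qr/(?=\w)((?:\s*[A-Z][a-z]+)+)/;

# Count matches so both variants scan the same text in the same way.
sub count_matches {
    my ($re) = @_;
    my $n = 0;
    $n++ while $text =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    simple    => sub { count_matches($simple) },
    lookahead => sub { count_matches($lookahead) },
});
```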

First, you need to allow for the space and a possible next set of words, and then use * for zero or more times rather than +:

([A-Z][a-z]+(?: [A-Z][a-z]+)*)

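What does (?: mean?

It's a non-capturing group.

This could do with an explanation of the change to the RE. I think

/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/

is generally more understandable.

@PeterR Agreed, that is simpler. However, I arbitrarily decided that I wanted NewYork to pass as capitalised words too.

Ah yes, I hadn't noticed that feature. To extend the simpler regex above, just flip a + to a *:

/[A-Z][a-z]+(?:\s*[A-Z][a-z]+)*/

I was wondering what the relative performance of the two approaches was; the lookahead version looks to be about 5 times slower.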
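
The difference between the `\s+` and `\s*` variants discussed in these comments shows up on input with no separator, such as NewYork. A small sketch, where the sub name `first_run` and the variable names are illustrative:

```perl
use strict;
use warnings;

my $strict = qr/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/;  # words must be separated by whitespace
my $loose  = qr/[A-Z][a-z]+(?:\s*[A-Z][a-z]+)*/;  # whitespace between words is optional

# Illustrative helper: first run of capitalised words, or undef if there is none.
sub first_run {
    my ($re, $text) = @_;
    return $text =~ /($re)/ ? $1 : undef;
}

for my $text ('New York', 'NewYork') {
    printf "%-8s  strict=%s  loose=%s\n",
        $text, first_run($strict, $text), first_run($loose, $text);
}
# New York  strict=New York  loose=New York
# NewYork   strict=New  loose=NewYork
```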
