Perl: How to extract a string between brackets
I have a text file in the following format:
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
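All the words between "[[" and "]]" are a short description of the entry. I need to extract the whole description as one string, not word by word.
I found an answer to a similar question here:
but I could not understand it: "my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;"
Any approach that works will be accepted, but an explanation would help a lot, for example: what is (?0), and what is /xg?
The code could look like this: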
use warnings;
use strict;
my @subjects; # declaring a lexical variable to store all the subjects
my $pattern = qr/
\[ \[ # matching two `[` signs
\s* # ... and, if any, whitespace after them
([^]]+) # starting from the first non-whitespace symbol, capture all the non-']' symbols
]]
/x;
# main processing loop:
while (<DATA>) { # reading the source file line by line
if (/$pattern/) { # if line is matched by our pattern
push @subjects, $1; # ... push the captured group of symbols into our array
}
}
print $_, "\n" for @subjects; # print our array of subject line by line
__DATA__
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
my $text = "* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
";
# in list context, /g returns the captured group from every match
my @array = ($text =~ /\[\[([^\]]*)\]\]/g);
print join(",", @array);
# this prints " Virtualbox Guest Additions, Abiword Wordprocessor, Sylpheed E-Mail, Kupfer"
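As you can see from the commented pattern, the description translates quite naturally into a regex. The only thing you probably do not need is the /x regex modifier, which is what allowed me to comment the pattern heavily.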
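\[ is a literal [,
\] is a literal ],
.* means any sequence of zero or more characters (other than a newline),
anything enclosed in parentheses is a capture group, so you can access it later in your script with $1 (or $2 .. $9, depending on how many groups you have).
Putting it all together, you match two [ characters, then everything up to the last occurrence of two consecutive ] characters.
Update
On rereading your question I am suddenly unsure: do you need the content between [[ and ]], or the whole line? If you need the whole line, you can leave out the parentheses entirely and just test whether the pattern matches; there is no need to capture.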
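The "last occurrence" behaviour only matters when a line holds more than one entry. Here is a minimal sketch of the difference, assuming a made-up line with two entries (the sample file has only one per line) and assuming the pattern being described is /\[\[(.*)\]\]/:

```perl
use strict;
use warnings;

# Hypothetical line with two entries; not taken from the sample file.
my $line = '* [[ Kupfer]] and [[ Abiword Wordprocessor]]';

# Greedy .* runs on to the LAST "]]" on the line.
my ($greedy) = $line =~ /\[\[(.*)\]\]/;

# A negated character class stops at the first "]".
my ($first) = $line =~ /\[\[([^\]]*)\]\]/;

print "greedy: '$greedy'\n";
print "first:  '$first'\n";
```

With one entry per line, as in the question, both patterns capture the same text.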
my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;
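The "x" flag means whitespace in the regex is ignored, which allows more readable expressions. The "g" flag means the result is a list of all matches, from left to right (match *g*lobally).
(?0) recurses into the whole pattern, which here is the same as the regex inside the first set of parentheses because they enclose everything. This makes it a recursive regex, equivalent to a set of rules such as:
E := '{' ( NoBrace | E )* '}'
NoBrace := [^{}]*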
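To see the recursion at work, the quoted pattern can be run against a string with nested braces. This is only a sketch: the input string is invented for illustration, and recursive patterns need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical input containing nested braces.
my $str = 'a {b {c} d} e {f}';

# (?0) re-enters the whole pattern, so each match is one balanced {...} group.
# In list context with /g, the outer capture group of every match is returned.
my @groups = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;

print "$_\n" for @groups;
```

Each element of @groups is a complete top-level group; the inner {c} is consumed by the recursion and not returned separately.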
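The answer you found uses recursive pattern matching, which I do not think you need here.
- /x allows insignificant whitespace and comments in the regexp
- /g applies the regexp across the whole string; without it, matching stops at the first match
- /xg is simply /x and /g combined
- (?0) runs the whole regexp again at that point (recursion)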
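The effect of the two flags can be checked on a couple of lines from the question. This is a rough sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Two lines taken from the sample data in the question.
my $text = "* [[ Sylpheed E-Mail]] (2010/03/30 21:49)\n"
         . "* [[ Kupfer]] (2010/05/16 20:18)\n";

# /x: the spaces inside the pattern are ignored and serve readability only.
my $re = qr/ \[\[ \s* ( [^\]]+ ) \]\] /x;

my @first = $text =~ $re;     # without /g: only the first match is captured
my @all   = $text =~ /$re/g;  # with /g: the capture from every match

print "first only: @first\n";
print "all: ", join(", ", @all), "\n";
```

Because of the \s* before the group, the captured descriptions do not carry the leading space.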
If the text will never contain ], you can simply use the following, as suggested earlier:
/\[\[ ( [^\]]* ) \]\]/x
The following allows ] in the contained text, but I recommend against incorporating it into a larger pattern:
/\[\[ ( .*? ) \]\]/x
The following also allows ] in the contained text, and is the most robust solution:
/\[\[ ( (?:(?!\]\]).)* ) \]\]/x
For example,
if (my ($match) = $line =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/x) {
print "$match\n";
}
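The three patterns only behave differently when the description itself contains a ]. A small sketch, assuming an invented entry with a bracketed word in it (none of the sample lines has one; the // operator needs Perl 5.10 or later):

```perl
use strict;
use warnings;

# Hypothetical entry whose description contains a single "]".
my $line = '* [[ Notes [draft] v2]] (2012/01/01 10:00)';

# Negated class: cannot get past the lone "]", so there is no match at all.
my ($negated) = $line =~ /\[\[ ( [^\]]* ) \]\]/x;

# Lazy quantifier: expands until the first "]]".
my ($lazy) = $line =~ /\[\[ ( .*? ) \]\]/x;

# Negative lookahead: consumes one character at a time unless "]]" starts there.
my ($guarded) = $line =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/x;

print "negated: ", $negated // '(no match)', "\n";
print "lazy:    '$lazy'\n";
print "guarded: '$guarded'\n";
```

On this input the lazy and lookahead versions agree; they can diverge once the pattern is embedded in a larger one, because backtracking may push .*? past a "]]" while the lookahead version never allows that.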
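/x: Ignore whitespace in the pattern. This lets you add spaces to make the pattern readable without changing its meaning. Documented in perlre.
/g: Find all matches. Documented in perlop.
(?0): Used to make the pattern recursive, since the linked answer had to handle arbitrary nesting of curly braces. Documented in perlre.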
> cat temp
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
>
> perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp
Virtualbox Guest Additions
Abiword Wordprocessor
Sylpheed E-Mail
Kupfer
>
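- s/.*\[\[(.*)\]\].*/\1/g
- .*\[\[ -> matches any characters up to and including [[
- (.*)\]\] -> captures everything after [[ up to ]] into \1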
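The same substitution can be sketched inside a script rather than as a one-liner. This version writes $1 instead of \1 in the replacement, which is the form Perl prefers there, and it works on a copy so that the original line is left untouched; the variable names are illustrative.

```perl
use strict;
use warnings;

# One line from the sample data, including its trailing newline.
my $line = "* [[ Kupfer]] (2010/05/16 20:18)\n";

# Copy the line, then replace everything on it with the captured description.
# "." does not match "\n", so the newline survives the substitution.
(my $desc = $line) =~ s/.*\[\[(.*)\]\].*/$1/;

print $desc;
```

The captured text keeps the space that follows [[ in the input; adding \s* before the group would drop it.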