Perl regex match output behaving really strangely
I am using Perl and a regular expression to parse entries in a (badly formatted) input text file. My code stores the contents of the input file in $genes, and I defined a regex with capture groups that stores the parts I care about in three variables: $number, $name, and $sequence (see the Script.pl snippet below). This all works well until I try to print the value of $sequence. I tried putting quotes around the values, and my output looks like this:
Number: '132'
Name: 'rps12 AmtrCp046'
'equence: 'ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA
Number: '134'
Name: 'psbA AmtrCp001'
'equence: 'ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA
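Note the missing "s" in "sequence", which has been replaced by a single quote, and note that the sequence itself is not wrapped in quotes as I expected. I don't understand why the print statement for $sequence behaves so strangely. I suspect something is wrong with my regex, but I have no idea what it might be. Any help would be greatly appreciated.
Script.pl snippet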
while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/g) {
    # Get the value of the first capture group in the matched string (the first parenthesized part)
    # ([0-9]+)
    $number = $1;
    # Get the value of the fourth capture group
    # ([A-Za-z0-9]*\s+[A-Za-z0-9]+)
    $name = $4;
    # Get the value of the fifth capture group
    # ([ACGT]+\s)
    $sequence = $5;
    print "Number: \'" . $number . "\'\n";
    print "Name: \'" . $name . "\'\n";
    print "sequence: \'" . $sequence . "\'\n";
    print "\n";
}
Input file snippet
>132 gnl|Ambtr|rps12 AmtrCp046
ATGAATCTCAATGACCAAGAATGGCAATTGAAGGAACTATAGAATAGGAATCCAAA
AATCTGATTTTAGAATTCATTCATTCATTCAATAACATTCGTGGAATACGATTCATTCATTT
CAAGATGCCTTGGTGAATGGTAGACACGGACTCAAATCGTGCTAAAGAGAGCTGGAGTC
GAGTCTCTCTCAAGCATGAAGATGATGCTCATGATGAGCAATCAATACAGAGAGATCTCGATCT
AATCGATTGGCAAGTTCATAGAAGTATTCGGCGATCCCCACAGATCCGAGGTCAGCTGTTGTTTG
ATTTAGTTAGTTAACCA
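It looks like the input file uses CR+LF line endings. The regex stores the CR in $sequence (because the \s is inside the capturing parentheses). When the value is printed, the CR moves the cursor back to the start of the line, and the closing quote is then printed there, overwriting the "s" in "sequence".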
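One way to check this diagnosis is to make the invisible characters visible before printing. The following is a minimal sketch, assuming a shortened, made-up record with CR+LF line endings; the helper name show_invisible and the sample data are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical one-record sample with CR+LF line endings (not the real input file).
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCAATGACC\r\n";

# Hypothetical helper: replace CR and LF with visible escape sequences.
sub show_invisible {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    return $text;
}

# Original pattern: the trailing \s sits inside the fifth capture group.
my $sequence = '';
if ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/) {
    $sequence = $5;
}

# The captured value ends in a carriage return, shown here as a literal \r.
print "sequence: '", show_invisible($sequence), "'\n";
```

If the diagnosis is right, the printed value ends in a visible \r just before the closing quote instead of jumping back to the start of the line.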
Solution: do not capture the trailing whitespace in the variable.
$genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g
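# The closing parenthesis now comes before the final \s, so the CR is still matched but no longer captured.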
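The corrected pattern can be tried on a small input. This is a sketch only, assuming a made-up two-record input with CR+LF line endings and single-line sequences; the sample data and the @records array are illustrative assumptions, not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical two-record input with CR+LF line endings.
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCAATGACC\r\n"
          . ">134 gnl|Ambtr|psbA AmtrCp001\r\nATGATCCCTACCTTAT\r\n";

my @records;    # illustrative container for the parsed fields

# The final \s must still match, but it now sits outside the capture group.
while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g) {
    my ($number, $name, $sequence) = ($1, $4, $5);
    push @records, { number => $number, name => $name, sequence => $sequence };

    print "Number: '${number}'\n";
    print "Name: '${name}'\n";
    print "sequence: '${sequence}'\n";
    print "\n";
}
```

With the CR left out of the capture, the closing quote should appear at the end of each sequence line rather than at column one.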
Explanation:
Options: dot matches newline; case insensitive; ^ and $ match at line breaks
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 1 «([0-9]+)»
Match a single character in the range between “0” and “9” «[0-9]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “|” literally «\|»
Match the regular expression below and capture its match into backreference number 2 «([\w ]+)»
Match a single character present in the list below «[\w ]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A word character (letters, digits, and underscores) «\w»
The character “ ” « »
Match the regular expression below and capture its match into backreference number 3 «(.+)»
Match any single character «.+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
Ah! Thank you very much. I had a feeling I was overlooking something simple/stupid.