How to capture the complement of a grouped regular expression in Python
I want to use regular expressions (the re module in Python) to detect C-like multiline comments. So it should be able to find
/* this is my
first comment it also has a * in it.
Now I end my first comment */
int a = 3;
/* this is my second
multiline comment */
So I need to use re to find both of these multiline comments. What I want to do is
re.findall(r'exp',string)
What should exp be? I tried taking the complement of a group of characters, like r'\(\*[^(?:\*\)]*\*\)'
Basically, this groups *) and checks for its complement, but it does not work.
One possible approach:
import re
ccode = '''/* this is my
first comment it also has a * in it.
Now I end my first comment */
int a = 3;
/* this is my second
multiline comment */'''
for comment in re.findall('/[*].*?[*]/', ccode, re.DOTALL):
    print(comment)
This gives:
/* this is my
first comment it also has a * in it.
Now I end my first comment */
/* this is my second
multiline comment */
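On why the attempt in the question fails: inside [...] a construct such as (?:\*\) is not a group, so [^(?:\*\)] only means "any single character other than (, ?, :, * or )". A character class can complement single characters, not a sequence. A common way to say "anything, as long as the closing delimiter does not start here" is a negative lookahead. The following is a minimal sketch of that idea using the /* */ delimiters from the sample; the names pattern and comments are only illustrative.

```python
import re

ccode = '''/* this is my
first comment it also has a * in it.
Now I end my first comment */
int a = 3;
/* this is my second
multiline comment */'''

# (?:(?!\*/).)* consumes one character at a time, but only while the
# two-character sequence "*/" does not start at the current position.
# re.DOTALL lets "." match newlines, so a comment may span several lines.
pattern = re.compile(r'/\*(?:(?!\*/).)*\*/', re.DOTALL)

comments = pattern.findall(ccode)
for comment in comments:
    print(comment)
```

For this particular job the non-greedy .*? shown above finds the same comments and is simpler; the lookahead form is mainly useful when the "complement of a sequence" has to appear in the middle of a larger pattern.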
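That said, if you are building a parser, it is better to extract tokens in a lexer first and to define comments as a multi-token construct in the parser.
This cannot be done with regular expressions alone. You would create a state machine, because you have to distinguish cases such as:
- trigraphs
- continued lines
- /* may be found inside a string, in which case it does not start a comment
- conversely, if the opening /* is outside a string, there may be a */ inside a string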
use feature 'state'; # the "state" variables below need Perl 5.10 or later
######################################################################################
#### Before going any further perform all 4 stages of preprocessing
#### described here http://gcc.gnu.org/onlinedocs/cpp/Initial-processing.html
############################# 1 - break file into lines ##############################
open( FILE, '<', $file ) or die "file [$file] was not found\n"; # parentheses needed: "or" must apply to open, not to $file
my @lines = <FILE>; # deletes \r from every line(\n stays on place)
close FILE;
################################ 2 - handle trigraphs ################################
foreach ( @lines )
{
s!\Q??=\E!#!g; #??= becomes #
s#\Q??/\E#\\#g; #??/ becomes \
s#\Q??'\E#^#g; #??' becomes ^
s#\Q??(\E#[#g; #??( becomes [
s#\Q??)\E#]#g; #??) becomes ]
s#\Q??!\E#|#g; #??! becomes |
s#\Q??<\E#{#g; #??< becomes {
s#\Q??>\E#}#g; #??> becomes }
s#\Q??-\E#~#g; #??- becomes ~
}
################################ 3 - merge continued lines ###########################
# everything in C/C++ may be spanned across many lines so we must merge continued
# lines to handle things correctly
# we do not delete lines that are merged with the preceding line - we just leave an
# empty line to preserve the overall location of everything, which will be needed later
# to properly report line numbers if we find something that we are interested in
for (my $i = 0; $i <= $#lines; $i++ )
{
# shows where continued line started ie. where to append following continued line(s)
state $appendHere; # acts also as an "append indicator"
my $continuedLine;
# theoretically continued line ends with \ but preprocessors accept \ followed by
# one or more whitespaces too so we accept it as well
if ( $lines[$i] =~ m#\\[ \t\v\f]*$# ) # merge with next line / continued line ?
{
$lines[$i] =~ s#\\[ \t\v\f]*$##; # delete \ with trailing whitespaces if any
$continuedLine = 1;
}
else
{
$continuedLine = 0;
}
if ( !defined $appendHere )
{
if ( $continuedLine == 1 )
{
# we will append continued lines to $lines[$appendHere]
$appendHere = $i;
}
}
else
{
chomp $lines[$appendHere]; # get rid of \n before appending next
chomp $lines[$i]; # get rid of \n before appending next
$lines[$appendHere] .= "$lines[$i]\n"; # append current line to previously marked location
$lines[$i] = "\n"; # leave only \n in the current line since we want to preserve line numbers
if ( $continuedLine == 0 ) # merge next line too?
{
$appendHere = undef;
}
}
}
#printFileFormatted();
######################## 4 - handle comments and strings ######################################
# similarly substituting a comment body with a single space may spoil our line numbers so
# we are just replacing comments with spaces preserving newlines where necessary
my $state = "out";
my $error;
my $COMMENT_SUBST = ' '; #'@';
my $STRING_SUBST = ' '; #'%';
ERROR: for ( my $line = 0; $line <= $#lines; $line++ )
{
state $hexVal = 0;
state $octVal = 0;
state $string = "";
my @chars = split //, $lines[$line];
my $newLine = "";
for ( my $i = 0; $i <= $#chars; $i++ )
{
my $c = $chars[$i];
if ( $state eq 'out' ) # ----------------------------------------------
{
if ( $c eq '/' )
{
$state = 'comment?';
$newLine .= $c;
}
elsif ( $c eq '"' )
{
$state = 'string char';
$newLine .= $STRING_SUBST;
}
else
{
$newLine .= $c;
}
}
elsif ( $state eq 'comment?' ) # ----------------------------------------------
{
if ( $c eq '/' )
{
$state = '//comment';
chop $newLine;
$newLine .= $COMMENT_SUBST x 2;
}
elsif ( $c eq '*' )
{
$state = '/*comment';
chop $newLine;
$newLine .= $COMMENT_SUBST x 2;
}
else
{
$state = 'out';
$newLine .= $c;
}
}
elsif ( $state eq '//comment' ) # ----------------------------------------------
{
if ( $c eq "\n" )
{
$state = 'out';
$newLine .= $c;
}
else
{
$newLine .= $COMMENT_SUBST;
}
}
elsif ( $state eq '/*comment' ) # ----------------------------------------------
{
if ( $c eq '*' )
{
$state = '/*comment end?';
$newLine .= $COMMENT_SUBST;
}
elsif ( $c eq "\n" )
{
$newLine .= $c;
}
else
{
$newLine .= $COMMENT_SUBST;
}
}
elsif ( $state eq '/*comment end?' ) # ----------------------------------------------
{
if ( $c eq '*' )
{
$newLine .= $COMMENT_SUBST;
}
elsif ( $c eq "\n" )
{
$newLine .= $c;
}
elsif ( $c eq '/' )
{
$state = 'out';
$newLine .= $COMMENT_SUBST;
}
else
{
$state = '/*comment';
$newLine .= $COMMENT_SUBST;
}
}
elsif ( $state eq 'string char' ) # ----------------------------------------------
{
# theoretically ignore "everything" within a string
# which may look like "abc\\" = abc\ or "abc\"" = abc"
# "abc\" - wrong - no end of string, "abc\\\" wrong again
# in order to detect if particular " terminates a string we have to check the whole string
# since it cannot be determined just by checking what the previous character was hence
# that state machine was created
if ( $c eq '"' )
{
$state = 'out';
$newLine .= $STRING_SUBST;
}
elsif ( $c eq "\\" )
{
$state = 'string esc seq';
$newLine .= $STRING_SUBST;
}
elsif ( $c eq "\n" )
{
$error = "line [".($line+1)."] - error - a newline within a string\n";
last ERROR;
}
else
{
$newLine .= $STRING_SUBST;
}
}
elsif ( $state eq 'string esc seq' ) # ----------------------------------------------
{
# simple esc seq \' \" \? \\ \a \b \f \n \r \t \v
# oct num \o \oo \ooo no more than 3 oct digits (o=[0-7]{1,3}) and the value must not exceed 255
# hex num \xh \xhh \xhhh..... unlimited number of hex digits (h=[0-9a-fA-F]+) and the value must not exceed 255
# in any other esc seq \ will be ignored hence \u=u \p=p \k=k etc
if ( $c =~ m#^['"\?\\abfnrtv]$# )
{
$state = 'string char';
$newLine .= $STRING_SUBST x 2;
}
elsif ( $c eq 'x' )
{
$state = 'string hex marker';
$newLine .= $STRING_SUBST;
}
elsif ( $c =~ m#^[0-7]$#)
{
$state = 'string oct';
$octVal = oct($c);
$newLine .= $STRING_SUBST;
}
elsif ( $c eq "\n" )
{
$error = "line [".($line+1)."] - error - a newline within a string\n";
last ERROR;
}
else # other esc sequences are ignored - usually a warning is issued
{
$state = 'string char';
$newLine .= $STRING_SUBST x 2;
}
}
elsif ( $state eq 'string hex marker' ) # ----------------------------------------------
{
if ( $c =~ m#^[0-9a-fA-F]$# )
{
$state = 'string hex';
$hexVal = hex($c);
$newLine .= $STRING_SUBST;
}
else
{
$error = "line [".($line+1)."] - error - hex escape sequence not finished\n";
last ERROR;
}
}
elsif ( $state eq 'string hex' ) # ----------------------------------------------
{
if ( $c =~ m#^[0-9a-fA-F]$# )
{
$hexVal <<= 4;
$hexVal += hex($c);
# treat as regular 8bit character sequence - no fancy long chars etc
if ( $hexVal > 255 )
{
$error = "line [".($line+1)."] - error - hex escape sequence too big for a character\n";
last ERROR;
}
$newLine .= $STRING_SUBST;
}
elsif ( $c eq '"' )
{
$state = 'out';
$newLine .= $STRING_SUBST;
$hexVal = 0;
}
elsif ( $c eq "\n" )
{
$error = "line [".($line+1)."] - error - a newline within a string\n";
last ERROR;
}
else
{
$state = 'string char';
$newLine .= $STRING_SUBST;
$hexVal = 0;
}
}
elsif ( $state eq 'string oct' ) # ----------------------------------------------
{
if ( $c =~ m#^[0-7]$# )
{
$octVal <<= 3;
$octVal += oct($c);
# treat as regular 8bit character sequence - no fancy long chars etc
if ( $octVal > 255 )
{
$error = "line [".($line+1)."] - error - oct esc sequence too big for a character\n";
last ERROR;
}
$newLine .= $STRING_SUBST;
}
elsif ( $c eq "\n" )
{
$error = "line [".($line+1)."] - error - a newline within a string\n";
last ERROR;
}
elsif ( $c eq '"' )
{
$state = 'out';
$newLine .= $STRING_SUBST;
$octVal = 0;
}
else
{
$state = 'string char';
$newLine .= $STRING_SUBST;
$octVal = 0;
}
}
else
{
$error = "line [".($line+1)."] - error - state machine problem - unknown state\n";
last ERROR;
}
}#for ( my $i = 0; $i <= $#chars; $i++ )
$lines[ $line ] = $newLine;
}#for ( my $line = 0; $line <= $#lines; $line++ )
if ( $error ) # errors detected within state machine?
{
print "$error";
exit(1);
}
else # EOF met - check the state
{
if ( $state eq 'out' )
{
# ok no problem
}
elsif ( $state eq 'comment?' )
{
# ok no problem - may be a division after all - not a preproc problem
}
elsif ( $state eq '//comment' )
{
# ok no problem
}
elsif ( $state eq '/*comment' )
{
print "EOF reached within /* */ comment\n";
exit(1);
}
elsif ( $state eq '/*comment end?' )
{
print "EOF reached within /* */ comment\n";
exit(1);
}
elsif ( $state eq 'string char' )
{
print "EOF reached within string\n";
exit(1);
}
elsif ( $state eq 'string esc seq' )
{
print "EOF reached within string\n";
exit(1);
}
elsif ( $state eq 'string hex marker' )
{
print "EOF reached within string\n";
exit(1);
}
elsif ( $state eq 'string hex' )
{
print "EOF reached within string\n";
exit(1);
}
elsif ( $state eq 'string oct' )
{
print "EOF reached within string\n";
exit(1);
}
else
{
print "EOF reached and state machine is in unknown state\n";
exit(1);
}
}
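Since the question is about Python, the core of the state machine above can be sketched there too. This is a deliberately reduced version and not a port of the Perl code: it only tracks three states (outside, inside a double-quoted string, inside a /* */ comment) and ignores trigraphs, continued lines, // comments and character literals. The function name find_block_comments is made up for this sketch.

```python
def find_block_comments(source):
    """Return the /* ... */ comments in source, skipping string literals."""
    comments = []
    state = 'out'   # one of 'out', 'string', 'comment'
    start = 0       # index where the current comment began
    i = 0
    while i < len(source):
        c = source[i]
        pair = source[i:i + 2]
        if state == 'out':
            if pair == '/*':
                state, start = 'comment', i
                i += 2          # skip both characters so "/*/" is not a full comment
                continue
            if c == '"':
                state = 'string'
        elif state == 'string':
            if c == '\\':
                i += 2          # skip the escaped character, e.g. \" or \\
                continue
            if c == '"':
                state = 'out'
        elif state == 'comment':
            if pair == '*/':
                comments.append(source[start:i + 2])
                state = 'out'
                i += 2
                continue
        i += 1
    return comments


code = 'char *s = "/* not a comment */"; /* real one */ int a = 3;'
print(find_block_comments(code))
```

The /* inside the string literal is skipped because the machine is in the 'string' state when it sees it, which is exactly the distinction a single regular expression for comments cannot make on its own.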
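If you are writing a tokenizer and you will be checking for strings, so that your pattern is never applied to a comment marker inside a string, then this pattern will work for you: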
"(/[*][\s\S]*?[*]/)"
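One way such a tokenizer might combine the two checks is a single alternation in which string literals are listed before comments. The sketch below assumes only double-quoted strings with backslash escapes; the group names STRING and COMMENT and the helper block_comments are invented for illustration, and character literals and // comments are not handled.

```python
import re

# Alternatives are tried left to right at each position, so a string literal
# is consumed whole before the comment pattern can match inside it.
token_re = re.compile(r'''
    (?P<STRING>"(?:\\.|[^"\\])*")      # double-quoted string with escapes
  | (?P<COMMENT>/[*][\s\S]*?[*]/)      # the multiline comment pattern
''', re.VERBOSE)


def block_comments(source):
    """Return only the COMMENT tokens found in source."""
    return [m.group('COMMENT') for m in token_re.finditer(source)
            if m.lastgroup == 'COMMENT']


code = 'char *s = "/* not a comment */"; /* real */'
naive = re.findall(r'/[*][\s\S]*?[*]/', code)

print(naive)                 # the bare pattern also matches inside the string
print(block_comments(code))  # the tokenizer keeps only the real comment
```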
Why do you need to use a regular expression? Use the find() or partition() functions of strings.
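For reference, the find() suggestion could look roughly like the sketch below. The function name comments_with_find is hypothetical, and stopping at an unterminated comment is a choice made for this sketch. It has the same blind spot for string literals as the plain regular expression.

```python
def comments_with_find(source):
    """Collect /* ... */ comments using only str.find()."""
    comments = []
    pos = 0
    while True:
        start = source.find('/*', pos)
        if start == -1:
            break
        # search after the opener so that "/*/" is not taken as a whole comment
        end = source.find('*/', start + 2)
        if end == -1:
            break   # unterminated comment: stop (an assumption of this sketch)
        comments.append(source[start:end + 2])
        pos = end + 2
    return comments


ccode = '''/* this is my
first comment it also has a * in it.
Now I end my first comment */
int a = 3;
/* this is my second
multiline comment */'''

found = comments_with_find(ccode)
print(len(found))
```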
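Hi @Steve, I am writing a tokenizer that can extract comments as tokens. I don't think find() and partition() are powerful enough.
Hi Artur, I just found this in the ANSI C sample implementation in the ply examples: r'/\*(.|\n)*?\*/' does it :)
No: this regex handles only a small fraction of the cases that can occur in C/C++ files. Try this: /*func("*/abc")*/
Thanks perreal, this works. I checked the sample implementation that ships with ply; r'/\*(.|\n)*?\*/' works too :)
What about this example: /*func("*/abc");*/ ;-)
@Artur, right, which is why writing a parser is the better idea.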