Lexing multi-line define statements with antlr4

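I am trying to write a lexer for preprocessing that can handle multi-line define statements. For example, consider an input in which a multi-line define is terminated by the following blank line (which may nevertheless contain whitespace).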

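The first step is to tokenize the input stream so that, for any define, the value is obtained as a single string token. For example, for YY I am trying to get "pqr+abc" as its string value. I wrote the following lexer for tokenization: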

DEF: '#define' -> pushMode(def_mode);
ID: Letter (Letter | DecDigit)* ;
COMMENT : '/*' .*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' ~('\n'|'\r')* NL  -> channel(HIDDEN);
WS: ( ' ' |'\t' | NL )+     -> channel(HIDDEN) ;
SEMICOLN: ';' ;
COMMA: ',' ;
OB: '(' ;
CB: ')' ;
PLUS: '+' ;
fragment NL : '\r'? '\n' ;
fragment DecDigit: '0'..'9' ;
fragment Letter: 'A'..'Z' | 'a'..'z' | '_' ;
mode def_mode;
    STR2: '\r'? '\n' -> popMode;
    STR1: ~('\n'|'\r')* '\r'? '\n' ;

The lexer above produces the following tokens for the #define lines:

[@10,26:32='#define',<2>,6:0]
[@11,33:42=' YY pqr \\n',<13>,6:7]
[@12,43:51='    +abc\n',<13>,7:0]
[@13,52:52='\n',<12>,8:0]
[@14,53:57='class',<3>,9:0]

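These tokens are obtained only when the #define lines are followed by a truly empty line. If that line contains some whitespace, i.e. it is not really empty, the mode is not exited. Here are the tokens when the line contains spaces: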

[@10,26:32='#define',<2>,6:0]
[@11,33:42=' YY pqr \\n',<13>,6:7]
[@12,43:51='    +abc\n',<13>,7:0]
[@13,52:55='   \n',<13>,8:0]
[@14,56:74='class p(XX,YY,zz);\n',<13>,9:0]
[@15,75:83='endclass\n',<13>,10:0]
[@16,84:84='\n',<12>,11:0]
[@17,85:84='<EOF>',<-1>,12:0]

In addition, the lexer does not join the two lines into one. How can I fix these problems?
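The behaviour described above follows from how ANTLR4 chooses between lexer rules: the longest match wins, and on a tie the rule listed first wins. A line of blanks is matched in full by STR1 but not at all by STR2, so popMode never fires. The following Java sketch only imitates that choice with regular expressions, it does not use the ANTLR runtime. The patterns are approximations of the def_mode rules, and STR2_RELAXED is a hypothetical variant (roughly `[ \t]* '\r'? '\n'` in grammar terms) that is not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex imitation of how def_mode chooses between STR2 and STR1. */
final class DefModeSketch {
    // Approximations of the two def_mode rules from the question.
    static final Pattern STR2_ORIGINAL = Pattern.compile("\\r?\\n");
    static final Pattern STR1 = Pattern.compile("[^\\n\\r]*\\r?\\n");
    // Hypothetical variant of STR2 that tolerates blanks before the newline.
    static final Pattern STR2_RELAXED = Pattern.compile("[ \\t]*\\r?\\n");

    /** Number of characters matched at the start of input, or -1 if no match. */
    static int matchLength(Pattern rule, String input) {
        Matcher m = rule.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    /** Longest match wins; on a tie the rule listed first (STR2) wins. */
    static String winner(Pattern str2, String input) {
        int len2 = matchLength(str2, input);
        int len1 = matchLength(STR1, input);
        if (len2 < 0 && len1 < 0) {
            return "none";
        }
        return len2 >= len1 ? "STR2" : "STR1";
    }

    public static void main(String[] args) {
        System.out.println(winner(STR2_ORIGINAL, "\n"));         // STR2: mode is popped
        System.out.println(winner(STR2_ORIGINAL, "   \n"));      // STR1: mode is kept
        System.out.println(winner(STR2_RELAXED, "   \n"));       // STR2: mode is popped
        System.out.println(winner(STR2_RELAXED, "    +abc\n"));  // STR1: ordinary line
    }
}
```

If the same relaxation were applied to STR2 in the grammar, a whitespace-only line would be expected to pop the mode as well. Joining the lines into one value is a separate matter that this sketch does not address.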

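I am not an expert (one grammar a year), and I like neither modes nor doing too much processing in the lexer. The parser is more powerful. Take a look at this: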
grammar Question;

/* Parsing preprocessor #define */

program
    :   statement+
    ;

statement
    :   aClass
    |   function
    |   preprocessor
    ;

aClass
    :   'class'
        classDef // classBody
        'endclass'
    ;

classDef
    :   ID '(' list ')' ';'
    ;

function
    :   ID '(' list ')' ';'
    ;

preprocessor
    :   DEFINE ID replacement
        {System.out.println($DEFINE.text + " value=" + $ID.text + " -> replaced by " + $replacement.text);}
    ;

replacement
    :   expr+
    ;

expr
    :   ID
    |   ID '+' ID
    ;

list
    :   ID ( ',' ID )*
    ;

ID  :   LETTER ALPHAMERIC* ;
DEFINE
    :   '#' 'define' ;
COMMENT
    :   '/*' .*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT
    :   '//' ~('\r' | '\n')* -> channel(HIDDEN) ;
WS  :   [ \t\r\n]+ -> channel(HIDDEN) ; // keep spaces in $<rule>.text
//WS  :     [ \t\r\n]+ -> skip ;
CONTINUATION
// if you want to keep the exact value including NL :
//  :   '\\' '\r'? '\n' -> channel(HIDDEN) ;
// to discard the continuation character :
    :   '\\' '\r'? '\n' -> skip ; // ignored as in Ruby, concatenates two lines

fragment LETTER     : [a-zA-Z_] ;
fragment DIGIT      : [0-9] ;
fragment ALPHAMERIC : LETTER | DIGIT ;

The output is:
$ grun Question program -tokens data.txt 
[@0,0:26='/* function\n        call */',<COMMENT>,channel=1,1:0]
[@1,27:27='\n',<WS>,channel=1,2:15]
[@2,28:29='aa',<ID>,3:0]
...
[@8,36:42='#define',<DEFINE>,4:0]
[@9,43:43=' ',<WS>,channel=1,4:7]
[@10,44:45='XX',<ID>,4:8]
[@11,46:46=' ',<WS>,channel=1,4:10]
[@12,47:49='pqr',<ID>,4:11]
[@13,50:50='\n',<WS>,channel=1,4:14]
...
#define value=XX -> replaced by pqr
#define value=YY -> replaced by long replacement value
#define value=ZZ -> replaced by stu     +abc
#define value=WW -> replaced by vwx     +def
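The blanks in "stu     +abc" come from the WS tokens, which stay in the token stream on the hidden channel, while the skipped CONTINUATION token leaves no text behind. As a rough illustration, the Java sketch below reproduces that effect on plain strings. It is only a text-level imitation with hypothetical helper names, not a use of the ANTLR runtime.

```java
/** String-level imitation of the CONTINUATION and WS rules in the grammar. */
final class ReplacementTextSketch {
    /** Drops backslash-newline pairs, as the skipped CONTINUATION token would. */
    static String dropContinuations(String text) {
        return text.replaceAll("\\\\\\r?\\n", "");
    }

    /** Drops blanks, tabs and line breaks, as a "WS -> skip" rule would. */
    static String dropWhitespace(String text) {
        return text.replaceAll("[ \\t\\r\\n]+", "");
    }

    public static void main(String[] args) {
        String zz = "stu \\\n    +abc"; // replacement text of ZZ in data.txt

        // WS on the hidden channel: blanks survive, the continuation does not.
        System.out.println(dropContinuations(zz));                 // stu     +abc

        // WS skipped as well: nothing but the visible tokens remains.
        System.out.println(dropWhitespace(dropContinuations(zz))); // stu+abc
        System.out.println(dropWhitespace("long replacement value")); // longreplacementvalue
    }
}
```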

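If you use the skip version of WS, you will get pqr+abc and longreplacementvalue. I leave this to your sagacity.

Is it possible to add code that manually scans the input until a blank or empty line is seen, and then exits the mode?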
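One possible way to scan manually is a small pre-pass over the raw text before it reaches the lexer. This is a minimal Java sketch under stated assumptions, not something from the grammar above: a define is assumed to run until the first whitespace-only line or the end of the input, continuation backslashes are dropped, and all whitespace is removed from the value so that YY yields "pqr+abc". The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Pre-pass that collects each #define up to the next blank line. */
final class DefineScanSketch {
    /** Returns NAME -> value in order of appearance. */
    static Map<String, String> scanDefines(String source) {
        Map<String, String> defines = new LinkedHashMap<>();
        String[] lines = source.split("\\r?\\n", -1);
        int i = 0;
        while (i < lines.length) {
            String first = lines[i++].trim();
            if (!first.startsWith("#define")) {
                continue; // not a define: left for the real lexer
            }
            StringBuilder body = new StringBuilder();
            appendWithoutContinuation(body, first.substring("#define".length()));
            // Assumption: the define ends at the first whitespace-only line
            // (or at the end of the input), as described in the question.
            while (i < lines.length && !lines[i].trim().isEmpty()) {
                appendWithoutContinuation(body, lines[i++]);
            }
            String[] parts = body.toString().trim().split("\\s+", 2);
            if (parts[0].isEmpty()) {
                continue; // "#define" without a name
            }
            // Assumption: whitespace inside the value is not significant.
            String value = parts.length > 1 ? parts[1].replaceAll("\\s+", "") : "";
            defines.put(parts[0], value);
        }
        return defines;
    }

    /** Appends one line, minus surrounding blanks and a trailing backslash. */
    private static void appendWithoutContinuation(StringBuilder out, String line) {
        String text = line.trim();
        if (text.endsWith("\\")) {
            text = text.substring(0, text.length() - 1);
        }
        out.append(' ').append(text);
    }

    public static void main(String[] args) {
        String source = "#define YY pqr \\\n    +abc\n   \nclass p(XX,YY,zz);\nendclass\n";
        System.out.println(scanDefines(source)); // {YY=pqr+abc}
    }
}
```

Because a define here only ends at a blank line, two defines on consecutive lines would be merged into one. Input laid out that way would need an extra stop condition.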
Input file data.txt:
/* function
        call */
aa(bb);
#define XX pqr
#define YY long replacement value
// multiline :
#define ZZ stu \
    +abc

class p(XX,YY,zz);
endclass
#define WW vwx \
    +def

// preceding line contains 10 spaces

$ hexdump -C data.txt 
...
000000b0  2b 64 65 66 0a 20 20 20  20 20 20 20 20 20 20 0a  |+def.          .|
000000c0  2f 2f 20 70 72 65 63 65  64 69 6e 67 20 6c 69 6e  |// preceding lin|
000000d0  65 20 63 6f 6e 74 61 69  6e 73 20 31 30 20 73 70  |e contains 10 sp|
000000e0  61 63 65 73                                       |aces|
000000e4
