Bison: Format validation using lex and yacc

Suppose I have a file containing strings like these:

qwerty01234xy+-/
rtweqq22222xx+++

aaaaaa01W56ss--1

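The first 6 characters are from [A-Z], then come 5 characters from [0-9], then 2 characters from [A-Z], and finally 3 characters from [+-/]. I want to write a format checker that reports syntax errors. What I have been doing so far looks like this: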

The lex file:

<code>
...
/*states*/
%x WORD1_STATE
%x NUMBER_STATE
%x WORD22_STATE
%x ETC_STATE
%%
...
yy_push_state(ETC_STATE)
yy_push_state(WORD22_STATE)
yy_push_state(NUMBER_STATE)
yy_push_state(WORD1_STATE)
...
 /*rules*/
<WORD1_STATE>^[A-Z]{6}    yy_pop_state(); yylval.string=strdup(yytext); return WORD1;
<NUMBER_STATE>[0-9]{5}    yy_pop_state(); yylval.string=strdup(yytext); return NUMBER;
<WORD22_STATE>[A-Z]{2}    yy_pop_state(); yylval.string=strdup(yytext); return WORD2;
<ETC_STATE>[+-/]{3}    yy_pop_state(); yylval.string=strdup(yytext); return ETC;

\n        /*do nothing*/
<*>.      fprintf(stderr, "Bad character at line %d column %d: \"%s\"\n", yylloc.first_line, yylloc.first_column, yytext); yy_pop_state();
</code>

My goal is the following: if the checker sees lines like these:

qwerty01234xy+-/
rtweqq22222xx+++

aaaaaa01W56ss--1

it should produce the following errors:

Bad character in NUMBER at line x at column 9
Bad character in ETC at line x at column 16

Is this the right direction? Of course, my code doesn't work. :)

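For a check this simple (you haven't shown a grammar), you don't need yacc. If you don't really care about precise error messages, you could write a grep one-liner that filters out the bad lines. If your motivation is to produce error messages, awk is a concise solution: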

#!/usr/bin/awk -f

# Lines that match the whole format are fine; skip them.
/^[[:alpha:]]{6}[[:digit:]]{5}[[:alpha:]]{2}[-+\/]{3}$/ { next }

{
    if (substr($0, 1, 6) !~ /^[[:alpha:]]{6}$/)
        print "first six chars in " NR, substr($0, 1, 6);
    # check for other mistakes
}

If the goal is to learn lex/yacc, J. P. Bennett's "Introduction to Compiling Techniques" is a fairly practical introduction, albeit a dated one.

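If you are able to predict the state sequence perfectly in your lexer, then there is no need to use yacc; it really isn't providing any useful facility here. On the other hand, if the grammar is more complicated than a simple sequence of patterns, you might need yacc; in that case, you should provide a more accurate example.

In any case, pushing states onto a stack is not a very effective mechanism for handling progression. It is generally easier to build a simple state machine with the BEGIN macro.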

Here is a basic lexer for your example:

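%s NUMBER WORD2 ETC EOL RECOVER

%%
    /* Any indented text before the first rule is inserted
     * at the top of the yylex function.
     */
    int error_count = 0;
    /* [-+/] lists the three characters literally; [+-/] would be
     * a range that also admits ',' and '.'.
     */
<INITIAL>[A-Z]{6} BEGIN(NUMBER);
<NUMBER>[0-9]{5}  BEGIN(WORD2);
<WORD2>[A-Z]{2}   BEGIN(ETC);
<ETC>[-+/]{3}     BEGIN(EOL);
<EOL>" "*\n       BEGIN(INITIAL);
<RECOVER>.*       BEGIN(EOL);
.                 { signal_error(); ++error_count; yyless(0); BEGIN(RECOVER); }
\n                { /* Premature end of line: there is nothing for RECOVER
                     * to skip, so hand the newline straight to EOL. */
                    signal_error(); ++error_count; yyless(0); BEGIN(EOL);
                  }
<<EOF>>           return error_count != 0;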

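(Note the use of yyless(0) when an error is detected. It returns the offending character to the input so that it is rescanned in the new start condition, which avoids some convoluted logic around newlines and keeps the line and column counters correct. Also, all newline handling is centralised in the rules for the EOL start condition, in case it needs to be modified later.)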

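Now it is only necessary to write the error reporter, for which we need to map states to strings, plus the main driver, and to add what is needed to avoid compiler warnings: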

%{
#  include <stdio.h>

void signal_error(int state, int line, int column);
%}
%option noyywrap nounput noinput

%s NUMBER WORD2 ETC EOL RECOVER

%%
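    int error_count = 0;
    int line = 1, column = 1;
    /* [-+/] lists the three characters literally; [+-/] would be
     * a range that also admits ',' and '.'.
     */
<INITIAL>[A-Z]{6} BEGIN(NUMBER);  column += yyleng;
<NUMBER>[0-9]{5}  BEGIN(WORD2);   column += yyleng;
<WORD2>[A-Z]{2}   BEGIN(ETC);     column += yyleng;
<ETC>[-+/]{3}     BEGIN(EOL);     column += yyleng;
<EOL>" "*\n       BEGIN(INITIAL); ++line; column = 1;
<RECOVER>.*       BEGIN(EOL);
.                 { signal_error(YY_START, line, column);
                    ++error_count; yyless(0); BEGIN(RECOVER);
                  }
\n                { /* Premature end of line: there is nothing for RECOVER
                     * to skip, so hand the newline straight to EOL. */
                    signal_error(YY_START, line, column);
                    ++error_count; yyless(0); BEGIN(EOL);
                  }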
<<EOF>>           return error_count != 0;
%%

typedef struct { int state; const char* name; } StateToName;
const StateToName state_to_name[] = {
  { INITIAL, "in WORD1" },
  { NUMBER,  "in NUMBER"},
  { WORD2,   "in WORD2" },
  { ETC,     "in ETC"   },
  { EOL,     "at end of line"},
  { -1,      NULL}
};

const char* find_name(int state) {
  for (const StateToName* ent = state_to_name; ent->name; ++ent)
    if (state == ent->state) return ent->name;
  return "in unknown state";
}

void signal_error(int state, int line, int column) {
  fprintf(stderr, "Bad character %s at line %d, column %d\n",
                  find_name(state), line, column);
}

int main(void) {
  /* yylex returns nonzero if any line failed validation. */
  return yylex();
}

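In the above, it was necessary to add ERROR to the list of start conditions, but since the start conditions are inclusive, no rule needs to be explicitly marked with that condition.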

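The problem is that the code I showed is only a simplification; I have to handle more tokens per line, and because of other features (such as the error messages you mentioned), a "regex only" solution is not suitable. Thanks for the suggestion anyway.

You don't need lex or yacc to solve this. All you need is a regular expression.

Thank you very much for the wonderful explanation. One more thing I would like to do: what has to be done to report multiple errors in one line? Currently, once one error is recognized, the next error in the current line is not "reported".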
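
One possible way to get several reports from a single line, sketched here in plain C rather than flex, relies on the fields being fixed-width: after a bad character, the checker can resynchronise at the column where the next field is supposed to start instead of skipping the rest of the line. This is only an illustration under stated assumptions, not the method from the answer above. The names Field, fields and check_line are hypothetical, and letters are accepted in either case (as the awk answer does with [[:alpha:]]) because the sample lines are lowercase.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one fixed-width field. */
typedef struct {
    const char *name;       /* label used in error messages */
    int width;              /* number of characters in the field */
    int (*accepts)(int c);  /* nonzero if c is valid in this field */
} Field;

/* Assumption: letters of either case are valid, as in the awk answer. */
static int is_letter(int c) { return isalpha(c) != 0; }
static int is_digit(int c)  { return isdigit(c) != 0; }
static int is_etc(int c)    { return c == '+' || c == '-' || c == '/'; }

static const Field fields[] = {
    { "WORD1",  6, is_letter },
    { "NUMBER", 5, is_digit  },
    { "WORD2",  2, is_letter },
    { "ETC",    3, is_etc    },
};

/* Check one line (without its trailing newline). Writes at most one
 * message per field to `out` and returns the number of errors found. */
int check_line(const char *text, int line, FILE *out) {
    int len = (int)strlen(text);
    int pos = 0;     /* 0-based column where the current field starts */
    int errors = 0;

    for (size_t f = 0; f < sizeof fields / sizeof fields[0]; ++f) {
        for (int i = 0; i < fields[f].width; ++i) {
            int col = pos + i;
            if (col >= len) {
                fprintf(out, "Line %d ends early in %s at column %d\n",
                        line, fields[f].name, col + 1);
                return errors + 1;  /* nothing left to examine */
            }
            if (!fields[f].accepts((unsigned char)text[col])) {
                fprintf(out, "Bad character in %s at line %d at column %d\n",
                        fields[f].name, line, col + 1);
                ++errors;
                break;  /* one report per field, then resynchronise */
            }
        }
        pos += fields[f].width;  /* the next field starts at a fixed column */
    }
    if (len > pos) {
        fprintf(out, "Unexpected text at line %d at column %d\n", line, pos + 1);
        ++errors;
    }
    return errors;
}
```

Called as check_line("aaaaaa01W56ss--1", 3, stderr), this sketch reports a bad character in NUMBER at column 9 and another in ETC at column 16, which are the two messages asked for in the question. The same resynchronisation idea does not carry over directly to variable-width tokens, where the checker cannot know where the next field begins.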