PEG for Python-style indentation

How would you write a parser, in any of the following parser generators (,), that can handle Python/Haskell/CoffeeScript-style indentation:

Example from a programming language that does not exist yet:

square x =
    x * x



I think an indentation-sensitive language like this is context-sensitive, and I believe PEG can only handle context-free languages.


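Note that while Nalpy's answer is certainly correct that PEG.js can do this via external state (i.e. the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions), but a parent rule can later fail, which makes the action run invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful and could lead to tremors, vomiting, and death. Some of the issues and solutions are discussed in the comments here: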
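The hazard can be shown without any parser generator. The following is a minimal sketch in plain JavaScript, not PEG.js output: `indentWithAction`, `defLine`, `unsafeChoice` and `safeChoice` are hypothetical names invented for illustration. A rule's action pushes onto a shared stack, the parent rule then fails, and the ordered choice moves on with the stale entry still in place. Snapshotting the state before an alternative and restoring it on failure is one possible remedy.

```javascript
// Hypothetical miniature of the problem; this is not PEG.js-generated code.
const indentStack = [];

// "Rule" that matches leading whitespace and, as a side effect, records its width.
function indentWithAction(input) {
  const width = input.length - input.trimStart().length;
  indentStack.push(width);          // the action runs as soon as the rule matches
  return input.slice(width);
}

// Parent rule: indentation followed by "def". It can fail after the action already ran.
function defLine(input) {
  const rest = indentWithAction(input);
  return rest.startsWith("def") ? rest : null;
}

// Ordered choice without protection: a failed alternative leaves its push behind.
function unsafeChoice(input) {
  return defLine(input) ?? indentWithAction(input);
}

// Ordered choice that snapshots the state and restores it when an alternative fails.
function safeChoice(input) {
  const saved = indentStack.length;
  const result = defLine(input);
  if (result !== null) return result;
  indentStack.length = saved;       // roll back the failed alternative's pushes
  return indentWithAction(input);
}

unsafeChoice("  x = 1");
console.log(indentStack);           // → [ 2, 2 ]  (one entry is stale)
indentStack.length = 0;
safeChoice("  x = 1");
console.log(indentStack);           // → [ 2 ]
```

The rollback only works here because the state is a simple stack whose length can be restored. Richer state would need a proper copy or an undo log.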

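So what we are really doing here with indentation is creating something like C-style blocks, which often have their own lexical scope. If I were writing a compiler for a language like that, I think I would try to have the lexer keep track of the indentation. Every time the indentation increases, it could insert a "{" token. Likewise, every time it decreases, it could insert a "}" token. Writing an expression grammar with explicit curly braces to represent lexical scope then becomes more straightforward.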
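A minimal sketch of such a lexer pass, in plain JavaScript. The function name `bracesFromIndentation` is hypothetical, and the sketch assumes spaces-only indentation, skips blank lines, and treats each remaining line as a single token. A real lexer would tokenize the line contents as well.

```javascript
// Hypothetical helper: turns indentation changes into explicit "{" / "}" tokens.
// Assumes spaces-only indentation; blank lines are skipped.
function bracesFromIndentation(source) {
  const depths = [0];               // stack of currently open indentation widths
  const tokens = [];
  for (const line of source.split("\n")) {
    if (line.trim() === "") continue;
    const width = line.length - line.trimStart().length;
    if (width > depths[depths.length - 1]) {
      depths.push(width);           // deeper than before: open a block
      tokens.push("{");
    }
    while (width < depths[depths.length - 1]) {
      depths.pop();                 // shallower: close every block we left
      tokens.push("}");
    }
    if (width !== depths[depths.length - 1]) {
      throw new Error("inconsistent dedent: " + JSON.stringify(line));
    }
    tokens.push(line.trim());
  }
  while (depths.length > 1) {       // close blocks still open at end of input
    depths.pop();
    tokens.push("}");
  }
  return tokens;
}

console.log(JSON.stringify(bracesFromIndentation("square x =\n    x * x\n")));
// → ["square x =","{","x * x","}"]
```

The resulting token list can then be fed to an ordinary context-free grammar that only knows about braces.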

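Pure PEG cannot parse indentation.

But peg.js can.

I did a quick-and-dirty experiment (inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.

For a more complete solution (a complete parser), see this question: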

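/* Initializations */
{
  // Flatten the [tokens, text] pairs of all lines into one token stream.
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0]);
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  // Compare a line's indentation with the stack and emit the matching tokens.
  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text                      { return [depth, s] }
indent  = s:" "*                                   { return indent(s) }
text    = c:[^\n]*                                 { return c.join("") }
newline = "\n"                                     {}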
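depths is a stack of indentation widths. indent() returns an array of indentation tokens, and start() unwraps that array so that the parser behaves somewhat like a stream.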

For the following text:

alpha
  beta
  gamma
    delta
epsilon
    zeta
  eta
theta
  iota
peg.js produces these results:

[
   "alpha",
   "INDENT",
   "beta",
   "gamma",
   "INDENT",
   "delta",
   "DEDENT",
   "DEDENT",
   "epsilon",
   "INDENT",
   "zeta",
   "DEDENT",
   "BADDENT",
   "eta",
   "theta",
   "INDENT",
   "iota",
   "DEDENT",
   "",
   ""
]

This tokenizer even catches bad indents.
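A flat token stream like the one above still has to be turned into a nested structure by whatever consumes it. As a rough sketch of that step, the following plain JavaScript function folds the tokens into nested arrays. The name `nest` and the choice to reject BADDENT and to drop empty strings are assumptions made for illustration, not part of the tokenizer.

```javascript
// Hypothetical helper: folds a flat INDENT/DEDENT token list into nested arrays.
function nest(tokens) {
  const root = [];
  const stack = [root];             // stack[stack.length - 1] is the block being filled
  for (const token of tokens) {
    if (token === "INDENT") {
      const child = [];
      stack[stack.length - 1].push(child);
      stack.push(child);
    } else if (token === "DEDENT") {
      if (stack.length === 1) throw new Error("unbalanced DEDENT");
      stack.pop();
    } else if (token === "BADDENT") {
      throw new Error("inconsistent indentation");
    } else if (token !== "") {
      stack[stack.length - 1].push(token);   // ordinary line text
    }
  }
  return root;
}

const tokens = ["alpha", "INDENT", "beta", "gamma", "INDENT", "delta",
                "DEDENT", "DEDENT", "epsilon"];
console.log(JSON.stringify(nest(tokens)));
// → ["alpha",["beta","gamma",["delta"]],"epsilon"]
```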

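You can do this in Treetop by using semantic predicates. In this case you need a semantic predicate that detects the closing of a whitespace-indented block when another line appears with the same or lesser indentation. The predicate must count the indentation from the opening line and return true (block closed) if the current line's indentation ends at the same or a shorter length. Because the closing condition is context-dependent, it must not be memoized. Here is the example code I am about to add to Treetop's documentation. Note that I have overridden Treetop's SyntaxNode inspect method to make it easier to visualise the result.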

grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

  rule block
    indented_line       # The block's opening line
    &{|s|               # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks       # Parse any nested blocks
    &{|s|               # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
            "\n{\n" +
            nested_blocks.elements.map { |content|
              content.block.inspect+"\n"
            }*'' +
            "}"
          )
          : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end
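For comparison, the same closing rule can be written as a hand-rolled recursive descent in plain JavaScript. This is only a sketch of the idea, not a translation of Treetop's generated parser: `parseBlocks` and `parse` are hypothetical names, blank lines are dropped, and indentation is assumed to be spaces. The enclosing indentation is passed as an argument instead of being kept on a shared stack, so there is no state to push, pop, or memoize incorrectly.

```javascript
// Hypothetical hand-written equivalent; the names are illustrative, not Treetop API.
function parseBlocks(lines, pos, enclosingIndent) {
  const blocks = [];
  while (pos < lines.length) {
    const text = lines[pos].trimStart();
    const indent = lines[pos].length - text.length;
    // Closing condition: same or smaller indentation than the enclosing block.
    if (indent <= enclosingIndent) break;
    // This line opens a block; everything indented deeper belongs to it.
    const nested = parseBlocks(lines, pos + 1, indent);
    blocks.push({ text, children: nested.blocks });
    pos = nested.pos;
  }
  return { blocks, pos };
}

function parse(source) {
  const lines = source.split("\n").filter((line) => line.trim() !== "");
  return parseBlocks(lines, 0, -1).blocks;   // -1 is the sentinel, as in the grammar
}

console.log(JSON.stringify(parse("def foo\n  a\n    b\n  c\nbar\n"), null, 2));
```

Each block comes back as an object with its opening line in `text` and its nested blocks in `children`.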
I know this is an old thread, but I just wanted to add some PEGjs code to the answers. This code parses a piece of text and "nests" it into a sort of "AST-ish" structure. It only goes one level deep and it looks ugly. It also does not really use the return values to build the right structure; instead it keeps an in-memory tree of the syntax and returns that at the end. This may well become unwieldy and cause some performance issues, but at least it does what it is supposed to.

Note: make sure you use tabs rather than spaces.

{ 
    var indentStack = [], 
        rootScope = { 
            value: "PROGRAM",
            values: [], 
            scopes: [] 
        };

    function addToRootScope(text) {
        // Here we wiggle with the form and append the new
        // scope to the rootScope.

        if (!text) return;

        if (indentStack.length === 0) {
            rootScope.scopes.unshift({
                text: text,
                statements: []
            });
        }
        else {
            rootScope.scopes[0].statements.push(text);
        }
    }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    { 
        return rootScope;
    }


line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
        console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
        indentStack.push(""); 
        console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
    = i:[\t]* &{ return i.length < indentStack.length; }
      {
          // Pop until the stack depth matches this line's tab count.
          var popped = indentStack.length - i.length;
          while (indentStack.length > i.length) {
              indentStack.pop();
          }
          console.log("d:", popped, " level:", indentStack.length);
      }

text
    = numbers: number+  { return numbers.join(""); } 
    / txt: character+   { return txt.join(""); }


number
    = $[0-9] 

character 
    = $[ a-zA-Z->+]

__
    = [ ]+

_ 
    = [ ]*

EOF 
    = !.

EOL
    = "\r\n" 
    / "\n" 
    / "\r"
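Here is a small test program for trying out the Treetop grammar IndentedBlocks from the answer above (this assumes the grammar file is saved where require 'indented_blocks' can find it):

require 'polyglot'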
require 'treetop'
require 'indented_blocks'

parser = IndentedBlocksParser.new

input = <<END
def foo
  here is some indented text
    here it's further indented
    and here the same
      but here it's further again
      and some more like that
    before going back to here
      down again
  back twice
and start from the beginning again
  with only a small block this time
END

parse_tree = parser.parse input

p parse_tree