Resources for lexing Python


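As a learning exercise, I have started writing a Python lexer in Python. Eventually I would like to implement a simple subset of Python that can run itself, so I want this lexer to be written in a reasonably simple subset of Python, with as few imports as possible.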

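The tutorials on lexical analysis that I have found, for example, look ahead only one character to decide which token comes next, but I am afraid this is not enough for Python. For one thing, a single character does not let you tell a delimiter from an operator, or an identifier from a keyword; for another, handling indentation looks like a whole new beast to me; and there are other issues besides.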

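I found this very useful. However, when I tried to implement it, my code quickly started to look very ugly, with lots of if statements and special cases, and it did not seem like the "right" way to do it.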

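Are there any good resources that would help or teach me to write this kind of code? (I would also like to parse it fully, but first things first, right?)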

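I am not opposed to using a parser generator, but I want the resulting Python code to use a simple subset of Python and to be reasonably self-contained, so that I can at least dream of having a language that can interpret itself. (For example, as I understand it, if I used ply, my language would have to interpret the ply package as well as itself, which I imagine would make things more complicated.)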

Take a look at this; you may find it useful for your task.

Consider looking at PyPy, an implementation of Python written in Python. It obviously has a Python parser too.

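I have used traditional lexer and parser generator tools on similar projects in the past. I have also used ply (Python Lex-Yacc), and I found that the skills transfer from one to the other.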

So if you have never written a parser before, I would write your first one using ply; you will learn some useful skills for later projects.

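Once your ply parser is working, you can write one by hand as a learning exercise. In my experience, hand-written lexers and parsers get very messy very quickly, which is why parser generators have been so successful.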

This simple regex-based lexer has served me very well on several occasions:

#-------------------------------------------------------------------------------
# lexer.py
#
# A generic regex-based Lexer/tokenizer tool.
# See the if __main__ section in the bottom for an example.
#
# Eli Bendersky (eliben@gmail.com)
# This code is in the public domain
# Last modified: August 2010
#-------------------------------------------------------------------------------
import re
import sys


class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        # All the regexes are concatenated into a single one
        # with named groups. Since the group names must be valid
        # Python identifiers, but the token types used by the 
        # user are arbitrary strings, we auto-generate the group
        # names and map them to token types.
        #
        idx = 1
        regex_parts = []
        self.group_type = {}

        for regex, type in rules:
            groupname = 'GROUP%s' % idx
            regex_parts.append('(?P<%s>%s)' % (groupname, regex))
            self.group_type[groupname] = type
            idx += 1

        self.regex = re.compile('|'.join(regex_parts))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf, self.pos)

                if m:
                    self.pos = m.start()
                else:
                    return None

            m = self.regex.match(self.buf, self.pos)
            if m:
                groupname = m.lastgroup
                tok_type = self.group_type[groupname]
                tok = Token(tok_type, m.group(groupname), self.pos)
                self.pos = m.end()
                return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while True:
            tok = self.token()
            if tok is None:
                break
            yield tok


if __name__ == '__main__':
    rules = [
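        # Raw strings keep the backslashes intact for the regex engine.
        (r'\d+',            'NUMBER'),
        # \w* (not \w+) so that one-character identifiers also match.
        (r'[a-zA-Z_]\w*',   'IDENTIFIER'),
        (r'\+',             'PLUS'),
        (r'\-',             'MINUS'),
        (r'\*',             'MULTIPLY'),
        (r'\/',             'DIVIDE'),
        (r'\(',             'LP'),
        (r'\)',             'RP'),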
        ('=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position %s' % err.pos)
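
The question worries that one character of lookahead cannot separate operators from delimiters or identifiers from keywords. With a regex-based approach like the one above, a possible way around this is to list longer operators before their one-character prefixes and to reclassify identifiers with a keyword lookup after matching. The following is only a minimal standalone sketch of that idea; the `KEYWORDS` set, the token names, and the `tokenize` function are illustrative assumptions and not part of the lexer above, and the operator list covers only a small part of Python.

```python
import re

# Hypothetical, deliberately tiny keyword set for illustration only.
KEYWORDS = {'if', 'else', 'while', 'def', 'return'}

# Alternation picks the first alternative that matches, so the
# multi-character operators are listed before the single-character ones.
TOKEN_RE = re.compile(r'''
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>\*\*=?|//=?|!=|[-+*/%<>=]=?)
  | (?P<DELIM>[()\[\]{},:.])
  | (?P<WS>[ \t]+)
''', re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) pairs for a single line of input."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError('unexpected character %r at %d' % (text[pos], pos))
        kind = m.lastgroup
        value = m.group()
        pos = m.end()
        if kind == 'WS':
            continue  # spaces and tabs inside a line carry no meaning here
        if kind == 'NAME' and value in KEYWORDS:
            kind = 'KEYWORD'  # keywords are just reserved identifiers
        tokens.append((kind, value))
    return tokens


if __name__ == '__main__':
    for tok in tokenize('if x >= 10: y **= 2'):
        print(tok)
```

Because the whole identifier is matched first and only then compared with the keyword set, a name such as `iffy` stays a `NAME` rather than being split into `if` plus `fy`.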

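Indentation, the other concern raised in the question, is commonly handled with a stack of indentation widths that produces INDENT and DEDENT tokens, which is also how the Python language reference describes it. The sketch below shows that stack technique under simplifying assumptions: indentation uses spaces only, and there are no line continuations, multi-line strings, or open brackets spanning lines. The `indent_tokens` name and the `LINE` token are hypothetical, chosen for illustration.

```python
def indent_tokens(source):
    """Yield ('INDENT', width), ('DEDENT', width) and ('LINE', text) pairs.

    Assumes space-only indentation and one logical line per physical line.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # blank and comment-only lines do not affect indentation
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            yield ('INDENT', width)
        while width < stack[-1]:
            stack.pop()
            yield ('DEDENT', width)
        if width != stack[-1]:
            # The line dedented to a width that never opened a block.
            raise IndentationError('inconsistent dedent')
        yield ('LINE', stripped)
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        yield ('DEDENT', 0)


if __name__ == '__main__':
    sample = 'if x:\n    y = 1\n    if y:\n        z = 2\nw = 3\n'
    for tok in indent_tokens(sample):
        print(tok)
```

In a full lexer, the text of each `LINE` would be passed on to the regular token matcher, so the indentation logic stays separate from the per-token rules.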