Parsing: how to write a grammar for an expression when it can have several possible forms


I have some sentences that I need to convert to regex code and I was trying to use Pyparsing for it. The sentences are basically search rules, telling us what to search for.

Examples of the sentences -

  • LINE_CONTAINS this is a phrase - this is an example search rule telling that the line being searched should have the phrase this is a phrase

  • LINE_STARTSWITH while we - this is an example search rule telling that the line being searched should start with the phrase while we

  • These rules can also be combined, for example - LINE_CONTAINS phrase1 BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH but we

  • A list of all the actual sentences is available, if necessary.

    All the lines start with either of the two notations mentioned above (call them line_directives). Now, I am trying to parse these sentences and then convert them to regex code. I started writing a BNF for my grammar, and this is what I came up with -

    lpar ::= '{'
    rpar ::= '}'
    line_directive ::= LINE_CONTAINS | LINE_STARTSWITH
    phrase ::= lpar(?) + (word+) + rpar(?) # meaning if a phrase is parenthesized, it's still the same
    
    upto_N_words ::= lpar + 'UPTO' + num + 'WORDS' + rpar
    N_words ::= lpar + num + 'WORDS' + rpar
    upto_N_characters ::= lpar + 'UPTO' + num + 'CHARACTERS' + rpar
    N_characters ::= lpar + num + 'CHARACTERS' + rpar
    
    JOIN_phrase ::= phrase + JOIN + phrase
    AND_phrase ::= phrase (+ AND + phrase)+
    OR_phrase ::= phrase (+ OR + phrase)+
    BEFORE_phrase ::= phrase (+ BEFORE + phrase)+
    AFTER_phrase ::= phrase (+ AFTER + phrase)+
    
    braced_OR_phrase ::= lpar + OR_phrase + rpar
    braced_AND_phrase ::= lpar + AND_phrase + rpar
    braced_BEFORE_phrase ::= lpar + BEFORE_phrase + rpar
    braced_AFTER_phrase ::= lpar + AFTER_phrase + rpar
    braced_JOIN_phrase ::= lpar + JOIN_phrase + rpar
    
    rule ::= line_directive + subrule
    final_expr ::= rule (+ AND/OR + rule)+
    
    The problem is the subrule - based on my empirical data, I have been able to come up with all the following expressions for it -

    subrule ::= phrase
            ::= OR_phrase
            ::= JOIN_phrase
            ::= BEFORE_phrase
            ::= AFTER_phrase
            ::= AND_phrase
            ::= phrase + upto_N_words + phrase
            ::= braced_OR_phrase + phrase
            ::= phrase + braced_OR_phrase
            ::= phrase + braced_OR_phrase + phrase
            ::= phrase + upto_N_words + braced_OR_phrase
            ::= phrase + upto_N_characters + phrase
            ::= braced_OR_phrase + phrase + upto_N_words + phrase
            ::= phrase + braced_OR_phrase + upto_N_words + phrase
    
    To give an example, one of my sentences is: The objective of this study was to {identify OR identification} upregulated genes. For this one, the subrule mentioned above is phrase + braced_OR_phrase + phrase.

    So, my question is: how do I write a simple BNF grammar expression for the subrule, so that I can easily write a grammar for it using Pyparsing? Also, any comments on my present technique are absolutely welcome.


    EDIT: After applying the principles that @Paul laid out in his answer, here is an MCVE version of the code. It takes a list of sentences to be parsed, hrrsents, parses each sentence, converts it to the corresponding regex, and returns a list of regex strings -

    from pyparsing import *
    import re
    
    
    def parse_hrr(hrrsents):
        UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
        LBRACE,RBRACE = map(Suppress, "{}")
        integer = pyparsing_common.integer()
    
        LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
            """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
        BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
        keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH
    
        class Node(object):
            def __init__(self, tokens):
                self.tokens = tokens
    
            def generate(self):
                pass
    
        class LiteralNode(Node):
            def generate(self):
                return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
    
        class ConsecutivePhrases(Node):
            def generate(self):
                join_these=[]
                tokens = self.tokens[0]
                for t in tokens:
                    tg = t.generate()
                    join_these.append(tg)
                seq = []
                for word in join_these[:-1]:
                    if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
                        seq.append(word + "")
                    else:
                        seq.append(word + "\s+")
                seq.append(join_these[-1])
                result = "".join(seq)
                return result
    
        class AndNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                join_these=[]
                for t in tokens[::2]:
                    tg = t.generate()
                    tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
                    join_these.append(tg_mod)
                joined = ''.join(ele for ele in join_these)
                full = '('+ joined+')'
                return full
    
        class OrNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                joined = '|'.join(t.generate() for t in tokens[::2])
                full = '('+ joined+')'
                return full
    
        class LineTermNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                ret = ''
                dir_phr_map = {
                    'LINE_CONTAINS': lambda a:  r"((?:(?<=^)|(?<=[\W_]))" + a + r"(?=[\W_]|$))456", 
                    'PARA_STARTSWITH':
                        lambda a: ( r"(^" + a + r"(?=[\W_]|$))457") if 'gene' in repr(a)
                        else (r"(^" + a + r"(?=[\W_]|$))458")}
    
                for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                    ret = dir_phr_map[line_dir](phr_term.generate())
                return ret
    
        class LineAndNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                return '&&&'.join(t.generate() for t in tokens[::2])
    
        class LineOrNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                return '@@@'.join(t.generate() for t in tokens[::2])
    
        class UpToWordsNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                ret = ''
                word_re = r"([\w]+\s*)"
                for op, operand in zip(tokens[1::2], tokens[2::2]):
                    # op is the parsed integer N from the "{UPTO N WORDS}" expression
                    ret += "(%s{0,%d})" % (word_re, op)
                return ret
    
        class UpToCharactersNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                ret = ''
                char_re = r"\w"
                for op, operand in zip(tokens[1::2], tokens[2::2]):
                    # op is the parsed integer N from the "{UPTO N CHARACTERS}" expression
                    ret += "((%s){0,%d})" % (char_re, op)
                return ret
    
        class BeforeAfterJoinNode(Node):
            def generate(self):
                tokens = self.tokens[0]
                operator_opn_map = {'BEFORE': lambda a,b: a + '.*?' + b, 'AFTER': lambda a,b: b + '.*?' + a, 'JOIN': lambda a,b: a + '[- ]?' + b}
                ret = tokens[0].generate()
                for operator, operand in zip(tokens[1::2], tokens[2::2]):
                    ret = operator_opn_map[operator](ret, operand.generate()) # this is basically calling a dict element, and every such element requires 2 variables (a&b), so providing them as ret and op.generate
                return ret
    
    ## THE GRAMMAR
        word = ~keyword + Word(alphas, alphanums+'-_+/()')
        uptowords_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE).setParseAction(UpToWordsNode)
        uptochars_expr = Group(LBRACE + UPTO + integer("numberofchars") + CHARACTERS + RBRACE).setParseAction(UpToCharactersNode)
        some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
        phrase_item = some_words | uptowords_expr | uptochars_expr
    
        phrase_expr = infixNotation(phrase_item,
                                    [
                                    ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT, BeforeAfterJoinNode), # was not working earlier, because BEFORE etc. were not keywords, and hence parsed as words
                                    (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                    (AND, 2, opAssoc.LEFT, AndNode),
                                    (OR, 2, opAssoc.LEFT, OrNode),
                                    ],
                                    lpar=Suppress('{'), rpar=Suppress('}')
                                    ) # structure of a single phrase with its operators
    
        line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
                          (phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
        #
        line_contents_expr = infixNotation(line_term.setParseAction(LineTermNode),
                                           [(AND, 2, opAssoc.LEFT, LineAndNode),
                                            (OR, 2, opAssoc.LEFT, LineOrNode),
                                            ]
                                           ) # grammar for the entire rule/sentence
    ######################################
        mrrlist=[]
        for t in hrrsents:
            t = t.strip()
            if not t:
                continue
            try:
                parsed = line_contents_expr.parseString(t)
            except ParseException as pe:
                print(' '*pe.loc + '^')
                print(pe)
                continue

            temp_regex = parsed[0].generate()
            final_regexes3 = re.sub(r'gene','%s',temp_regex) # this can be made more precise by putting a condition of [non-word/^/$] around the 'gene'
            mrrlist.append(final_regexes3)
        return mrrlist
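
    A minimal usage sketch for the function above (the sample rules here are made up for illustration; they are not taken from the actual rule list):

    if __name__ == '__main__':
        # hypothetical sample rules, just to exercise parse_hrr()
        sample_rules = [
            "LINE_CONTAINS this is a phrase",
            "LINE_CONTAINS the objective of this study was to {identify OR identification} upregulated genes",
        ]
        for rx in parse_hrr(sample_rules):
            print(rx)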
    
    Answer (by @Paul):

    You have a two-tiered grammar here, so you are best off focusing on just one tier at a time; we have already covered this tier in some of your other questions. The lower tier is the phrase_expr, which will later be an argument to the line_directive_expr. So define samples of phrase expressions first - extract them from your full list of sample statements. Your finished BNF for a phrase will have recursion at its lowest level, something like:

    phrase_atom ::= <one or more types of terminal items, like words or characters
                     or quoted strings, or *possibly* expressions of numbers of
                     words or characters>  |  brace + phrase_expr + brace
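
    As a rough illustration only (a sketch, not code from the answer), this recursion through braces can be expressed in pyparsing with a Forward placeholder; the names below simply mirror the BNF and leave out the operator levels:

    from pyparsing import Forward, Word, OneOrMore, Suppress, alphas, alphanums

    phrase_expr = Forward()
    word = Word(alphas, alphanums + '-_+/')                                        # a terminal item: a plain word
    phrase_atom = OneOrMore(word) | (Suppress('{') + phrase_expr + Suppress('}'))  # recursion via braces
    phrase_expr <<= OneOrMore(phrase_atom)                                         # a phrase is one or more atoms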
    
    Then when you convert to pyparsing, add Groups and results names a little at a time! Don't group or name everything right away. Usually I suggest being liberal with results names, but in an infix-notation grammar a lot of results names can clutter up the results. Let the Groups (and eventually the node classes) take care of the structuring, and the behavior in the node classes will guide you to the results names you want. Since the structure of the result classes is usually fairly simple, it is often easier to do list unpacking in the class init or evaluate method. Work up from the simple expressions to the complex ones. (Look at one of your examples: it is one of your simplest test cases, yet you have it as #97?) Just sorting that list in order of length would be a good rough cut, or sorting by increasing number of operators. You will have to handle the complex cases too, but only after the simple ones are working. The upper tier, built on top of phrase_expr, would then look something like:
    
    line_directive_item ::= line_directive phrase_expr | brace line_directive_expr brace
    line_directive_and ::= line_directive_item (AND line_directive_item)*
    line_directive_or ::= line_directive_and (OR line_directive_and)*
    line_directive_expr ::= line_directive_or
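
    A minimal pyparsing sketch of that upper tier, assuming line_directive keywords and a phrase_expr defined as in the MCVE above (illustrative only, not code from the answer):

    from pyparsing import Forward, Group, Keyword, Suppress, ZeroOrMore

    line_directive_expr = Forward()
    line_directive = Keyword("LINE_CONTAINS") | Keyword("PARA_STARTSWITH")
    line_directive_item = (Group(line_directive + phrase_expr)
                           | (Suppress('{') + line_directive_expr + Suppress('}')))
    line_directive_and = line_directive_item + ZeroOrMore(Keyword("AND") + line_directive_item)
    line_directive_or = line_directive_and + ZeroOrMore(Keyword("OR") + line_directive_and)
    line_directive_expr <<= line_directive_or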