Python: Using pyparsing to parse a word escape-split over multiple lines
I'm trying to parse words that can be split across multiple lines with a backslash-newline combination ("\\n"). Here is what I have done:
from pyparsing import *
continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))
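After poking around a bit more, I came upon this notable point: "I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of 'one or more' or 'zero or more' or 'optional'." With that, I got the idea to change these two lines: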
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
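The replacement for these two lines is not shown in this copy of the post. A minimal sketch of one iterative form, assuming ZeroOrMore was the construct chosen (an assumption, not confirmed by the text), is:

```python
from pyparsing import Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Assumed iterative form: any number of backslash-terminated pieces,
# followed by one final piece that has no trailing backslash.
multi_line_word = ZeroOrMore(split_word) + word

text = 'super\\\ncali\\\nfragi\\\nlistic'
tokens = multi_line_word.parseString(text).asList()
print(tokens)  # ['super', 'cali', 'fragi', 'listic']
```

The repetition replaces the recursion, so there is no longer an ordered choice in which the plain word can win before the split form is tried.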
This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].
Next, I added a parse action that joins these tokens together:
multi_line_word.setParseAction(lambda t: ''.join(t))
This gives a final output of ['supercalifragilistic'].
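The take-home message I learned is that one does not simply...
Just kidding.
The take-home message is that you can't simply implement a one-to-one translation of BNF in pyparsing. Some of its iterative constructs should be brought into play.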
Edit 2009-11-25: To cope with more strenuous test cases, I modified the code to the following:
no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))
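This has the benefit of making sure that no space comes between any of the elements (with the exception of the newline after an escaping backslash).
Your code is very close. Any of these mods would work: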
# '|' means MatchFirst, so you had a left-recursive expression
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)
# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))
# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)
# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))
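As a quick check of these alternatives, the sketch below runs three of them on the sample input from the question. It uses `<<=` to bind the Forward in place of the `<<` shown above, and the variable names (`recursive`, `listed_tokens`, `combined_tokens`) are illustrative only.

```python
from pyparsing import (Combine, Forward, Literal, Suppress, Word, alphas,
                       delimitedList, lineEnd)

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
text = 'super\\\ncali\\\nfragi\\\nlistic'

# Recursive form with the longer alternative tried first; with the original
# order, the plain word matches first and the recursive branch is never tried.
recursive = Forward()
recursive <<= (split_word + recursive) | word
recursive_tokens = recursive.parseString(text).asList()

# delimitedList treats the backslash-newline as a (suppressed) delimiter.
listed_tokens = delimitedList(word, continued_ending).parseString(text).asList()

# Combine joins the matched pieces into a single token.
combined_tokens = Combine(
    delimitedList(word, continued_ending)).parseString(text).asList()

print(recursive_tokens)  # ['super', 'cali', 'fragi', 'listic']
print(listed_tokens)     # ['super', 'cali', 'fragi', 'listic']
print(combined_tokens)   # ['supercalifragilistic']
```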
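Using Combine also enforces that there is no intervening whitespace. Interesting. I tried multi_line_word = Combine(Combine(OneOrMore(split_word)) + Optional(word)), but it breaks on the 'sh\\\n iny' case: it doesn't raise an exception and instead returns ['sh']. Am I missing something?
Well, your word is not just letters spanning a backslash-newline; there is a space before the letter 'i', which counts as a word break, so Combine stops after 'sh'. You can modify Combine with the adjacent=False constructor argument, but beware: you may end up consuming the whole file as a single word! Or, if you also want to collapse any leading whitespace, you can redefine continued_ending to include any whitespace after the lineEnd.
I'd much prefer that multi_line_word.parseString('sh\\\n iny') raise ParseException rather than identify 'sh' as its token. In this case, 'sh' and 'iny' are two words, not parts of one broken word, because the 'iny' part is not contiguous with the EOL. So multi_line_word should not recognize it. It should throw up its hands and say, "That's not a valid broken word!"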
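The stricter grammar from the 2009-11-25 edit appears to behave this way. The sketch below reuses that grammar unchanged and wraps it in a small helper, try_parse, which is a hypothetical name introduced here only for illustration.

```python
from pyparsing import (Literal, NotAny, OneOrMore, Optional, ParseException,
                       Suppress, White, Word, alphas, lineEnd)

# Grammar as given in the 2009-11-25 edit.
no_space = NotAny(White(' \t\r'))
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))


def try_parse(text):
    """Return the joined word, or None when the grammar rejects the input."""
    try:
        return multi_line_word.parseString(text)[0]
    except ParseException:
        return None


good = try_parse('super\\\ncali\\\nfragi\\\nlistic')
bad = try_parse('sh\\\n iny')  # space after the newline breaks contiguity
print(good)  # supercalifragilistic
print(bad)   # None
```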