Using pyparsing to parse a word escape-split over multiple lines


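I'm trying to use pyparsing to parse words that can be split across multiple lines with a backslash-newline combination ("\\n"). Here is what I have done: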

from pyparsing import *

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

# '|' (MatchFirst) tries `word` first, so this prints only ['super']
print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))

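After poking around a bit more, I came across this notable remark:

I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...

With that, I got the idea to change these two lines: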

multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
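The replacement for those two lines is not shown here. A minimal sketch of what it could look like, assuming the iterative form `ZeroOrMore(split_word) + word` (an assumption suggested by the "zero or more" hint in the quote, not something the text above confirms):

```python
from pyparsing import Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Assumed replacement: iterate over the escaped fragments instead of recursing
multi_line_word = ZeroOrMore(split_word) + word

tokens = multi_line_word.parseString('super\\\ncali\\\nfragi\\\nlistic').asList()
print(tokens)
```

The iteration consumes every fragment that ends in a backslash-newline, and the trailing `word` picks up the final fragment.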

This got it to output what I was looking for:

['super', 'cali', 'fragi', 'listic']

Next, I added a parse action that joins these tokens together:

multi_line_word.setParseAction(lambda t: ''.join(t))

This gives the final output:

['supercalifragilistic']

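The take-home message I learned is that one does not simply ...

Just kidding.

The take-home message is that one cannot simply implement a one-to-one translation of BNF with pyparsing. Some tricks using the iterative types (such as OneOrMore and Optional) should be called into use.

Edit 2009-11-25: To cope with more strenuous test cases, I modified the code to the following: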

no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))

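This has the benefit of making sure that no whitespace comes between any of the elements (with the exception of the newline after an escaping backslash).

Your code is pretty close. Any of these mods would work: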
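As a rough check of that claim, the sketch below runs the stricter grammar on one cleanly continued word and on one input with a space after the escaped newline. The sample strings and the `joined` and `rejected` names are only for illustration.

```python
from pyparsing import (Literal, NotAny, OneOrMore, Optional, ParseException,
                       Suppress, White, Word, alphas, lineEnd)

no_space = NotAny(White(' \t\r'))
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))

# A cleanly continued word is joined into a single token
joined = multi_line_word.parseString('super\\\ncali\\\nfragi\\\nlistic').asList()
print(joined)

# A space after the escaped newline should make the whole match fail
try:
    multi_line_word.parseString('sh\\\n iny')
    rejected = False
except ParseException:
    rejected = True
print(rejected)
```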

# '|' means MatchFirst, so you had a left-recursive expression
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)

# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))

# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)

# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))
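To compare the four variants side by side, here is a small sketch that runs each one against the sample input from the question. The names `reordered`, `longest`, `as_list`, `combined` and `results` are illustrative only.

```python
from pyparsing import (Combine, Forward, Literal, Suppress, Word, alphas,
                       delimitedList, lineEnd)

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
sample = 'super\\\ncali\\\nfragi\\\nlistic'

# 1. Recursive alternative first, plain word as the fallback
reordered = Forward()
reordered << ((split_word + reordered) | word)

# 2. Longest-match alternation
longest = Forward()
longest << (word ^ (split_word + longest))

# 3. The backslash-newline acts as the (suppressed) list delimiter
as_list = delimitedList(word, continued_ending)

# 4. Combine concatenates the adjacent fragments into one token
combined = Combine(delimitedList(word, continued_ending))

results = {
    name: expr.parseString(sample).asList()
    for name, expr in [('reordered', reordered), ('longest', longest),
                       ('as_list', as_list), ('combined', combined)]
}
for name, tokens in results.items():
    print(name, tokens)
```

The first three return the fragments as separate tokens, while the Combine version returns a single joined string without needing a parse action.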

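Using Combine also enforces that there is no intervening whitespace. Interesting. I tried multi_line_word = Combine(Combine(OneOrMore(split_word)) + Optional(word)), but it breaks on the 'sh\\\n iny' case: it does not raise an exception and instead returns ['sh']. Am I missing something?

Well, your word is not just letters spanning the backslash-newline; it has a space before the letter 'i', which counts as a word break, so Combine stops after 'sh'. You can modify Combine with the adjacent=False constructor argument, but beware, you may end up slurping in the whole file as a single word! Or, if you also want to collapse any leading whitespace, you can redefine continued_ending to include any whitespace after the lineEnd.

I would prefer that multi_line_word.parseString('sh\\\n iny') raise ParseException rather than identify 'sh' as its token. In this case, 'sh' and 'iny' are two words, not parts of one broken word, because the 'iny' part is not contiguous with the EOL. So multi_line_word should not recognize it. It should throw up its hands and say, "This is not a valid broken word!"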
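The behaviour discussed in these comments can be reproduced with a short sketch that uses the nested Combine expression quoted above. The `partial` variable is just an illustrative name.

```python
from pyparsing import (Combine, Literal, OneOrMore, Optional, Suppress, Word,
                       alphas, lineEnd)

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Combine(Combine(OneOrMore(split_word)) + Optional(word))

# Combine refuses to skip the space before 'iny', so the match ends after 'sh'
# and parseString returns the partial result instead of raising
partial = multi_line_word.parseString('sh\\\n iny').asList()
print(partial)
```

Because parseString is satisfied with a match at the start of the input, the leftover ' iny' is simply ignored here, which is why the stricter NotAny-based grammar is needed to get an exception.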