Matching a large number of strings containing spaces in pyparsing

python performance parsing


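I need to write a matcher for the following expression:

a + names + c

where names matches one entry from a list of strings, names_list.

There are two complications:

Entries in names_list can contain spaces.

Matching needs to be case-insensitive.

names_list is also fairly large (~20,000 entries).

I tried: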
import pyparsing as pp

# One caseless Keyword per entry, combined into a single alternation.
names_kw_list = [pp.Keyword(name, caseless=True) for name in names_list]
names = pp.Or(names_kw_list)

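This doesn't work for entries that contain spaces, and I'm worried it isn't a very efficient way to write it.
Is there a way to make it work with spaces in the entries, or to make it run faster?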

Partial answer:
Using the stopOn argument correctly solves the problem with spaces:
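import pyparsing as pp

def last_occurrence_of(expr):
    # Match expr only if no further occurrence of expr follows in the input.
    return expr + ~pp.FollowedBy(pp.SkipTo(expr))

# One caseless keyword per entry; names_list is the list from the question.
names_kw_list = [pp.Keyword(word, caseless=True) for word in names_list]
names = pp.Or(names_kw_list)("names")
# Stop collecting words for "A" where the last occurrence of a name begins.
a = pp.OneOrMore(pp.Word(pp.alphas), stopOn=last_occurrence_of(names))("A")
c = pp.OneOrMore(pp.Word(pp.nums))("C")

expr = a + names + c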

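This tells a not to consume the words that belong to names.
However, performance degrades, because the long list of names is now also evaluated inside the stopOn expression.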
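
One possible way to ease the performance concern is to compile the whole list into a single case-insensitive regular expression instead of building ~20,000 Keyword objects. The sketch below uses only the standard re module; build_names_pattern and the sample names_list are illustrative and not part of the original post. It assumes every entry starts and ends with a word character, so that \b behaves as a word boundary. The resulting pattern string could then be passed to pp.Regex(pattern, flags=re.IGNORECASE) for use inside a pyparsing grammar, although the speed-up has not been measured here.

```python
import re

def build_names_pattern(names_list):
    """Build one alternation pattern that matches any entry as whole words."""
    # Longest entries first, so "New York City" is preferred over "New York".
    ordered = sorted({n for n in names_list if n.split()}, key=len, reverse=True)
    # Escape each word and allow any run of whitespace between words.
    alternatives = [r"\s+".join(re.escape(w) for w in name.split())
                    for name in ordered]
    return r"\b(?:" + "|".join(alternatives) + r")\b"

names_list = ["New York", "New York City", "Paris", "San Francisco"]  # sample data
names_re = re.compile(build_names_pattern(names_list), re.IGNORECASE)

match = names_re.search("12 main street NEW   york city 10001")
print(match.group(0))
```

Sorting by length matters because a regex alternation takes the first alternative that matches, not the longest one.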
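
Comments:

By the way, this is not a typical approach to parsing. Usually, when part of the input can match many possible values, the values themselves are delimited by start/end markers (such as quotes, braces, etc.). Also, your list of possible values may grow over time, so you would have to keep updating your parser to reflect the new values. Using a delimited expression means your parser automatically handles changes to the list of valid values. If necessary, add a condition to the base expression that validates the parsed value.

The names values are not delimited by markers. Rather, the names are the markers: they are the cities in addresses (the number of cities per country is quite small: tens of thousands, not more), and they "split" the address into a part before the city and a part after it. The semantics of these parts are usually very different, so it helps the parser to know whether it is parsing information before or after the city.

Address parsing is inherently a hard problem. I once attempted a street address parser, but it was probably only a 60-70% solution. Good luck!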
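
The two ideas in the comments above (parse generically and validate afterwards, and treat the name as the marker that splits the address) can be combined in plain Python with a set lookup, which stays cheap even for tens of thousands of entries. This is only a minimal sketch, assuming whitespace-separated tokens and no punctuation attached to the names; split_on_last_name and the sample data are hypothetical and do not come from the original post.

```python
def split_on_last_name(text, names_list):
    """Split text into (before, name, after) token lists around the last known name.

    Returns None when no entry of names_list occurs in text.
    """
    # Normalise entries once: case-folded, single spaces between words.
    known = {" ".join(n.casefold().split()) for n in names_list if n.split()}
    max_words = max((len(k.split()) for k in known), default=0)
    tokens = text.split()
    # Scan end positions right to left so the last occurrence wins;
    # at each end position, try the longest candidate first.
    for end in range(len(tokens), 0, -1):
        for size in range(min(max_words, end), 0, -1):
            start = end - size
            candidate = " ".join(tokens[start:end]).casefold()
            if candidate in known:
                return tokens[:start], tokens[start:end], tokens[end:]
    return None

names_list = ["New York", "York", "Paris"]  # sample data
result = split_on_last_name("Hotel Paris Main Street new   YORK 10001", names_list)
print(result)
```

Each lookup is a hash-set membership test, so the cost depends on the number of tokens and the longest entry (in words) rather than on the size of names_list.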