Python: word boundaries give unexpected results when the pattern starts or ends with special characters


Suppose I want to check whether the phrase Sortes\index[persons]{Sortes} occurs in the text test Sortes\index[persons]{Sortes} text.
Using Python's re module, I can do this:
>>> import re
>>> search = re.escape(r'Sortes\index[persons]{Sortes}')
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

So I wrapped the pattern in \b, like this:
>>> search = r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'\b'
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)

Now I get no match.
If the search pattern does not contain any of the characters []{}, it works. For example:
>>> re.search(r'\b' + re.escape(r'Sortes\index') + r'\b', r'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>

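The documentation also says this about \b:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.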
So I tried replacing the final \b with (\W|$):
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')

Lo and behold, it works!
What is going on here? What am I missing?
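Look at what a word boundary actually matches.
A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character
After the last character in the string, if the last character is a word character
Between two characters in the string, where one is a word character and the other is not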
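
The three positions above can be checked directly. The following is a minimal sketch; the helper name boundary_positions is made up for illustration and is not part of the re module. It lists every index at which the zero-width word-boundary assertion matches.

```python
import re

def boundary_positions(text):
    """Return the indices in `text` where the word-boundary assertion matches."""
    # Each match is empty, so start() is the position of the boundary itself.
    return [m.start() for m in re.finditer(r'\b', text)]

print(boundary_positions('ab cd'))  # [0, 2, 3, 5]
print(boundary_positions('ab}'))    # [0, 2]  (no boundary after '}')
print(boundary_positions('} {'))    # []      (no word characters at all)
```

Note that the end of 'ab}' is not a boundary: the last character is } and not a word character.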

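In your pattern, }\b matches only if } is followed by a word character (a letter, a digit or _).
With (\W|$), you explicitly require a non-word character or the end of the string.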
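
As a small sketch of that difference, the snippet below compares a literal } followed by \b with a literal } followed by (\W|$) on three inputs. The variable names boundary and explicit are illustrative only.

```python
import re

# The same literal '}' followed by \b, or by an explicit non-word char / end of string.
boundary = re.compile(re.escape('}') + r'\b')
explicit = re.compile(re.escape('}') + r'(\W|$)')

for text in ['{Sortes} text', '{Sortes}text', '{Sortes}']:
    print(repr(text), bool(boundary.search(text)), bool(explicit.search(text)))

# '{Sortes} text' False True   ('}' then space: both are non-word characters)
# '{Sortes}text'  True  False  ('}' then 't': a real word boundary)
# '{Sortes}'      False True   ('}' at the end of the string)
```

The \b variant succeeds exactly where the phrase is glued to a following word, which is the opposite of what the question wants.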
In cases like this, I always recommend unambiguous word boundaries based on negative lookarounds:
>>> re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
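
The lookaround approach can be wrapped in a small helper. This is only a sketch under the assumption that the phrase should match literally; the function name whole_phrase_pattern is hypothetical.

```python
import re

def whole_phrase_pattern(phrase):
    """Compile a pattern matching `phrase` literally, not glued to word characters."""
    # (?<!\w): no word character directly before; (?!\w): none directly after.
    return re.compile(r'(?<!\w)' + re.escape(phrase) + r'(?!\w)')

pat = whole_phrase_pattern(r'Sortes\index[persons]{Sortes}')

print(pat.search(r'test Sortes\index[persons]{Sortes} text').span())  # (5, 34)
print(pat.search(r'test Sortes\index[persons]{Sortes}text'))          # None
print(pat.search(r'testSortes\index[persons]{Sortes} text'))          # None
```

Unlike (\W|$), the lookahead consumes nothing, so the match span covers only the phrase and not the character after it.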

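I think this is the problem you are running into:
\b sits on the boundary between \w and \W, but in your example there is no such boundary. '{Sortes}\b' is a boundary between \W and \W, because } does not match [a-zA-Z0-9_], which is what \w usually stands for.

}, the last character of the pattern, is a non-word character, and so is the space after it. So there is no word boundary and no match. If the last character were s, which is a word character, there would be a word boundary.

I like the suggestion about negative lookarounds. This regex match is in a very hot part of my code, so I am worried about matching performance. Is that an issue with lookarounds?

@Stenskjaer \b is also a zero-width assertion, like any other lookaround. Since these lookaround patterns contain only a single atom, the overhead should be comparable to what you already have with the \bs. If you are worried, you can set up a quick performance test, but this is the only proper regex way to solve the problem that I can think of.

Yes! I just tested it myself. There is no (detectable) difference in performance. Thanks.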
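
A quick performance test of the kind mentioned in the comments could look like the sketch below. The iteration count and variable names are arbitrary choices, and the timings depend on the machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

phrase = r'Sortes\index[persons]{Sortes}'
text = r'test Sortes\index[persons]{Sortes} text'

# Leading \b only (a trailing \b would never match here), versus lookarounds.
with_b = re.compile(r'\b' + re.escape(phrase))
with_lookaround = re.compile(r'(?<!\w)' + re.escape(phrase) + r'(?!\w)')

for name, pat in [('\\b', with_b), ('lookaround', with_lookaround)]:
    seconds = timeit.timeit(lambda: pat.search(text), number=100_000)
    print(f'{name}: {seconds:.3f} s')
```

Both patterns are compiled once up front, so the loop measures only the search itself.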

With only the leading \b, the pattern matches:
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}'), r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

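With the trailing \b replaced by (\W|$), it also matches, and the match includes the trailing space:
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')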
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>
