Regex: removing a large list of strings from text

regex python-3.x

Suppose

txt='Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'

is a large block of text, and I want to remove a large list of strings such as

removalLists=['Daniel Johnson','Ana Hickman']

from it. In other words, I want to replace every element of the list with

' '

I know I can do this easily with a loop, for example

for string in removalLists:
    txt=re.sub(string,' ',txt)

I would like to know whether I can do this faster.

One approach is to build a single regex pattern that is an alternation of all the terms to be replaced. For the example above, that pattern would be:

\bDaniel Johnson\b|\bAna Hickman\b

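To generate this, we can first wrap each term in word boundaries (\b). Then, using | as the separator, we join the list into a single string. Finally, we can use re.sub to replace every occurrence of any term with a single space: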

import re

txt = 'Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'
removalLists = ['Daniel Johnson','Ana Hickman']

# Wrap each term in \b and join the terms with | into one alternation
regex = '|'.join([r'\b' + s + r'\b' for s in removalLists])
# Replace every occurrence of any term with a single space
output = re.sub(regex, " ", txt)

print(output)

  and   are friends. They know each other for a long time.   is a professor and   is writer.
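
As a hedged aside, here is a minimal sketch of a slightly more defensive variant of the same idea. It assumes the terms are meant to be matched literally, so each one is passed through re.escape before joining; otherwise a term such as "Dr. Smith" would also match "DrX Smith", because an unescaped period matches any character. The function name remove_terms and the sample sentence are made up for illustration and are not part of the original answer.

```python
import re

def remove_terms(text, terms, replacement=" "):
    """Replace every whole-word occurrence of any term with `replacement`."""
    if not terms:
        # An empty alternation would match the empty string everywhere.
        return text
    # re.escape keeps metacharacters such as '.' literal.
    pattern = re.compile("|".join(r"\b" + re.escape(t) + r"\b" for t in terms))
    return pattern.sub(replacement, text)

sample = "Dr. Smith met DrX Smith and Ana Hickman."
result = remove_terms(sample, ["Dr. Smith", "Ana Hickman"])
print(result)
```

Compiling the pattern once also helps if the same list is applied to many texts, since the pattern object can then be reused.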

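If I did the performance check correctly, the word boundaries make this take more time than the OP's code. If the items should be surrounded by word boundaries, it is better to put the \b s outside a non-capturing group that alternates between the items:

regex = r'\b(?:' + '|'.join(removalLists) + r')\b'

(The regex performs better than looping over the strings in removalLists, but only as the number of items in the list grows.)

@CertainPerformance I see. Thanks for the comment. I won't take your idea for my answer, but people who want a faster solution may want to use it.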
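
The suggestion in the comment can be sketched roughly as follows. This is only an illustration under assumptions: the names per_term, grouped, and with_loop are invented here, and the timings it prints depend on the machine, the Python version, and above all the size of the list, so no particular ranking is claimed.

```python
import re
import timeit

txt = ('Daniel Johnson and Ana Hickman are friends. They know each other for a '
       'long time. Daniel Johnson is a professor and Ana Hickman is writer.')
removalLists = ['Daniel Johnson', 'Ana Hickman']

# Word boundaries repeated around every alternative (as in the answer).
per_term = re.compile('|'.join(r'\b' + s + r'\b' for s in removalLists))
# Word boundaries hoisted outside one non-capturing group (as in the comment).
grouped = re.compile(r'\b(?:' + '|'.join(removalLists) + r')\b')

def with_loop(text):
    # The original loop from the question: one re.sub call per term.
    for s in removalLists:
        text = re.sub(s, ' ', text)
    return text

# Both regex forms should produce the same text.
same_output = per_term.sub(' ', txt) == grouped.sub(' ', txt)
print(same_output)

# Rough timings only; results vary between runs and environments.
candidates = [
    ('loop', with_loop),
    ('per-term boundaries', lambda t: per_term.sub(' ', t)),
    ('grouped boundaries', lambda t: grouped.sub(' ', t)),
]
for name, fn in candidates:
    print(name, timeit.timeit(lambda: fn(txt), number=2000))
```

With only two terms the differences are likely to be small; the comparison becomes more meaningful when removalLists holds hundreds or thousands of entries.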