Python: replacing repeated substrings with a regular expression?

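I am trying to remove the table descriptions from the text below so that only the non-table text remains. I have been experimenting on regex101.com, but I cannot find a pattern that actually does this (it always consumes the whole section). What am I missing?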

TABLE 37-1 Text over multiple lines that describes the table (.pdf)

Non table text

TABLE 37-2 Text over multiple lines that describes the table (.pdf)


This extracts the text you want to keep instead of replacing the unwanted text with an empty string:

>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> # re.DOTALL lets . cross newlines; group(1) is the text between the first and last caption
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'
It should also work when the non-table text itself contains a "TABLE…(.pdf)" string:

>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
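The question asked about replacing rather than extracting, so here is a minimal sketch of that variant using re.sub. It rests on assumptions that are not in the original post: that every caption starts at the beginning of a line with "TABLE", a number, a hyphen and a number, and that it ends with a "(.pdf)" at the end of a line. The names CAPTION and remove_table_captions are hypothetical.

```python
import re

# Hypothetical pattern (an assumption, not the answer's regex): a caption
# starts a line with "TABLE <n>-<n>" and runs lazily, across newlines, to the
# first "(.pdf)" that is followed by a line break or the end of the string.
CAPTION = re.compile(
    r'^TABLE \d+-\d+.*?\(\.pdf\)[ \t]*(?:\n|\Z)',
    re.MULTILINE | re.DOTALL,
)


def remove_table_captions(text):
    """Return text with every block matching CAPTION deleted."""
    return CAPTION.sub('', text)


text = '''TABLE 37-1 Text over multiple
lines that describes the table (.pdf)
Non table text line 1.
Non table text line 2.
TABLE 37-2 Text over multiple
lines that describes the table (.pdf)'''

print(repr(remove_table_captions(text)))
# 'Non table text line 1.\nNon table text line 2.\n'
```

Unlike the extraction approach, this sketch cannot tell a real caption from ordinary text that happens to look like one: a paragraph such as "TABLE 37-2 non table text that ... (.pdf)(.pdf)" starts and ends the same way and would be deleted as well.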

Show the input and the expected output.
Does this answer your question?
Make the regex non-greedy by adding ? after (.|\n)*, i.e. (TABLE)(.|\n)*?(\(.pdf\))
@Nick That still removes the whole block of text instead of stopping at the first "(.pdf)". @user3495364
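To make the greedy versus non-greedy distinction from the comments concrete, here is a small sketch that runs both quantifiers over the short sample text from the answer. The patterns follow the one suggested in the comments, with the dot in ".pdf" escaped; the variable names are illustrative only. On this sample the lazy form does stop at the first "(.pdf)", so the asker's real input presumably differs from it in some way.

```python
import re

text = '''TABLE 37-1 Text over multiple
lines that describes the table (.pdf)
Non table text line 1.
Non table text line 2.
TABLE 37-2 Text over multiple
lines that describes the table (.pdf)'''

# Greedy: (.|\n)* runs to the LAST "(.pdf)", so one match spans the whole text.
greedy_result = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', '', text)

# Lazy: (.|\n)*? stops at the FIRST "(.pdf)" after each "TABLE".
lazy_result = re.sub(r'(TABLE)(.|\n)*?(\(\.pdf\))', '', text)

print(repr(greedy_result))  # ''
print(repr(lazy_result))    # '\nNon table text line 1.\nNon table text line 2.\n'
```

The lazy version leaves the newline that followed the first caption, so a final strip() or an extra \n? in the pattern may be wanted depending on the desired output.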