Python: replacing a recurring substring with a regular expression?


python regex


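I am trying to remove the table descriptions from the text below so that only the non-table text remains. I have been experimenting on regex101.com, but I cannot find a pattern that actually does this (it always consumes the whole section). What am I missing?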

TABLE 37-1 Text over multiple lines that describes the table (.pdf)

Non table text


TABLE 37-2 Text over multiple lines that describes the table (.pdf)
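The question does not show the pattern that was tried, so the following is only a guess at the symptom described ("it always consumes the whole section"). A greedy quantifier combined with `re.DOTALL` behaves exactly that way: it runs to the last "(.pdf)" in the text rather than the first. The sample string and the `greedy` pattern below are assumptions made for illustration.

```python
import re

# Sample text modelled on the question's example.
text = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

# Hypothetical greedy attempt: with DOTALL, ".*" crosses newlines and
# only backtracks as far as the LAST "(.pdf)" in the text.
greedy = re.search(r"TABLE.*\(\.pdf\)", text, re.DOTALL)

# The single match spans the entire text, non-table line included.
whole_text_matched = greedy.group(0) == text
print(whole_text_matched)
```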

This extracts the text you want rather than replacing the unwanted text with an empty string:

>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'

It should also work when the non-table text itself contains a "TABLE…(.pdf)" string:

>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
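One caveat with the one-liner above: `re.match` returns `None` when the text does not start with a table description or does not end with one, and calling `.group(1)` on `None` raises `AttributeError`. A minimal sketch of a safer wrapper follows. The function name `extract_non_table_text` and the constant `PATTERN` are made up for this example; the pattern itself is the one from the answer.

```python
import re

# Same pattern as in the answer, compiled once; DOTALL lets "." span newlines.
PATTERN = re.compile(r"TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$", re.DOTALL)


def extract_non_table_text(text):
    """Return the text between the first and the last table description.

    Returns None when the text does not begin with a table description
    and end with one, instead of raising AttributeError.
    """
    match = PATTERN.match(text)
    return match.group(1) if match else None


sample = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text line 1.\n"
    "Non table text line 2.\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

print(repr(extract_non_table_text(sample)))
print(extract_non_table_text("no table descriptions here"))
```

Because the middle group is greedy, this keeps everything between the first and the last description, so it suits input with exactly one leading and one trailing table description.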


Show the input and the expected output.

Does this answer your question?

Make the regex non-greedy by adding ? after (.|\n)*, i.e. (TABLE)(.|\n)*?(\(.pdf\))

@Nick This still removes the whole block of text instead of stopping at the first "(.pdf)".

@user3495364
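For reference, here is a sketch of the non-greedy suggestion from the comments, applied with `re.sub` to the sample text from the answer. On this sample, at least, the lazy quantifier does stop at the first "(.pdf)" after each "TABLE"; real input may differ, which could explain the reply above. The pattern is a slight variation on the commenter's, not a quotation: the dot in ".pdf" is escaped, the group is non-capturing, and an optional trailing `\n` is added so the line break after a description is removed too.

```python
import re

text = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text line 1.\n"
    "Non table text line 2.\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

# Lazy (?:.|\n)*? expands one character at a time, so each match ends
# at the first "(.pdf)" that follows a "TABLE".
pattern = r"TABLE(?:.|\n)*?\(\.pdf\)\n?"

cleaned = re.sub(pattern, "", text)
print(repr(cleaned))
```

Unlike the extraction approach in the answer, this substitution also removes any non-table passage that happens to start with "TABLE" and contain "(.pdf)", so it only fits input where that cannot occur.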