Python: replace a recurring substring with a regular expression?
I am trying to remove the table descriptions from the text below so that only the non-table text is left. I have been experimenting on regex101.com, but I cannot find a pattern that actually does this (it always consumes the whole section). What am I missing?
TABLE 37-1 Text over multiple
lines that describes the table (.pdf)
Non table text
TABLE 37-2 Text over multiple
lines that describes the table (.pdf)
This extracts the text you want instead of replacing the unwanted text with an empty string:
>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'
It should also work if the non-table text itself contains a "TABLE ... (.pdf)" string:
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
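The session above can be sketched as a small reusable function. The name extract_between_tables is hypothetical, and returning None when nothing matches is an assumption added here, since re.match returns None on failure and calling .group on it would raise AttributeError. The pattern is the one from the answer: the first TABLE.*? is lazy and stops at the first "(.pdf)" followed by a newline, while the greedy (.*) backtracks only as far as the last TABLE from which the rest of the pattern can still reach the end of the string.

```python
import re

# Same pattern as in the session above: leading description, captured body,
# trailing description anchored at the end of the text.
PATTERN = re.compile(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', re.DOTALL)

def extract_between_tables(text):
    """Return the text between the first and last table description, or None."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

text = '''TABLE 37-1 Text over multiple
lines that describes the table (.pdf)
Non table text line 1.
Non table text line 2.
TABLE 37-2 Text over multiple
lines that describes the table (.pdf)'''

print(repr(extract_between_tables(text)))
```

This only fits input shaped like the examples, with exactly one description at the start and one at the end; a description in the middle of the text would be kept as part of the captured group.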
Show the input and the expected output.
Does this answer your question?
Make the regex non-greedy by adding ? after (.|\n)*, i.e. (TABLE)(.|\n)*?(\(.pdf\))
@Nick That still removes the entire block of text rather than stopping at the first "(.pdf)". @user3495364
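For the replacement the question originally asked about, here is a minimal sketch of the non-greedy idea from the comments, using re.sub. It assumes that every table description starts with TABLE at the beginning of a line and ends at the first "(.pdf)" that closes a line (or the whole text). The name strip_table_captions and the exact pattern are illustrative assumptions, not something taken from the answer or the comments.

```python
import re

# Assumption: a table description starts with "TABLE" at the beginning of a
# line and ends at the first "(.pdf)" that closes a line (or the whole text).
CAPTION = re.compile(
    r'^TABLE\b.*?\(\.pdf\)[ \t]*(?:\n|\Z)',
    re.DOTALL | re.MULTILINE,
)

def strip_table_captions(text):
    """Hypothetical helper: delete every table description from text."""
    return CAPTION.sub('', text)

text = '''TABLE 37-1 Text over multiple
lines that describes the table (.pdf)
Non table text line 1.
Non table text line 2.
TABLE 37-2 Text over multiple
lines that describes the table (.pdf)'''

# Greedy version for comparison: .* runs on to the last "(.pdf)" in the text,
# so the whole section is consumed.
greedy = re.sub(r'TABLE.*\(\.pdf\)', '', text, flags=re.DOTALL)

print(repr(greedy))
print(repr(strip_table_captions(text)))
```

The greedy call leaves an empty string, while the lazy pattern keeps the two non-table lines. Note the limitation: non-table text that itself starts a line with TABLE and ends a line with "(.pdf)" looks identical to a description under this pattern and is removed as well.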