Python compound regular expression to extract text between different markers in different documents


python regex


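I have two types of documents that contain text I want to copy and save. In one document type, the interesting text is delimited by markers named TAGSTART and TAGEND. In the other document type, the interesting text is delimited by CORESTART and COREEND. Here are two examples: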

intro intro intro intro intro intro
BEGIN A This is where some text starts
That is not interesting or wanted
CORESTART save text A save text A save text A save text A 
save text A save text A save text A save text A save text A 
save text A COREEND
This is an addendum that is not needed but is just in the way
END A outro outro outro outro outro outro 
outro outro outro outro outro outro outro

intro intro intro intro intro intro
INIT B This is where some text starts
That is not interesting or wanted
TAGSTART B save text B save text B save text B save text B 
save text B save text B save text B save text B save text B 
save text B TAGEND B
This is an addendum that is not needed but is just in the way
TERM B outro outro outro outro outro outro 
outro outro outro outro outro outro outro

This Python script works for the first type of file:

import os
import re
import codecs
# walk the directory tree
rootDir = '.'
for dirName, subdirs, files in os.walk(rootDir):
    #    exclude hidden files and directories
    files = [f for f in files if not f[0] == '.']
    subdirs[:] = [d for d in subdirs if not d[0] == '.']
    for fname in files:
        if fname.endswith(('.txt', '.TXT')):
            #    create the full path
            filename = os.path.join(dirName, fname)
            with codecs.open(filename, encoding='utf-8', errors='ignore') as infile, codecs.open('SAVED.txt', 'a', encoding='utf-8') as outfile:
                stuff = infile.read()
                saveTEXT = '\n' + ''.join(re.findall(r"CORESTART(.+?)COREEND", stuff, re.DOTALL|re.MULTILINE)) + '\n'
                outfile.write(saveTEXT)
                #    the with statement closes both files, so no explicit close() is needed

If I change the regex to

      saveTEXT = '\n' + ''.join(re.findall(r"TAGSTART B(.+?)TAGEND B", stuff, re.DOTALL|re.MULTILINE)) + '\n'

I get what I want from the second type of file. However, the compound regex fails:

      saveTEXT = '\n' + ''.join(re.findall(r"CORESTART|TAGSTART B(.+?)COREEND|TAGEND B", stuff, re.DOTALL|re.MULTILINE)) + '\n'

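It finds nothing. I tried wrapping the original regexes in parentheses, but then I got an error saying that a string was expected but a tuple was found. I also tried using \b in the regex to mark word boundaries, like this: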

       saveTEXT = '\n' + ''.join(re.findall(r"\bCORESTART B\b|\bTAGSTART B\b(.+?)\bCOREEND B\b|\bTAGEND B\b", stuff, re.DOTALL|re.MULTILINE)) + '\n'

But that comes up empty as well. My mind gave out completely when I tried this raw string:

[\bCORESTART\b|\bTAGSTART B\b](.+?)[\bCOREEND\b|\bTAGEND B\b]

Could I get some guidance on what I am overlooking? My brain is broken.

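If you allow some slight deviations (for example, "CORESTART" may be followed by a space + "B" that you do not want in the match), this is the right approach. That is, I suggest adding

(?: B)?

to the

(TAG|CORE)START B(.+?)(\1END B)

regex:

(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)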

You also have to use re.finditer to extract the strings, because re.findall returns the values of all capture groups.
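
To illustrate the difference between the two calls, here is a minimal sketch. The `sample` string is a made-up, shortened stand-in for the two documents, not the asker's real data. It also reproduces the "expected a string but found a tuple" error from the question, which appears when the tuples from `re.findall` are passed to `''.join`.

```python
import re

# Hypothetical, shortened sample containing one block of each marker type.
sample = "x CORESTART save A COREEND y TAGSTART B save B TAGEND B z"

pattern = re.compile(r"(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)", re.DOTALL)

# With several capture groups, findall returns one tuple per match.
tuples = pattern.findall(sample)
print(tuples)

# Joining those tuples directly raises TypeError, as described in the question.
try:
    "".join(tuples)
except TypeError as exc:
    print("join failed:", exc)

# finditer yields match objects, so only the wanted group can be selected.
texts = [m.group(2) for m in pattern.finditer(sample)]
print(texts)
```

Selecting index 1 of each tuple from `re.findall` would give the same strings; `re.finditer` simply makes the choice of group explicit.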

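Note that re.MULTILINE is redundant in this regex: the flag only redefines the behavior of the ^ and $ anchors, making them match at the start and end of each line rather than of the whole string. I therefore removed it from the regex declaration.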
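
The point about re.MULTILINE can be checked with a small sketch. The `text` value and the `^\w+ line` pattern are made up for the demonstration and are not part of the original scripts.

```python
import re

# Hypothetical four-line input with one CORESTART/COREEND block.
text = "first line\nCORESTART keep\nthis COREEND\nlast line"

# The marker pattern has no ^ or $ anchors, so MULTILINE changes nothing.
plain = re.findall(r"CORESTART(.+?)COREEND", text, re.DOTALL)
multi = re.findall(r"CORESTART(.+?)COREEND", text, re.DOTALL | re.MULTILINE)
print(plain == multi)

# With a ^ anchor the flag matters: it lets ^ match at every line start.
without_flag = re.findall(r"^\w+ line", text)
with_flag = re.findall(r"^\w+ line", text, re.MULTILINE)
print(without_flag)
print(with_flag)
```

The flag that does matter here is re.DOTALL, since the text between the markers spans several lines and `.` does not match a newline without it.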

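This approach

r"\b(?:CORESTART B|TAGSTART B)\b(.+?)\b(?:COREEND B|TAGEND B)\b"

is not safe: you can get the text between CORESTART B and TAGEND B. I think BobbleBobble's regex is the right approach if you allow some slight deviations (for example, "CORESTART" may be followed by a space + "B" that you do not want in the match). You also have to use re.finditer to extract the strings.

Wow! It works. Yes, I see your concern. Also, if a marker is missing or misspelled in a document, the regex skips that document entirely. Finally, and interestingly, it appends the saved text in reverse order: the text from document B is saved first, then the text from document A. If you post your comment as an answer, I will give you credit.

Who is BobbleBobble?

Yeah, I did not try much, but it works :) Yours will do, but it also matches a TAGSTART B paired with a TAGEND C.

@BobbleBobble: Anyway, your idea is right. I just focused on the grouping first, while you started analyzing the pattern by eye and did it really well.

This has to be one of the most positive and satisfying experiences I have had with this. Thank you both. That tells you how much I was struggling, but it also tells you how valuable clear, concise answers are. Can you tell me who to vote for? Please don't say Trump (Hitler?).

I'm not a US citizen. Please vote for whoever you think deserves it.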
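
The safety concern from the comments, that an alternation of start markers can pair with the wrong end marker, can be checked with a small sketch. The `mixed` string is a made-up malformed input in which a CORESTART B block contains a stray TAGEND B.

```python
import re

# Hypothetical malformed input: a CORE block with a stray TAGEND B inside it.
mixed = "CORESTART B text one TAGEND B filler COREEND B"

# Alternation: any start marker may pair with any end marker.
alternation = re.compile(
    r"\b(?:CORESTART B|TAGSTART B)\b(.+?)\b(?:COREEND B|TAGEND B)\b", re.DOTALL
)
# Backreference: \1 forces the end marker to repeat the prefix of the start marker.
backref = re.compile(r"(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)", re.DOTALL)

loose = [m.group(1) for m in alternation.finditer(mixed)]
strict = [m.group(2) for m in backref.finditer(mixed)]

print(loose)   # stops at the first end marker of either kind
print(strict)  # runs on to the COREEND that matches CORESTART
```

Whether the stray TAGEND B should end the block or be treated as ordinary text depends on the documents. The backreference version only guarantees that start and end markers are of the same kind.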

(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)

import re
p = re.compile(r'(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)', re.DOTALL)
test_str = "intro intro intro intro intro intro\nBEGIN A This is where some text starts\nThat is not interesting or wanted\nCORESTART save text A save text A save text A save text A \nsave text A save text A save text A save text A save text A \nsave text A COREEND\nThis is an addendum that is not needed but is just in the way\nEND A outro outro outro outro outro outro \noutro outro outro outro outro outro outro \n.\n\nintro intro intro intro intro intro\nINIT B This is where some text starts\nThat is not interesting or wanted\nTAGSTART B save text B save text B save text B save text B \nsave text B save text B save text B save text B save text B \nsave text B TAGEND B\nThis is an addendum that is not needed but is just in the way\nTERM B outro outro outro outro outro outro \noutro outro outro outro outro outro outro "
print([x.group(2) for x in p.finditer(test_str)])
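
For completeness, here is one possible way to combine the regex above with the directory walk from the question. This is a sketch under assumptions, not the asker's final script. The function names `extract_marked_text` and `collect_from_tree` are hypothetical, the built-in `open` replaces `codecs.open`, and files are visited in sorted order because `os.walk` does not guarantee any particular order (which may explain the reversed output mentioned in the comments).

```python
import os
import re

# \1 ties each opening marker to the closing marker of the same kind.
MARKED = re.compile(r"(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)", re.DOTALL)


def extract_marked_text(text):
    """Return the text found between matching START/END markers."""
    return [m.group(2) for m in MARKED.finditer(text)]


def collect_from_tree(root_dir, out_path):
    """Append the marked text of every .txt file under root_dir to out_path."""
    with open(out_path, "a", encoding="utf-8") as outfile:
        for dir_name, subdirs, files in os.walk(root_dir):
            # Prune hidden directories in place and fix the traversal order.
            subdirs[:] = sorted(d for d in subdirs if not d.startswith("."))
            for fname in sorted(files):
                if fname.startswith(".") or not fname.endswith((".txt", ".TXT")):
                    continue
                path = os.path.join(dir_name, fname)
                # Skip the output file itself if it lives inside the tree.
                if os.path.abspath(path) == os.path.abspath(out_path):
                    continue
                with open(path, encoding="utf-8", errors="ignore") as infile:
                    found = extract_marked_text(infile.read())
                outfile.write("\n" + "".join(found) + "\n")
```

A call such as `collect_from_tree('.', 'SAVED.txt')` would mirror the original script. A file with a missing or misspelled marker still contributes only an empty entry, as noted in the comments.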