Python: how to extract text between matching strings (including the matching strings and lines)

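I am using Python to extract specific strings that lie between matching strings. The strings come from a list, which is itself generated dynamically by a separate Python function. The list I am working with looks like this:-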

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
The output I want looks like this:-

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output
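As you can see, I want to extract the text/lines that start with line1 and end with line3 (up to the end of the line). The final output should include the matching words (i.e. line1 and line3).

The code I have tried is:-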

import re

# Convert list to string first
list_to_str = '\n'.join(sample_list)
# Get desired output
print(re.findall('\nline1(.*?)\nline2(.*?)\nline3($)', list_to_str, re.DOTALL))
This is the output I get:-

Any help is appreciated.

Edit1:- I did some more work and found the closest solution so far:-

matches = (re.findall(r"^line1(.*)\nline2(.*)\nline3(.*)$", list_to_str, re.MULTILINE))

for match in matches:
    print('\n'.join(match))
It gave me this output:-

 this line is the first line
 this line is second line to be included in output
 this is the third and it should also be included in output
 this may contain other strings as well
 this line is second line to be included in output...
 this is the third should also be included in output

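The output is almost correct, but it does not include the matching text.

If you are looking for a sequence of line 1, 2 and 3 with no repeated lines in between, this is it: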

line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*

Explanation

 line1 .* \s*             # line 1 plus newline(s)
 (?! \s | line [13] )     # Next cannot be line 1 or 3 (or whitespace)
 line2 .* \s*             # line 2 plus newline(s)
 (?! \s | line [12] )     # Next cannot be line 1 or 2 (or whitespace)
 line3 .*                 # line 3 
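
As a rough sketch of how the pattern above could be applied to the list from the question, the expanded form can be passed to Python's re module almost unchanged with the re.VERBOSE flag. The names text, pattern and blocks are made up for this illustration, and joining the list with newlines is taken from the question's own attempt. Because the pattern has no capture groups, findall() returns each whole three-line match, matching words included.

```python
import re

sample_list = ['line1 this line a first line',
               'line1 this line is also considered as line one...',
               'line1 this line is the first line',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 this contain other strings',
               'line1 this may contain other strings as well',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 what the heck is it...'
               ]

# Join the list into one newline-separated string, as in the question
text = '\n'.join(sample_list)

# re.VERBOSE ignores whitespace in the pattern and allows # comments
pattern = re.compile(r"""
    line1 .* \s*          # line 1 plus newline(s)
    (?! \s | line[13] )   # next cannot be line 1 or 3 (or whitespace)
    line2 .* \s*          # line 2 plus newline(s)
    (?! \s | line[12] )   # next cannot be line 1 or 2 (or whitespace)
    line3 .*              # line 3
""", re.VERBOSE)

# No capture groups, so findall() returns the full text of each match
blocks = pattern.findall(text)
for block in blocks:
    print(block)
```

On the sample list this prints the six lines of the desired output, with the line1/line2/line3 prefixes intact.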

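If you want to capture the line contents, just put capture groups around each .* like this:
(.*)

This may not be the clearest way to do it (you may prefer to use a regular expression), but it outputs what you want: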

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
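output = []
# Lines collected for the block currently being matched
line1 = ""
line2 = ""
line3 = ""
# Prefix of the last accepted line ("" means no block is in progress)
prevStart = ""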
for text in sample_list:
    if prevStart == "":
        if text.startswith("line1"):
            prevStart = "line1"
            line1 = text
    elif prevStart == "line1":
        if text.startswith("line2"):
            prevStart ="line2"
            line2 = text
        elif text.startswith("line1"):
            line1 = text
            prevStart = "line1"
        else:
            prevStart = ""
    elif prevStart == "line2":
        if text.startswith("line3"):
            prevStart = ""
            line3 = text
        else:
            prevStart = ""
    if line1 != "" and line2 != "" and line3 != "":
        output.append(line1)
        output.append(line2)
        output.append(line3)
        line1 = ""
        line2 = ""
        line3 = ""

for line in output:
    print(line)
The output of this code is:

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output
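
A shorter variant of the same idea, offered only as a sketch and not as part of the answer above: instead of tracking state, look at every run of three consecutive lines and keep the runs whose prefixes are line1, line2, line3 in that order. The names prefixes and window are made up for this illustration. Note that startswith("line1") would also accept a prefix such as "line10", which the sample data does not contain.

```python
sample_list = ['line1 this line a first line',
               'line1 this line is also considered as line one...',
               'line1 this line is the first line',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 this contain other strings',
               'line1 this may contain other strings as well',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 what the heck is it...'
               ]

prefixes = ("line1", "line2", "line3")
output = []

# Slide a window of three consecutive lines over the list
for window in zip(sample_list, sample_list[1:], sample_list[2:]):
    # Keep the window only if its lines start with line1, line2, line3 in order
    if all(line.startswith(prefix) for line, prefix in zip(window, prefixes)):
        output.extend(window)

for line in output:
    print(line)
```

For the sample list this produces the same six lines as the code above.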

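You just need to iterate over the list and check whether each value .startswith('line1'), or 'line2', and so on. But you cannot capture "line1", "line2" and "line3" all at once.
By "matching text", if you mean that findall() does not include group 0 in the output array, just add a capture group around the whole regex.
Example:
(^line1(.*)\nline2(.*)\nline3(.*))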
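
To illustrate the suggestion of wrapping the whole regex in a capture group, here is a minimal sketch built on the Edit1 code from the question. The trailing $ and the re.MULTILINE flag are carried over from that code; the names pattern and whole_blocks are made up for this illustration. With groups present, findall() returns one tuple per match, and the outer group comes first in each tuple.

```python
import re

sample_list = ['line1 this line a first line',
               'line1 this line is also considered as line one...',
               'line1 this line is the first line',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 this contain other strings',
               'line1 this may contain other strings as well',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 what the heck is it...'
               ]
list_to_str = '\n'.join(sample_list)

# The outer parentheses make the whole three-line block group 1;
# the inner groups still capture the text after each prefix
pattern = r"(^line1(.*)\nline2(.*)\nline3(.*)$)"
matches = re.findall(pattern, list_to_str, re.MULTILINE)

# Each match is a tuple: (whole block, rest of line1, rest of line2, rest of line3)
whole_blocks = [match[0] for match in matches]
for block in whole_blocks:
    print(block)
```

Printing only the first element of each tuple gives the complete lines, including the line1, line2 and line3 prefixes.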
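Your example doesn't seem to work. It matches all of the lines. The closest I have got is the one posted in the edit section of the original post.
Read the last line: if you want to capture the line contents, just put capture groups around (.*)
For me it was more important to show the assertions without the clutter of capture groups.
You are correct. I added your regex to the edited code in the OP and it works now. Thanks a lot.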
line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output