Python: how to get the text from one match up to the next match of the same pattern


How do I get the text from one match up to the next match of the same pattern?

I have a log file like this:

INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
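I can find the first two lines, but I can't get the remaining lines up to the next match. So all I get is:

INFO1: BLAH
INFO2: BLAH

But I want the extracted groups to look like this: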

INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
I tried this:

start_exec_ptrn = r'INFO1: .+\nINFO2: .+'
last_exec_start = last_exec_end = 0
for m in re.finditer(start_exec_ptrn, log_content):
    start_exec = m.start()
    end_exec = m.end()
    print start_exec, '-', end_exec
    print log_content[last_exec_end:end_exec]
    last_exec_start = start_exec
    last_exec_end = end_exec
    print 150 * '*'
Thanks in advance, and sorry for my English.

Here:

>>> import re
>>> separator = "INFO1: BLAH\nINFO2: BLAH\n"
>>> [separator + p for p in re.split(re.escape(separator), all_text)[1:]]
This returns what you are looking for:

['INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATED TO THE INFO1 AND INFO2\nSOMETHING DIFFERENT
 RELATED TO THE INFO1 AND INFO2\nSOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2\nSOME
THING ALSO RELATED TO THE INFO1 AND INFO2\n', 'INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATE
D TO THE INFO1 AND INFO2\nSOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2\nSOMETHING O
THER WAY RELATED TO THE INFO1 AND INFO2\nSOMETHING ALSO RELATED TO THE INFO1 AND INFO2\n'
, 'INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATED TO THE INFO1 AND INFO2\nSOMETHING DIFFEREN
T RELATED TO THE INFO1 AND INFO2\nSOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2\nSOM
ETHING ALSO RELATED TO THE INFO1 AND INFO2\n']
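A fixed separator string only works while every header reads exactly INFO1: BLAH / INFO2: BLAH. If the header values vary, one option is to keep the pattern from the question and slice from the start of each match to the start of the next one. The following is a minimal sketch under that assumption; the function name split_sections and the sample text are made up for illustration.

```python
import re


def split_sections(log_content, header_pattern=r'INFO1: .+\nINFO2: .+'):
    """Return each section, from one header match up to the start of the next."""
    starts = [m.start() for m in re.finditer(header_pattern, log_content)]
    # Pair every start offset with the following one; the last section runs to the end.
    ends = starts[1:] + [len(log_content)]
    return [log_content[s:e] for s, e in zip(starts, ends)]


# Hypothetical sample data, shortened from the log in the question.
sample = (
    "INFO1: A\nINFO2: B\nSOMETHING RELATED\nSOMETHING DIFFERENT\n"
    "INFO1: C\nINFO2: D\nSOMETHING OTHER\n"
)
sections = split_sections(sample)
for section in sections:
    print(section)
    print(20 * '*')
```

The attempt in the question slices from the end of the previous match to the end of the current one, which is why it never reaches the lines that follow a header.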
You should look at using findall() on the string:

  ## Suppose we have a text with many email addresses
  str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

  ## Here re.findall() returns a list of all the found email strings
  emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
  for email in emails:
    # do something with each found email string
    print email

Try using a lookahead with re.split.

Or use re.findall with the re.DOTALL flag:

(INFO1:).*?(?=\1|$)
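As a rough illustration of the pattern above: because it contains a capture group, re.findall would return only the captured "INFO1:" for each match, so this sketch swaps in re.finditer and reads group(0) instead. The sample text and variable names are assumptions, not part of the original answer.

```python
import re

# Hypothetical sample data, shortened from the log in the question.
sample = (
    "INFO1: A\nINFO2: B\nSOMETHING RELATED\nSOMETHING DIFFERENT\n"
    "INFO1: C\nINFO2: D\nSOMETHING OTHER\n"
)

# (INFO1:) opens a section; .*? lazily consumes text (newlines included, via DOTALL)
# until the lookahead sees the next "INFO1:" (\1) or the end of the string ($).
pattern = re.compile(r'(INFO1:).*?(?=\1|$)', re.DOTALL)

# group(0) is the whole match, i.e. one complete section.
lookahead_sections = [m.group(0) for m in pattern.finditer(sample)]
for section in lookahead_sections:
    print(repr(section))
```

Without re.MULTILINE, $ also matches just before a newline that ends the string, so the last section comes back without its trailing newline. As for the re.split variant, Python only splits on zero-width matches such as a bare lookahead from version 3.7 onward.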

You can do it without regular expressions:

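with open('file.log') as f:
    # Strip newlines so that blank lines really are empty strings
    data = [line.strip() for line in f]

matches, headers, sec = [], [], []
for i, line in enumerate(data):
    if not line:
        continue
    line_lower = line.lower()
    if line_lower.startswith('info'):
        # A header line that follows a non-header line starts a new section
        if i == 0 or not data[i - 1].lower().startswith('info'):
            if headers and sec:
                matches.append({'headers': headers, 'matches': sec})
            headers, sec = [], []
        head = line_lower.split(':')[0]
        headers.append(head)
        continue
    # Keep only lines that mention one of the current headers
    if any(x in line_lower for x in headers):
        sec.append(line)
# Flush the last section, which has no following header to trigger the append
if headers and sec:
    matches.append({'headers': headers, 'matches': sec})
print(matches)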
#[{'headers': ['info1', 'info2'], 'matches': ['SOMETHING RELATED TO THE INFO1 AND INFO2', 'SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2', 'SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2', 'SOMETHING ALSO RELATED TO THE INFO1 AND INFO2']}, {'headers': ['info1', 'info2'], 'matches': ['SOMETHING RELATED TO THE INFO1 AND INFO2', 'SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2', 'SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2', 'SOMETHING ALSO RELATED TO THE INFO1 AND INFO2']}]

To retrieve all lines containing INFO1 or INFO2, the regex pattern should be:

^.*\b(INFO1|INFO2)\b.*$

Hope this helps.
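A short sketch of how that line pattern might be applied to a whole log at once. Two details here are assumptions rather than part of the answer: the re.MULTILINE flag, which the ^ and $ anchors need in order to work per line, and a non-capturing group, so that re.findall returns whole lines instead of just the captured word. The sample text is made up.

```python
import re

# Hypothetical sample data with one line that mentions neither header.
sample = (
    "INFO1: A\nINFO2: B\n"
    "SOMETHING RELATED TO THE INFO1 AND INFO2\n"
    "UNRELATED LINE\n"
)

# re.MULTILINE makes ^ and $ match at every line boundary, not only at the string ends.
# (?:...) groups without capturing, so findall yields the full matched line.
line_pattern = re.compile(r'^.*\b(?:INFO1|INFO2)\b.*$', re.MULTILINE)
info_lines = line_pattern.findall(sample)
print(info_lines)
```

This selects individual lines and does not group them into sections, so it answers a slightly different question from the one asked.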

How about using split()? Assuming your text is assigned to string, you can do:

separator = "INFO1: BLAH\nINFO2: BLAH"
# split() drops the separator; index 1 is the body of the first section
result = string.split(separator)[1]
print('{0}\n{1}'.format(separator, result))
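The snippet above recovers only the first section. A possible extension, sketched here with made-up sample data and the hypothetical names text and all_sections, keeps every piece that split() returns and puts the separator back in front of each one.

```python
separator = "INFO1: BLAH\nINFO2: BLAH"

# Hypothetical sample data, shortened from the log in the question.
text = (
    "INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATED\nSOMETHING DIFFERENT\n"
    "INFO1: BLAH\nINFO2: BLAH\nSOMETHING OTHER\n"
)

# split() removes the separator, so glue it back onto each piece.
# The first piece is whatever precedes the first header (empty here) and is skipped.
all_sections = [separator + body for body in text.split(separator)[1:]]
for section in all_sections:
    print(section)
```

This relies on every header being literally identical to the separator string; headers with varying values would need a pattern-based approach.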

If the sections always start with INFO, you can use groupby:

from itertools import groupby

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: x.startswith(("INFO1:","INFO2:")))
    for k,v in grps:
        if k:
            print("".join((v)) + "".join((next(grps,["",""])[1])))
Output:

INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2

INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2

INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
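If each section should be handed to another function rather than printed, a plain generator is one alternative to groupby. This is only a sketch: the name iter_sections and the assumption that every section opens with a line starting with "INFO1:" are illustrative, not taken from the answer.

```python
def iter_sections(lines, header_prefix="INFO1:"):
    """Yield each section as one string; a line starting with header_prefix opens a new one."""
    current = []
    for line in lines:
        # A new header closes the section collected so far.
        if line.startswith(header_prefix) and current:
            yield "".join(current)
            current = []
        current.append(line)
    # Emit the final section, which is not followed by another header.
    if current:
        yield "".join(current)


# Hypothetical sample data; a file object opened for reading could be passed instead.
sample_lines = [
    "INFO1: A\n", "INFO2: B\n", "SOMETHING RELATED\n",
    "INFO1: C\n", "INFO2: D\n", "SOMETHING OTHER\n",
]
generated_sections = list(iter_sections(sample_lines))
for section in generated_sections:
    print(section)
```

Any lines that appear before the first header are yielded as a section of their own, which may or may not be wanted.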

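So you want sections based on INFO1: and INFO2:? Or do you want to print all lines that contain INFO1 or INFO2?
re.split(r'(INFO1:.\nINFO2:...\n'),stuff) ?
Exactly, @padraickenningham. I want to extract each whole section and pass it to a function that can work with it.
So can we treat INFO2: as the delimiter, or can it appear without a preceding INFO1?
Thanks. But I want all the lines from INFO1: BLAH up to the last character before the next INFO1: BLAH.
So you want to split your log into all the lines about INFO1 and all the lines about INFO2, right?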