How to collect all lines of data from a file between keywords - start + end at line break

python regex python-3.x parsing


I am trying to collect specific information from very large log files, but I cannot figure out how to get the behavior I want.

For reference, the sample log looks like this:

garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need

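What I need is to find 'keyword1', grab the entire line that keyword1 is on (back to the timestamp), and then grab every subsequent line up to and including the entire line that keyword2 is on (through the last bit of data).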

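I have tried a few things so far. I could not get satisfactory results with the re methods (findall, match, search, etc.). I cannot figure out how to grab the data that comes before the match (even with a lookbehind), and, more importantly, I cannot figure out how to stop the capture at a phrase rather than at a single character: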

for match in re.findall('keyword1[keyword2]+|', showall.read()):

I have also tried something like this:

start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)

No matter what I did, it returned an empty list.

Finally, I tried something like this:

def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x

with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)

This last one captured everything from keyword1 to EOF, which included nearly 100,000 lines of garbage data.

You can use a regex if you use lazy "anything" quantifiers (.*?) to match from the start keyword to the end keyword:

import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need")

matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for match in matches:
    print(match.group())  # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data

You may need to .strip('\n') the match, because it begins with the newline that precedes the keyword line.

In short, the pattern breaks down as follows:

\n        newline
   .*?    as few characters as possible (with re.DOTALL this includes newlines)
   (keyword 1)   literal text - the () are only needed if you want the group
   .*?    as few characters as possible
   (keyword 2)   literal text - again, the () are not required
   .*?    as few characters as possible
$         end of line (with re.MULTILINE)

I included the () for clarity. If you do not need to evaluate the groups, you can remove them.
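One of the comments on this page notes that a pattern beginning with \n.*? starts matching at the first newline in the text, not at the line that holds the start keyword. Below is a minimal sketch of a variant that anchors on the keyword's own line instead. The name block_re and the sample text (with two unwanted lines in front) are assumptions made for illustration; this is not the answer's original pattern.

```python
import re

# Hypothetical sample: two unwanted lines precede the block we want.
text = (
    "garbage I don't need\n"
    "more garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage I don't need\n"
)

# ^ and $ match at line boundaries (re.MULTILINE).
# [^\n]* never leaves the current line, so the match starts on the keyword line.
# .*? crosses lines lazily (re.DOTALL) and stops at the first 'keyword 2'.
block_re = re.compile(
    r"^[^\n]*keyword 1.*?keyword 2[^\n]*$",
    re.MULTILINE | re.DOTALL,
)

match = block_re.search(text)
block = match.group() if match else ""
print(block)
```

Because the match starts at a line boundary, no stripping of a leading newline is needed. Like any approach that calls read() on the whole file, this still assumes the text fits in memory.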

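The following works for files of any size. It extracted nearly 2 million lines from a 250 MB log file in 3 seconds. The extracted portion was at the end of the file.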

If your file might not fit into available memory, I do not recommend using list, regex, or other in-memory techniques.

Test text file startstop_text:

line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output

Code:
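A minimal sketch of a line-by-line approach consistent with the description above; it is not necessarily the answer's original code. The function name extract_between and the in-memory stand-in for the startstop_text file are assumptions made for illustration.

```python
import io


def extract_between(lines, start_key, stop_key):
    """Yield lines from the first line containing start_key through the
    first line at or after it that contains stop_key, inclusive."""
    capturing = False
    for line in lines:
        if not capturing and start_key in line:
            capturing = True
        if capturing:
            yield line
            if stop_key in line:
                return  # stop here; the rest of the input is never read


# Stand-in for open("startstop_text"); any iterable of lines works.
sample_file = io.StringIO(
    "line 1 this should not appear in output\n"
    "line 2 keyword1\n"
    "line 3 appears in output\n"
    "line 4 keyword2\n"
    "line 5 this should not appear in output\n"
)

# Collected into a list only because the sample is tiny.
result = list(extract_between(sample_file, "keyword1", "keyword2"))
print("".join(result), end="")
```

For a real log, pass the file object returned by open() and write each yielded line straight to an output file, so that only one line is held in memory at a time. This sketch extracts only the first keyword1..keyword2 block.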

None of the other answers worked for me, but I was able to solve this with a regex:

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
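To show why this style of pattern captures whole lines at both ends, here is a small sketch using a simplified two-keyword version of it. The sample text, the variable names, and the dropped keyword3 part are assumptions made for illustration.

```python
import re

# Hypothetical log text containing two separate blocks.
log_text = (
    "noise\n"
    "t1 keyword1 a\n"
    "data\n"
    "end keyword2 b\n"
    "noise\n"
    "t2 keyword1 c\n"
    "end keyword2 d\n"
    "noise\n"
)

# No flags are set, so "." stops at newlines: the leading and trailing ".*"
# pick up the rest of the two keyword lines, while the lazy "[\s\S]*?"
# spans the lines in between and stops at the first keyword2.
pattern = re.compile(r".*keyword1[\s\S]*?keyword2.*")

blocks = pattern.findall(log_text)
for block in blocks:
    print(block)
    print("---")
```

Because findall keeps scanning after each match, every keyword1..keyword2 block in the text is returned, not only the first one.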

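See, you are trying to check whether the line contains keyword1, but your data contains keyword 1. Try it.

@WiktorStribiżew That is not my literal code; in my actual code I have explicit matches, so what is the problem?

If these are regexps, use if re.search(rx, line) instead of if 'keyword' in line.

Check the indentation in the second example.

@Nick Those are only excerpts of the relevant code. In my real code, new_list is initialized, and it remains empty after the code block runs. The indentation in the second example is a text formatting issue; I did not realize it got messed up during copy/paste. Again, it is correct in my real code. Thanks for trying to give feedback.

Sorry - I messed up my last edit - I replaced test_str with the filename: matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:\Program Files\Python36\lib\re.py", finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object

@Toenailsmcgee Using with open(filename, "r") as f: re.finditer(regex, f.read(), ...flags...) should do it, unless your file is too big to fit in memory.

@Nick It does not work - try it. It will still match, but starting from the first line instead of from the nearest \n.

Hey @toenailsmggee, the solution above is not complete code. My solution works well in Python 2.7 and 3.6 on both large and small files, whether the extracted lines are at the beginning, the end, or the middle. If there is a problem, please let me know what error or what wrong output you get. Do you want to find multiple instances of keyword1 and keyword2 and extract them all? If so, my solution will not work - but that is not what you asked for.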