Error when splitting data blocks with Python


I have a file to parse, and I need to split its data by LogType. Here is my data:

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0

I tried the code below, but it does not split the data the way I need. Here is the code I used:

import re

def parse_container(text, full_text_lines, filter_log_types=None, filter_content_types=None):
    results = {}

    first, rest = text.split('\n', 1)
    # print(rest)      # rest is the block of data mentioned above
    results['id'] = first
    all_log_types = re.compile('^(?=LogType:)', flags=re.MULTILINE).split(rest)
    print(all_log_types)

The result I get is:

['========================================================================\nLogType:container-
localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n
LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD \n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n
20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started\n \n']
['========================================================================\nLogType:container-
localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n']

The output I need:

['========================================================================\n','LogType:contain
er-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n', 
 'LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD \n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started\n \n']

['========================================================================\n','LogType:contain
    er-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n']

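In my output, each LogType is only preceded by \n inside one long string, but I need the data split at each LogType into separate, comma-separated list elements, as shown in the expected output.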

I am using Python 2.6.6. Please help me solve this. Thank you very much.

If a file contains multiple logs, try the following:
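import re

results = {}
# text is assumed to hold the contents of the file.
# Compile the pattern first: re.split() only accepts a flags argument from Python 2.7 on.
logs = re.compile('^=', re.MULTILINE).split(text)

for log in logs:
    if len(log) > 0:
        # Split once, at the end of the separator line.
        first, rest = log.split('=\n', 1)
        print('first', first)
        print('rest', rest)
        print("\n\n")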

Output:
first =================================================================================
rest LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started



first =================================================================================
rest LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
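
One possible explanation for the unsplit result in the question, assuming it really was run on Python 2.6.6: the pattern ^(?=LogType:) only ever matches an empty string, and re.split has split on empty matches only since Python 3.7. A minimal sketch that sidesteps this by slicing the text at the match positions instead; split_before is a hypothetical helper name, and the sample text is a shortened version of the data in the question.

```python
import re

def split_before(pattern, text):
    """Cut text in front of every match of pattern, keeping the match itself."""
    starts = [m.start() for m in re.finditer(pattern, text)]
    # Always begin at 0 so any text before the first match is kept as a piece.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

# Shortened sample block (same layout as the data in the question).
rest = (
    "=====\n"
    "LogType:container-localizer-syslog\n"
    "LogLength:0\n"
    "Log Contents:\n"
    "\n"
    "LogType:stderr\n"
    "LogLength:3000\n"
    "Log Contents:\n"
    "20/06/25 12:19:33 INFO datasources.FileScanRDD\n"
)

# (?m) is the inline form of re.MULTILINE, so ^ matches at every line start.
sections = split_before(r'(?m)^LogType:', rest)
print(sections)
```

The separator line comes back as its own element, followed by one element per LogType section, and joining the pieces gives back the original text.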

Based on your question, you can use this:
text = text.replace('=', '')          # remove every '='
all_log_types = text.split('\n\n')    # split on an empty line
print(all_log_types)
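
Note that replacing every '=' also removes any '=' that appears inside the log lines themselves. A hedged variant that drops only the lines made up entirely of '=' and then splits on blank lines; the sample text, including the key=value log line and the stdout section, is made up for illustration.

```python
import re

# Made-up sample with two blocks; the first contains '=' inside a log line.
text = (
    "=====\n"
    "LogType:stderr\n"
    "Log Contents:\n"
    "20/06/25 12:19:33 INFO conf key=value\n"
    "\n"
    "=====\n"
    "LogType:stdout\n"
    "LogLength:0\n"
)

# Drop only lines consisting entirely of '='; '=' inside log lines is kept.
# (?m) is the inline form of re.MULTILINE.
without_separators = re.sub(r'(?m)^=+\n', '', text)

# Split on blank lines and discard empty or whitespace-only pieces.
all_log_types = [part for part in without_separators.split('\n\n') if part.strip()]
print(all_log_types)
```

Because the separators are removed before the split, this also copes with several blocks in one file, as long as sections are separated by exactly one blank line.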

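We can easily split the logs with a regular expression in Python. The following code splits the logs wherever either of two conditions matches:

Condition 1: a run of = characters, followed by \n

Condition 2: two consecutive occurrences of \n

Wherever either condition matches, the text is split at that point. filter then removes all the empty strings returned by split and returns a filter object, which is converted to a list.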

import re

text = """===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
"""


# Split on a line of '=' characters or on a blank line, then drop empty strings.
output = list(filter(None, re.compile(r'=+\n|\n\n').split(text)))

print(output)

Output:
['LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:', 'LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD\n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started', 'LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\n']
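
The filter(None, ...) step in the answer above can be seen in isolation in the short sketch below, which uses a tiny made-up string instead of the real log. In Python 3, filter returns a lazy iterator, which is why it is wrapped in list(); a list comprehension gives the same result.

```python
import re

# Tiny made-up input: two separator lines and one blank line.
pieces = re.compile(r'=+\n|\n\n').split("===\nA\n\nB\n\n===\nC\n")
print(pieces)    # ['', 'A', 'B', '', 'C\n']

# filter(None, ...) keeps only truthy items, so empty strings are dropped.
cleaned = list(filter(None, pieces))
print(cleaned)   # ['A', 'B', 'C\n']

# Equivalent list comprehension.
same = [p for p in pieces if p]
```

The empty strings appear wherever a separator sits at the very start of the text or directly after another separator.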

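Thank you, Donald! But I still get the same output. Is there anything wrong with this: all_log_types = re.compile('^(?=LogType:)', flags=re.MULTILINE).split(rest)

@Lekshmi, the second line looks correct too. I tried it and got the result you wanted. I have added something to my answer that I hope will help narrow down the problem.

As you said, I think my text source is the problem.

I think so. Is there any way to solve this? I have to do this on the given file. I have added some more content to explain the problem better. Or can you suggest some way to resolve this error?

I added another string to match in the example above, '=$', which should find the last '=' at the end of the first line and then split. See if that helps.

Thanks a lot, Syed. I got partial output, but the first line, "=========================", is still attached to my first log type. How do I remove it?

If you don't need the "==", you can use this: text = text.replace('=', '') to remove the '=', then all_log_types = text.split('\n\n') and print(all_log_types).

It gives the correct output when I have only one block, but it fails when there are many blocks, as I mentioned in the question.

I ran into a similar problem in another area. Can you help me with that? Here is the link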