Error when splitting blocks of data with Python
I have a parsed file and need to split its data by LogType. Here is my data:
===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:
LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started
===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
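I applied some code that goes wrong when splitting the data. Here is the code I used:
import re

def parse_container(text, full_text_lines, filter_log_types=None, filter_content_types=None):
    results = {}
    # The first line is the container id; the remainder is the block of data shown above.
    first, rest = text.split('\n', 1)
    # print(rest)
    results['id'] = first
    # Zero-width split in front of every line that starts with "LogType:".
    all_log_types = re.compile('^(?=LogType:)', flags=re.MULTILINE).split(rest)
    print(all_log_types)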
The result I get is:
['========================================================================\nLogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\nLogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD \n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started\n \n']
['========================================================================\nLogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n']
The output I need:
['========================================================================\n','LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n','LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD \n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started\n \n']
['========================================================================\n','LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n']
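In my output you can see that each LogType is only preceded by \n inside one long string, whereas I need every LogType to become its own comma-separated list element.
In the expected output you can see that the data has been split into separate elements at each LogType.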
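I am using Python 2.6.6. Please help me solve this problem. Thank you very much.
If a file contains multiple logs, try the following:
import re

results = {}
# Split the file into container blocks at every line that starts with '='.
# Note: the flags argument of re.split is only available from Python 2.7 on.
logs = re.split('^=', text, 0, re.MULTILINE)
for log in logs:
    if len(log) > 0:
        # Separate the remainder of the '=' line from the block body.
        # maxsplit=1 keeps the unpacking safe if '=\n' also occurs in the log contents.
        first, rest = log.split('=\n', 1)
        print('first', first)
        print('rest', rest)
        print("\n\n")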
Output:
first =================================================================================
rest LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:
LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started
first =================================================================================
rest LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
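A possible explanation for the two different results reported in this thread, offered as a guess rather than a diagnosis: before Python 3.7, re.split did not split on empty matches, and '^(?=LogType:)' can only ever match an empty string, so on Python 2.6.6 the text would come back in one piece. The sketch below sidesteps that by first inserting a marker with re.sub, which does act on empty matches, and then splitting on the marker with str.split. The names split_by_logtype and MARKER, the '\x1e' sentinel and the shortened sample are made up for illustration, and the sentinel is assumed never to occur in the real log text.

```python
import re

# Hypothetical sentinel (ASCII record separator); assumed absent from the log text.
MARKER = '\x1e'

def split_by_logtype(block):
    """Split one container block so that each piece starts at a 'LogType:' line."""
    # Put the sentinel in front of every line that starts with 'LogType:'.
    # The inline (?m) flag is used instead of flags=, which re.sub lacks in Python 2.6.
    marked = re.sub(r'(?m)^(?=LogType:)', MARKER, block)
    # A plain string split on the sentinel needs no regex support at all.
    return marked.split(MARKER)

# Shortened stand-in for one block of the data in the question.
sample = (
    '=====\n'
    'LogType:container-localizer-syslog\n'
    'LogLength:0\n'
    'Log Contents:\n'
    '\n'
    'LogType:stderr\n'
    'LogLength:3000\n'
    'Log Contents:\n'
    '20/06/25 12:19:33 INFO datasources.FileScanRDD\n'
)

pieces = split_by_logtype(sample)
print(pieces)
```

With this sample the separator line ends up as its own first element, followed by one element per LogType, which is the shape asked for in the question. Nothing is consumed by the split, so joining the pieces gives back the original text.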
You can use this for your problem:
text=text.replace('=','')
all_log_types=text.split('\n\n') # splitting based on an Empty line
print(all_log_types)
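One caveat with replace('=', '') is that it also removes any '=' that appears inside the log contents. A rough alternative, assuming that separator lines consist of nothing but '=' characters and that blocks are separated by blank lines, is to delete only those whole lines before splitting. The function name split_blocks and the sample text are hypothetical.

```python
import re

def split_blocks(text):
    """Drop separator lines made only of '=' and split the rest on blank lines."""
    # (?m) makes ^ match at every line start; only lines that are entirely '=' are removed,
    # so an '=' inside the log contents is left alone.
    without_rules = re.sub(r'(?m)^=+\n', '', text)
    # Split on blank lines and discard empty pieces.
    return [chunk for chunk in without_rules.split('\n\n') if chunk]

# Made-up sample with two separator lines and an '=' inside the contents.
sample = (
    '====\n'
    'LogType:a\n'
    'Log Contents:\n'
    '\n'
    'LogType:b\n'
    'Log Contents:\n'
    'key=value\n'
    '\n'
    '====\n'
    'LogType:c\n'
)

chunks = split_blocks(sample)
print(chunks)
```

Here 'key=value' keeps its '=' and the second separator line does not stay attached to the following block.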
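We can easily split the logs in Python with a regular expression. The following code splits the log when either of two conditions matches (they are joined with an OR):
Condition 1: multiple occurrences of = followed by \n
Condition 2: two occurrences of \n
Wherever either condition matches, the text is split. filter removes all the empty strings returned by split and returns the remaining items (as an iterator in Python 3), which are then converted to a list.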
import re
text = """===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
"""
output = list(filter(None, re.compile('[=]+.\n|\n\n').split(text)))
print(output)
Output:
['LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:', 'LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD\n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started', 'LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\n']
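To see what each half of that pattern does, here is a small sketch on a made-up sample (the sample string and variable names are not from the original post). Because of the '.', the first alternative needs at least two '=' characters directly before the newline, so a line holding a single '=' would not count as a separator.

```python
import re

pattern = re.compile('[=]+.\n|\n\n')

# Made-up sample: one separator line, then two LogType sections divided by a blank line.
sample = '=====\nLogType:a\nLog Contents:\n\nLogType:b\n'

# The separators the pattern finds: the '=' line and the blank line.
separators = pattern.findall(sample)
print(separators)   # ['=====\n', '\n\n']

# split leaves an empty string where the text starts with a separator.
raw = pattern.split(sample)
print(raw)          # ['', 'LogType:a\nLog Contents:', 'LogType:b\n']

# filter(None, ...) drops the empty strings.
cleaned = list(filter(None, raw))
print(cleaned)      # ['LogType:a\nLog Contents:', 'LogType:b\n']
```

Note that this approach discards the separators themselves, so the '=' line does not appear in the result.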
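Thank you, Donald! But I still get the same output. Is there anything wrong with this: all_log_types = re.compile('^(?=LogType:)', flags=re.MULTILINE).split(rest)
@Lekshmi, the second line looks correct too. I tried it and got the result you want. I have added something to my answer that I hope will help narrow down the problem.
As you said, I think my text source is the problem. Is there any way to solve it? I have to do this for the given file. I have added some details to the question to make the problem clearer. Or can you suggest some way to work around this error?
I added another string to match in the example above, '=$', which should find the last '=' at the end of the first line and then split. See if that helps.
Thanks a lot, Syed. I get partial output, but the first line "=========================" is still attached to my first log type. How can I remove it?
If you don't need the "==", you can use this: text = text.replace('=', '')  # removes the "="; all_log_types = text.split('\n\n'); print(all_log_types)
It gives the correct output when I have only one block, but it fails when there are many blocks, as I mentioned in the question.
I have run into a similar problem in another area. Could you help me with it? Here is the link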