Splitting data with Python and assigning the sorted data to columns for viewing in Excel
Hi, I have a set of data in a text file that looks like the following (dummy data substituted for the school data):
01-01-1998 00:00:00 AM GP: D(B):1234 to time difference. Hourly Avg:-3 secs
01-01-1998 00:00:12 AM GP: D(A): 2345 to time difference. Hourly Avg:0 secs
01-01-1998 00:08:08 AM SYS: The Screen Is now minimised.
01-01-1998 00:09:10 AM 00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five. #there are somehow some glitch in the system showing 2 timestamp
01-01-1998 00:14:14 AM SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.
01-01-1998 01:06:24 AM ET: Program Disconnected event.
I would like to tidy the data into the structure shown below:
[['Timestamp','System','Di','Message'], # <-- header
['01-01-1998 00:00:00 AM', 'GP:','D(B):','1234 to time difference. Hourly Avg:-3 secs'],
['01-01-1998 00:00:12 AM', 'GP:','D(A):', '2345 to time difference. Hourly Avg:0 secs'],
['01-01-1998 00:08:08 AM', 'SYS:','','The Screen Is now minimised.'], #<-- with a blank
['01-01-1998 00:09:10 AM', 'SC:','','Findcorrect: HW. D:1. Count one two three four five.'],
['01-01-1998 00:14:14 AM', 'SC:','D1','test. Old:111, New:222, Calculated was 123, out of 120 secs.' ],
['01-01-1998 01:06:24 AM', 'ET:','', 'Program Disconnected event.']]
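Because I lack Python knowledge, the code is not fully developed yet. I would greatly appreciate any guidance or examples!
One question to consider: should I use pandas/DataFrame, or can I do this without pandas?
Edit: the first line of data has been updated to "D(B):1234"; there should not be any space between the number and D(B).

The code that cleans up this messy data uses regular expressions in part and plain string splitting in part.
Because commas inside the text need to be protected (for example Old:111, New:222, in one of the lines), the cleaned CSV is written with the csv module:
Create the demo file: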
with open("data.txt","w") as w:
    w.write("""01-01-1998 00:00:00 AM GP: D(B): 1234 to time difference. Hourly Avg:-3 secs
01-01-1998 00:00:12 AM GP: D(A): 2345 to time difference. Hourly Avg:0 secs
01-01-1998 00:08:08 AM SYS: The Screen Is now minimised.
01-01-1998 00:09:10 AM 00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five. #there are somehow some glitch in the system showing 2 timestamp
01-01-1998 00:14:14 AM SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.
01-01-1998 01:06:24 AM ET: Program Disconnected event.""")
Parse it and write it:
import re

def parseLine(line):
    # get the timestamp
    ts = re.match(r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} +(?:AM|PM)",line)
    # get all but the timestamp - cleaning the double-time issue
    cleaned = re.sub(r"^\d{2}-\d{2}-\d{4} (\d{2}:\d{2}:\d{2} (AM|PM) +)+","", line)
    # split cleaned part based on occurrence of ["D(A)", "D(B)", "D1", "D2"]
    if any(k in cleaned.split(":")[1] for k in ["D(A)", "D(B)", "D1", "D2"]):
        system, di, msg = cleaned.split(" ", maxsplit = 2)
    else:
        di = ""
        system, msg = cleaned.split(":", maxsplit = 1)
    # return each line as list of cleaned stuff:
    return [ts[0].strip() ,system.strip(), di.strip(), msg.strip()]

# fixed header, lines will be appended
p = [['Timestamp','System','Di','Message']]
with open("data.txt","r") as r:
    for l in r:
        l = l.strip()
        p.append(parseLine(l))

import csv
with open("c.csv","w",newline="") as w:
    writer = csv.writer(w,quoting=csv.QUOTE_ALL)
    writer.writerows(p)
Read and print the written file:
with open("c.csv") as r:
    print(r.read())
File content (quoted CSV); without the quoting, text such as Old:111, New:222, Calculated was 123,
would break your format:
"Timestamp","System","Di","Message"
"01-01-1998 00:00:00 AM","GP:","D(B):","1234 to time difference. Hourly Avg:-3 secs"
"01-01-1998 00:00:12 AM","GP:","D(A):","2345 to time difference. Hourly Avg:0 secs"
"01-01-1998 00:08:08 AM","SYS","","The Screen Is now minimised."
"01-01-1998 00:09:10 AM","SC","","Findcorrect: W. D:1. Count one two three four five. #there are somehow some glitch in the system showing 2 timestamp"
"01-01-1998 00:14:14 AM","SC:","D1","test. Old:111, New:222, Calculated was 123, out of 120 secs."
"01-01-1998 01:06:24 AM","ET","","Program Disconnected event."
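Hi, thanks for your guidance. I just wanted to check something with you: the data shown above is only part of my test data, and when I test with my actual data I get the error 'NoneType' object is not subscriptable. While troubleshooting I realised that the error occurs when the hour part of the time has a single digit. I tried changing the digit check to "\d{2,}", but it still does not work. Please advise, thanks! I have also updated the question, as I found an error in my test data!

@Thanksforelping It seems your daytime is my sleep time. You can change \d{2} to \d{1,2} so that it accepts one or two digits; \d{2,} accepts two or more digits. The tool I use to test regular expressions can even explain them to you in "plain" text. The error in the data should not affect this, because that part comes after the ":"; whether there is a space or not should not matter.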
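The 'NoneType' error arises because re.match returns None when the pattern does not match, and ts[0] then fails. A minimal sketch of the suggested \d{1,2} change with an explicit None check follows. The names STAMP, FIRST_TIME and parse_line_safe are hypothetical and not part of the answer, and the function only separates the timestamp from the rest of the line; it does not replace parseLine.

```python
import re

# Date, then one or more "H:MM:SS AM/PM" groups (the glitch repeats the time).
# \d{1,2} accepts a one- or two-digit hour; \d{2,} would require at least two.
STAMP = re.compile(
    r"^(\d{2}-\d{2}-\d{4}) ((?:\d{1,2}:\d{2}:\d{2} +(?:AM|PM) +)+)"
)
FIRST_TIME = re.compile(r"\d{1,2}:\d{2}:\d{2} +(?:AM|PM)")

def parse_line_safe(line):
    """Return [timestamp, rest_of_line], or None if no timestamp is found."""
    m = STAMP.match(line)
    if m is None:  # avoids "'NoneType' object is not subscriptable"
        return None
    # Keep only the first time when the glitch duplicated it.
    first_time = FIRST_TIME.match(m.group(2)).group(0)
    return [f"{m.group(1)} {first_time}", line[m.end():].strip()]

print(parse_line_safe("01-01-1998 1:06:24 AM ET: Program Disconnected event."))
# ['01-01-1998 1:06:24 AM', 'ET: Program Disconnected event.']
print(parse_line_safe("not a log line"))
# None
```

Returning None lets the calling loop skip or log malformed lines instead of crashing on them.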
"Timestamp","System","Di","Message"
"01-01-1998 00:00:00 AM","GP:","D(B):","1234 to time difference. Hourly Avg:-3 secs"
"01-01-1998 00:00:12 AM","GP:","D(A):","2345 to time difference. Hourly Avg:0 secs"
"01-01-1998 00:08:08 AM","SYS","","The Screen Is now minimised."
"01-01-1998 00:09:10 AM","SC","","Findcorrect: W. D:1. Count one two three four five. #there are somehow some glitch in the system showing 2 timestamp"
"01-01-1998 00:14:14 AM","SC:","D1","test. Old:111, New:222, Calculated was 123, out of 120 secs."
"01-01-1998 01:06:24 AM","ET","","Program Disconnected event."