Using Python to split data and assign the sorted data to columns for viewing in Excel


Hi, I have a set of data in a text file that looks like the following (dummy data replacing school data):

01-01-1998 00:00:00 AM  GP: D(B):1234 to time difference. Hourly Avg:-3 secs
01-01-1998 00:00:12 AM  GP: D(A): 2345 to time difference. Hourly Avg:0 secs
01-01-1998 00:08:08 AM  SYS: The Screen Is now minimised.
01-01-1998 00:09:10 AM  00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five.       #there are somehow some glitch in the system showing 2 timestamp
01-01-1998 00:14:14 AM  SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.    
01-01-1998 01:06:24 AM  ET: Program Disconnected event.

I would like to organise the data as shown below, so that each part ends up in its own column:

[['Timestamp','System','Di','Message'],    #  <-- header
['01-01-1998 00:00:00 AM', 'GP:','D(B):','1234 to time difference. Hourly Avg:-3 secs'],
['01-01-1998 00:00:12 AM', 'GP:','D(A):', '2345 to time difference. Hourly Avg:0 secs'],
['01-01-1998 00:08:08 AM', 'SYS:','','The Screen Is now minimised.'],   #<-- with a blank
['01-01-1998 00:09:10 AM', 'SC:','','Findcorrect: HW. D:1. Count one two three four five.'],
['01-01-1998 00:14:14 AM', 'SC:','D1','test. Old:111, New:222, Calculated was 123, out of 120 secs.' ],
['01-01-1998 01:06:24 AM', 'ET:','', 'Program Disconnected event.']]

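Because of my lack of Python knowledge, the code is not fully developed yet, and I would greatly appreciate any guidance or examples! One question to consider: should I use pandas/DataFrame, or can I do this without pandas?

Edit: the first row of data has been updated to "D(B):1234"; there should be no space between the number and D(B).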

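The code to clean this messy data uses partly regular expressions and partly string splitting.

Because the commas inside the text need to be masked (for example in the line containing Old:111, New:222,), the cleaned CSV is written with the csv module.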
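As a minimal sketch of why the quoting matters, the following writes one row whose message contains commas and reads it back again. The in-memory buffer and the variable names are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io

# Sample row taken from the data above: the message itself contains commas.
row = ["01-01-1998 00:14:14 AM", "SC:", "D1",
       "test. Old:111, New:222, Calculated was 123, out of 120 secs."]

# Write to an in-memory buffer instead of a file, quoting every field.
buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)
quoted_line = buffer.getvalue()

# Reading the quoted line back yields the original four fields.
parsed = next(csv.reader(io.StringIO(quoted_line, newline="")))

# A naive join/split on "," breaks the message into extra columns.
naive = ",".join(row).split(",")

print(len(parsed), len(naive))
```

The quoted row comes back as four fields, while the naive split produces more, because every comma inside the message starts a new column.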

Create the demo file:

with open("data.txt","w") as w:
    w.write("""01-01-1998 00:00:00 AM  GP: D(B): 1234 to time difference. Hourly Avg:-3 secs
01-01-1998 00:00:12 AM  GP: D(A): 2345 to time difference. Hourly Avg:0 secs
01-01-1998 00:08:08 AM  SYS: The Screen Is now minimised.
01-01-1998 00:09:10 AM  00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five.       #there are somehow some glitch in the system showing 2 timestamp
01-01-1998 00:14:14 AM  SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.    
01-01-1998 01:06:24 AM  ET: Program Disconnected event.""")

Parse and write it:

import re

def parseLine(line):
    # get the timestamp
    ts = re.match(r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} +(?:AM|PM)",line)

    # get all but the timestamp - cleaning the double-time issue
    cleaned = re.sub(r"^\d{2}-\d{2}-\d{4} (\d{2}:\d{2}:\d{2} (AM|PM) +)+","", line)

    # split cleaned part based on occurrence of ["D(A)", "D(B)", "D1", "D2"]
    if any(k in cleaned.split(":")[1] for k in ["D(A)", "D(B)", "D1", "D2"]):
        system, di, msg = cleaned.split(" ", maxsplit = 2)
    else:
        di = ""
        system, msg = cleaned.split(":", maxsplit = 1)

    # return each line as list of cleaned stuff:
    return [ts[0].strip() ,system.strip(), di.strip(), msg.strip()]

# fixed header, lines will be appended   
p = [['Timestamp','System','Di','Message']]

with open("data.txt","r") as r:
    for l in r:
        l = l.strip()
        p.append(parseLine(l))

import csv
with open("c.csv","w",newline="") as w:
    writer = csv.writer(w,quoting=csv.QUOTE_ALL)
    writer.writerows(p)

Read and print the written file:

with open("c.csv") as r:
    print(r.read())

File content (quoted CSV); without the quoting, the commas in "test. Old:111, New:222, Calculated was 123, ..." would break your format:

"Timestamp","System","Di","Message"
"01-01-1998 00:00:00 AM","GP:","D(B):","1234 to time difference. Hourly Avg:-3 secs"
"01-01-1998 00:00:12 AM","GP:","D(A):","2345 to time difference. Hourly Avg:0 secs"
"01-01-1998 00:08:08 AM","SYS","","The Screen Is now minimised."
"01-01-1998 00:09:10 AM","SC","","Findcorrect: W. D:1. Count one two three four five.       #there are somehow some glitch in the system showing 2 timestamp"
"01-01-1998 00:14:14 AM","SC:","D1","test. Old:111, New:222, Calculated was 123, out of 120 secs."
"01-01-1998 01:06:24 AM","ET","","Program Disconnected event."
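Note that in the parseLine function above, lines containing a D-tag are split on spaces, so a value such as "D(B):1234" (no space, as in the edited question) would land in the Di column as a whole. A possible alternative, sketched here under the assumption that every line follows the layout shown in the sample data, is one regular expression with named groups. The names LINE_RE and parse_line and the single-digit-hour sample are hypothetical and not from the original answer.

```python
import re

# Assumed layout: date, one or more "h:mm:ss AM/PM" stamps, a system tag
# ending in ":", an optional D-tag, then the free-text message.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}-\d{2}-\d{4}) +"
    r"(?P<time>\d{1,2}:\d{2}:\d{2} +(?:AM|PM))"
    r"(?: +\d{1,2}:\d{2}:\d{2} +(?:AM|PM))*"   # skip repeated timestamps
    r" +(?P<system>[A-Z]+:)"
    r" *(?P<di>D\([AB]\):|D[12]\b:?)?"         # optional D(A):, D(B):, D1, D2
    r" *(?P<msg>.*)$"
)

def parse_line(line):
    """Return [timestamp, system, di, message], or None if nothing matches."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # the caller decides how to handle unparseable lines
    timestamp = f"{match['date']} {' '.join(match['time'].split())}"
    return [timestamp, match["system"], match["di"] or "", match["msg"].strip()]

samples = [
    "01-01-1998 00:00:00 AM  GP: D(B):1234 to time difference. Hourly Avg:-3 secs",
    "01-01-1998 0:08:08 AM  SYS: The Screen Is now minimised.",  # single-digit hour (assumed)
    "01-01-1998 00:09:10 AM  00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five.",
    "01-01-1998 00:14:14 AM  SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.",
]

rows = [parse_line(s) for s in samples]
for row in rows:
    print(row)
```

Returning None for lines that do not match, instead of indexing the match object directly, lets the caller skip or log unexpected lines rather than crash on them.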

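Hi, thank you for your guidance. I just want to check something with you. The data shown above is only part of my test data, and when I tested with my actual data I got the error 'NoneType' object is not subscriptable. After some troubleshooting I realised that the error occurs when the hour part of the time has only one digit. I tried changing the digit pattern to \d{2,}, but it still does not work. Please advise, thank you! I have also updated the question, as I found an error in my test data.

@Thanksforelping It seems your daytime is my sleeping time. You can change the pattern to \d{1,2} so that it accepts 1 to 2 digits; \d{2,} accepts 2 or more digits. The tool I use to test regular expressions can even explain them to you in "plain" text. The error in the data should not affect this, because that part is split on the colon (:), so a space or no space should not matter.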
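To illustrate the difference between the three quantifiers, here is a small sketch that tries each variant of the timestamp pattern on a timestamp with a single-digit hour. The sample timestamp and the dictionary labels are assumptions for illustration only.

```python
import re

# Hypothetical timestamp with a single-digit hour, as described in the comment.
stamp = "01-01-1998 1:06:24 AM"

patterns = {
    "exactly two": r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} +(?:AM|PM)",
    "two or more": r"\d{2}-\d{2}-\d{4} \d{2,}:\d{2}:\d{2} +(?:AM|PM)",
    "one or two": r"\d{2}-\d{2}-\d{4} \d{1,2}:\d{2}:\d{2} +(?:AM|PM)",
}

results = {name: re.match(pattern, stamp) for name, pattern in patterns.items()}

for name, m in results.items():
    # re.match returns None when nothing matches; indexing None (m[0])
    # is what raises "'NoneType' object is not subscriptable".
    print(name, "->", m[0] if m else None)
```

Only the \d{1,2} variant matches here. The same change would also be needed in the re.sub pattern that strips the timestamp, since it uses \d{2} for the hour as well.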