Parsing data with Python


I am writing a program to parse a .txt file. The .txt data file looks like this:

2020062300,TAB
DEBUT20200623
01AAAAA BJAZBACVB.              2012100199991231
01BBBBB BJSSBACVB.              2012100199991231
01SS    BTRFBACVB.              2012100199991231
01D.    BJSSBACVB.              2012100199991231
02AAAAA BJAZBACVB.              2012100199991231
02BBBBB BJSSBACVB.              2012100199991231
03SS    BTRFBACVB.              2012100199991231
03D.    BJSSBACVB.              2012100199991231
FIN20200623
2020062301,TAB
DEBUT20200623
FRAAAAA BJAZ            2012100199991231   
KSBBBBB BJSCVB.         2012100199991231
BBSS    BTRFBACVB.      2012100199991231
SSD.    BJSSBACVB.      2012100199991231
FIN20200623
2020062309,TAB
DEBUT20200623
TOTO    BJAZDGGD          2012100199991231   
TATA    BJSCVBNS          2012100199991231
TITI    BTRFBACV          2012100199991231
TOMA    BJSSBACV          2012100199991231
FIN20200623

Each section of the file is delimited by:
2020062300,TAB
DEBUT20200623
…
FIN20200623
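The delimiters described above can be recognised with a small generator. This is only a sketch: `split_sections` and `HEADER` are hypothetical names, and it assumes that no data row starts with `DEBUT` or `FIN`.

```python
import re

# Header line such as "2020062300,TAB": 8-digit date, 2-digit table number.
HEADER = re.compile(r"^(\d{8})(\d{2}),TAB$")

def split_sections(lines):
    """Yield (table_number, body_lines) for each DEBUT ... FIN section."""
    number, body = None, []
    for raw in lines:
        line = raw.rstrip()
        m = HEADER.match(line)
        if m:                            # a new section begins
            number, body = m.group(2), []
        elif line.startswith("DEBUT"):   # start marker, nothing to store
            continue
        elif line.startswith("FIN"):     # end marker closes the section
            yield number, body
            number, body = None, []
        elif number is not None and line:
            body.append(line)            # a data row inside a section

sample = [
    "2020062300,TAB",
    "DEBUT20200623",
    "01AAAAA BJAZBACVB.              2012100199991231",
    "02AAAAA BJAZBACVB.              2012100199991231",
    "FIN20200623",
    "2020062301,TAB",
    "DEBUT20200623",
    "FRAAAAA BJAZ            2012100199991231",
    "FIN20200623",
]
sections = list(split_sections(sample))
for number, body in sections:
    print(number, len(body))
```

Each yielded pair holds the two-digit table number and the raw rows of that section, which can then be split into fields separately.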

As output, we want 3 objects: T00_XX, where XX is the first two characters of each line. So we should get three output tables: T00_01, T00_02, T00_03.

Table T00_01

tab, name, des, start_date, end_date
01, AAAAA, BJAZBACVB., 20121001, 99991231
01, BBBBB, BJSSBACVB., 20121001, 99991231
01, SS, BTRFBACVB., 20121001, 99991231
01, D., BJSSBACVB., 20121001, 99991231

Table T00_02

tab, name, des, start_date, end_date
02, AAAAA, BJAZBACVB., 20121001, 99991231
02, BBBBB, BJSSBACVB., 20121001, 99991231

Table T00_03

tab, name, des, start_date, end_date
03, SS, BTRFBACVB., 20121001, 99991231
03, D., BJSSBACVB., 20121001, 99991231

2020062301,TAB
DEBUT20200623

FIN20200623

Table T01

name, desc,  start_date, end_date
FRAAAAA, BJAZ, 20121001, 99991231   
KSBBBBB, BJSCVB., 20121001, 99991231
BBSS, BTRFBACVB., 20121001, 99991231
SSD., BJSSBACVB., 20121001, 99991231

2020062309,TAB
DEBUT20200623

FIN20200623

Table T09

TOTO, BJAZDGGD, 2012100199991231   
TATA, BJSCVBNS, 2012100199991231
TITI, BTRFBACV, 2012100199991231
TOMA, BJSSBACV, 2012100199991231

I started writing a program, but it does not do what I need yet:

%%time
path=r"fichiertest.txt"
data_00_01=[]
with open(path, "r") as f:
    for line in f.readlines():
        print(line[8:14])
        if(line[8:13]=="00,TAB"):
            print(line)
            if(line[0:5]=="DEBUT"):
                print(line)
                if(line[0:2]=="01"):
                    print(line)
                    content_00_01 = {}
                    content_00_01["tab"]=line[0:2]
                    content_00_01["nom"]=line[2:8]
                    content_00_01["desc"]=line[8:20]
                    content_00_01["date_debut"]=line[32:40]
                    content_00_01["date_fin"]=line[40:48]

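Because the if conditions are nested, they can never all be true for the same line, so the conditions cannot be matched across several lines.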

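My solution can certainly be simplified, but it should show you how this can work. One or another of the regular expressions may also need to be adapted to your data.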

import re
from collections import defaultdict

def readFile(path):
    data = defaultdict(list)
    with open(path, 'r') as fp:
        while True:
            ln = fp.readline()
            if ln == '': break

            m = re.match(r'^(\d{8})(\d{2}),TAB$', ln.strip())
            if not m: raise OSError('Illegal format')
            id,num = m.groups()

            ln = fp.readline().strip()
            if ln != f'DEBUT{id}': raise OSError('Illegal format')

            while True:
                ln = fp.readline().strip()
                if ln == f'FIN{id}': break
                elif ln == '' or (ln[:3] == 'FIN' and ln[3:] != id): 
                    raise OSError('Illegal format')
                if num == '00':
                    m = re.match(r'^(\d{2})([A-Z\.]+)\s+([^\s]+)\s+(\d{8})(\d{8})$', ln)
                    if m:
                        tab,name,desc,start_date,end_date = m.groups()
                        data[f'T{num}_{tab}'].append({
                            # 'tab': tab,
                            'name': name,
                            'desc': desc,
                            'start_date': start_date,
                            'end_date': end_date
                        })
                    else:
                        raise OSError('Illegal format')
                elif num == '01' or num == '09':
                    m = re.match(r'^([A-Z\.]+)\s+([^\s]+)\s+(\d{8})(\d{8})$', ln)
                    if m:
                        name,desc,start_date,end_date = m.groups()
                        data[f'T{num}'].append({
                            'name': name,
                            'desc': desc,
                            'start_date': start_date,
                            'end_date': end_date
                        })
                    else:
                        raise OSError('Illegal format')
    return data
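To see what "adapting the regular expression" means in practice, here is the pattern used for section 00 in the code above applied to a single row. The names `ROW_00` and `fields` are only for illustration, and the second input line is a made-up row that is not in the sample file.

```python
import re

# Same pattern as the one used for section "00" in readFile.
ROW_00 = re.compile(r'^(\d{2})([A-Z\.]+)\s+([^\s]+)\s+(\d{8})(\d{8})$')

line = "01AAAAA BJAZBACVB.              2012100199991231"
fields = ROW_00.match(line).groups()
print(fields)  # ('01', 'AAAAA', 'BJAZBACVB.', '20121001', '99991231')

# A name with lowercase letters is rejected, because the name group
# only accepts uppercase letters and dots.
rejected = ROW_00.match("01Abc   BJAZ   2012100199991231")
print(rejected)  # None
```

If real names can contain lowercase letters or digits, the `[A-Z\.]+` group would have to be widened accordingly; otherwise `readFile` raises `OSError('Illegal format')` for such rows.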

Another approach:

from io import StringIO

txt = '''
2020062300,TAB
DEBUT20200623
01AAAAA BJAZBACVB.              2012100199991231
01BBBBB BJSSBACVB.              2012100199991231
01SS    BTRFBACVB.              2012100199991231
01D.    BJSSBACVB.              2012100199991231
02AAAAA BJAZBACVB.              2012100199991231
02BBBBB BJSSBACVB.              2012100199991231
03SS    BTRFBACVB.              2012100199991231
03D.    BJSSBACVB.              2012100199991231
FIN20200623
2020062301,TAB
DEBUT20200623
FRAAAAA BJAZ            2012100199991231   
KSBBBBB BJSCVB.         2012100199991231
BBSS    BTRFBACVB.      2012100199991231
SSD.    BJSSBACVB.      2012100199991231
FIN20200623
2020062309,TAB
DEBUT20200623
TOTO    BJAZDGGD          2012100199991231   
TATA    BJSCVBNS          2012100199991231
TITI    BTRFBACV          2012100199991231
TOMA    BJSSBACV          2012100199991231
FIN20200623
'''.strip()

tname = ""
all = {}  # dictionary of all tables
for ln in StringIO(txt):   # can also read from file
    ln = ' '.join(ln.split())  # remove extra spaces
    if ',TAB' in ln:   # set table name
       tname='T' + ln[-6:-4]  # 00
       continue
    if ln[:5] == 'DEBUT':  # skip row
       continue
    if ln[:3] == 'FIN':  # end of table
       tname=""
       continue
    if ln[0] == '0':  # table 00
       tname = tname.split('_')[0] + "_" + ln[:2]
       ln = ','.join((ln[:2] + " " + ln[2:-8] + " " + ln[-8:]).split(" "))
       if not tname in all: all[tname]=['tab, name, des, start_date, end_date']
    else:  # other tables
       ln = ','.join((ln[:-8] + " " + ln[-8:]).split(" "))
       if not tname in all: all[tname]=['name, desc,  start_date, end_date']
    all[tname].append(ln)

# print tables
for t in all:
   print(t)
   for r in all[t]:
      print(r)

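It looks like your various conditions are alternatives. You probably want if / elif / elif (all at the same indentation level) rather than nested if statements.

Why don't you continue writing the program until it meets your needs? What is your expected output?

Thanks. As expected output, for the first block 2020062300 we expect 5 tables: T00_01, T00_02, T00_03, TAB DEBUT20200623 ... FIN20200623, and T01 and T09 for the other two blocks.

Sorry, but it is still unclear: you talk about 3 tables, 5 tables, 5 elements and so on... Could you edit the question (i.e. not in a comment) and show the exact expected output rather than describing it?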
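The if / elif suggestion from the first comment could look roughly like the following. It is a sketch rather than a drop-in fix: `parse` and `make_row` are hypothetical names, and it assumes that neither the name nor the description contains spaces and that the two dates are always the last 16 digits of a row.

```python
from collections import defaultdict

def make_row(name, desc, dates):
    # The two dates are stored back to back as YYYYMMDDYYYYMMDD.
    return {"name": name, "desc": desc,
            "start_date": dates[:8], "end_date": dates[8:]}

def parse(lines):
    """Group data rows into tables using one flat if/elif chain."""
    tables = defaultdict(list)  # table name -> list of row dicts
    section = None              # "00", "01", ... while inside a section
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.endswith(",TAB"):
            section = line[8:10]          # the two digits before ",TAB"
        elif line.startswith("DEBUT"):
            pass                          # start marker carries no data
        elif line.startswith("FIN"):
            section = None                # end marker closes the section
        elif section == "00":
            # Rows of section 00 start with a two-character tab code.
            name, desc, dates = line[2:].split()
            tables[f"T00_{line[:2]}"].append(make_row(name, desc, dates))
        elif section is not None:
            name, desc, dates = line.split()
            tables[f"T{section}"].append(make_row(name, desc, dates))
    return dict(tables)

sample = [
    "2020062300,TAB",
    "DEBUT20200623",
    "01AAAAA BJAZBACVB.              2012100199991231",
    "02BBBBB BJSSBACVB.              2012100199991231",
    "FIN20200623",
    "2020062301,TAB",
    "DEBUT20200623",
    "FRAAAAA BJAZ            2012100199991231   ",
    "FIN20200623",
]
tables = parse(sample)
print(sorted(tables))  # ['T00_01', 'T00_02', 'T01']
```

The key point is that the current section is remembered in a variable between iterations, so each line only has to satisfy one branch instead of every nested condition at once.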