How can I use regular expressions in Python to normalize a txt-format report that contains repeated header rows?
I am trying to normalize the report below into a DataFrame that will contain the following fields:
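- Parent_Account - these are four-digit numbers with no dollar amounts in the balance/activity fields. This should be repeated for each sub-account
- Parent_Description - the description found next to the parent account. This should be repeated for each sub-account
- Sub_Account - these numbers also contain a "-"; the sub-account is the part after the first "-"
- Center - the third level in the account. It comes after the second "-". It is not always present, in which case it will be blank
- Sub_Description - the description of the sub-account
- Beginning_Balance - the number under Beginning Balance. If there is a "cr" next to the number, it should be the number multiplied by -1
- Period_Activity - the number under Period Activity. If there is a "cr" next to the number, it should be the number multiplied by -1
- Ending - the number under Ending Balance. If there is a "cr" next to the number, it should be the number multiplied by -1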
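To make the field rules above concrete, here is a minimal sketch of the two conversions they describe: splitting an account code into its levels, and turning an amount with a trailing "cr" into a negative float. The helper names `split_account` and `parse_amount` are hypothetical and are not part of the question.

```python
def split_account(code):
    """Split 'parent-sub-center' into three parts; missing levels become ''."""
    parts = code.strip().split('-', 2)
    parts += [''] * (3 - len(parts))   # pad so there are always three levels
    return tuple(parts)

def parse_amount(text):
    """Turn '4,453,872.12cr' into -4453872.12 and '.00' into 0.0."""
    text = text.strip()
    negative = text.endswith('cr')
    if negative:
        text = text[:-2]               # drop the trailing 'cr' marker
    value = float(text.replace(',', ''))
    return -value if negative else value

print(split_account('1100-1315-1093'))
print(split_account('1010-1111'))
print(parse_amount('4,453,872.12cr'))
```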
import pandas as pd
import numpy as np
import re
string = """ gltbrp.p 2+ 25.15.4 Trial Balance Summary Date: 10/02/20
Page: 1 COMP AB&E Time: 16:24:55
COMP AB & E Reporting Currency: NIS
Exchange Rate:
Beginning Balance Period Activity Ending Balance
Account Description 01/01/19 31/12/19 Adjust Balance
----------------------- ------------------------ ------------------- ------------------- ------------------- ------ -------
1010 Cash-Deposit-0 Bal., FC
1010-1111 CFS RECEIVABLES CASH BOO 848,377.90 646,932.39 1,495,310.29
1010-2611 INTER ACCOUNT TRANSFERS 4,453,872.12cr 15,804,424.27 20,258,296.39
1010-9122 DEFAULT SUB-ACCOUNT CODE 1,088,346.84 1,423,931.41cr 2,512,278.25
1012 Cash-Disburse-0 Bal.,FC
1012-1114 QUEENSMAIN ACCOUNT 9,193,838.58 3,141,528.70cr 6,052,309.88
1014 EURO CONTROL ACCT
1014-9122 DEFAULT SUB-ACCOUNT CODE EUR 2,789.21 11,403.07cr 8,613.86cr
1016 USD CONTROL ACCT
1016-9122 DEFAULT SUB-ACCOUNT CODE USD .00 78,484.56 78,484.56
1022 EURO BANK ACCOUNTS
1022-9122 DEFAULT SUB-ACCOUNT CODE EUR 5,055,924.60 1,342,240.47cr 3,713,684.13
1023 USD BANK ACCOUNTS
1023-9122 DEFAULT SUB-ACCOUNT CODE USD 4,744,992.89 1,680,118.33cr 3,064,874.56
1042 Cash-Disb.-Non.0 Bal,NFC
1042-1162 CURR HK$ & CHINESE RMB 330.76 330.76cr .00
1100 Accounts Rec.-Trade:FC
1100-1311 CFS RECEIVABLES TRADE 23,103,558.73 4,369,946.25cr 18,733,612.48
1100-WBAB CFS ACCRUED RECEIVABLES 101,096.06cr 4,251.26 96,844.80cr
1100-WBAB-1501 MAXIFIT < 300MM 310,266.12cr 44,420.84 265,845.28cr
1100-1315 SALES REBATES 1,150,318.67cr 35,024.14cr 1,185,342.81cr
1100-1315-1093 Commpac 46,439.08cr 15,999.96cr 62,439.04cr
1100-1315-1102 HNH IRON BALANCING 654,359.47cr 156,251.52cr 810,610.99cr
1100-1315-1501 MAXIFIT < 300MM 351,099.82cr 63,893.90cr 414,993.72cr
1100-1316 CONTACTOR REBATES 3,804,172.43cr 2,073,515.44 1,730,656.99cr
1100-1316-1093 Commpac 382,263.81cr 19,739.11cr 402,002.92cr
1100-1316-1102 HNH IRON BALANCING 1,827,536.88cr 486,674.25 1,340,862.63cr
1100-1316-1501 MAXIFIT < 300MM 865,491.17cr 610,548.87cr 1,476,040.04cr
1100-1316-1502 MAXIFIT > 300MM 321,028.94cr 73,990.76 247,038.18cr
1100-1317 SALES REBATE CONTROL ACC 2,879,225.96cr 682,081.36 2,197,144.60cr
1100-1317-1093 Commpac 18,955.18cr 12,405.87 6,549.31cr
1100-1317-1102 HNH IRON BALANCING 1,499,613.14cr 377,041.56 1,122,571.58cr
1100-1318 Hattersley Rebates 22,470.58cr 4,449.48cr 26,920.06cr
1100-1318-1102 HNH IRON BALANCING 48,921.90cr 10,152.79cr 59,074.69cr
1100-1319 VAT TRANSFER CONTRA 2,981,496.28cr 1,243,140.41 1,738,355.87cr
1100-1319-1501 MAXIFIT < 300MM 1,627,262.29cr 134,977.01 1,492,285.28cr
1100-2315 AR/AP CREDIT BAL SWITCH 11,810,820.05cr 869,957.47 10,940,862.58cr
1100-2315-1501 MAXIFIT < 300MM 11,810,820.05 869,957.47cr 10,940,862.58
1100-9122 DEFAULT SUB-ACCOUNT CODE 485,594.24 254,072.72cr 231,521.52
1100-9122-1501 MAXIFIT < 300MM 294,857.51 5,354.08cr 289,503.43
1102 EURO ACCOUNTS RECEIVABLE
1102-9122 DEFAULT SUB-ACCOUNT CODE EUR 2,433,435.33 1,022,867.13cr 1,410,568.20
1103 USD ACCOUNTS RECEIVABLE
1103-9122 DEFAULT SUB-ACCOUNT CODE USD 1,801,882.57 250,490.33 2,052,372.90
1124 V.A.T. Receivable
1124-9122 DEFAULT SUB-ACCOUNT CODE 2,981,496.28 1,243,140.41cr 1,738,355.87
1124-9122-1501 MAXIFIT < 300MM 1,627,262.29 134,977.01cr 1,492,285.28
1132 Other Rec.-Charges Rebil
1132-1355 CLAIMS - RECOVERABLE 1,044.43 5,029.58cr 3,985.15cr
1138 Other Rec.-Employee Rec.
gltbrp.p 2+ 25.15.4 Trial Balance Summary Date: 10/02/20
Page: 2 COMP AB&E Time: 16:24:56
COMP BS & U Reporting Currency: NIS
Exchange Rate:
Beginning Balance Period Activity Ending Balance
Account Description 01/01/19 31/12/19 Adjust Balance
----------------------- ------------------------ ------------------- ------------------- ------------------- ------ -------
1138-1321 ADVANCES TO EMPLOYEES 100.00 100.00cr .00
1138-1323 TRAVEL ADV ALL EMPLOYEES 2,357.42 1,219.98cr 1,137.44
1156 Other Rec.-Pension Rec.
1156-9122 DEFAULT SUB-ACCOUNT CODE 8,008.59 1,914.69cr 6,093.90
1160 Other Rec.-Rent Rec.
1160-9122 DEFAULT SUB-ACCOUNT CODE 3,150.00 .00 3,150.00
1172 Other Rec.-Miscellaneous
1172-1333 COMP FUND .00 6,618.31 6,618.31
1172-9122 DEFAULT SUB-ACCOUNT CODE 26,242.01 115,117.97cr 88,875.96cr
Beginning Date: 01/01/19
Ending Date: 31/12/19
Summarize Sub-Accounts: No
Summarize Cost Centers: No
Currency: NIS
Suppress Zero Amounts: Yes
Round to Nearest Thousand: No
Round to Nearest Whole Unit: No
Reporting Currency: Output: text
Batch ID:
"""
Example of the expected output for account 1010. In this case there is no "Center" number, so that field is blank.
re is certainly a nice tool, but it is of little use here because we have a file of fixed-length fields.
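As a small illustration of what fixed-length fields mean in practice, the sketch below builds one line with known column widths and cuts it apart by position. The widths (23, 29 and 20 characters, each followed by one space) are an assumption chosen to match the slice offsets used in the answer's code; a real report may use different columns.

```python
# Build a sample line with assumed column widths of 23, 29 and 20 characters.
line = f"{'1010-1111':<23} {'CFS RECEIVABLES CASH BOO':<29} {'848,377.90':>20}"

# Slice by position instead of searching for a pattern.
acc = line[:23].strip()
desc = line[24:53].strip()
begin = line[54:74].strip()

print(acc, '|', desc, '|', begin)
```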
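The important part is to discard all the header lines and process only the relevant data lines: a single state variable is enough to track whether we are in a header or in data lines.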
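The state-variable idea can be sketched on its own as a small generator. The sample text is made up for illustration and much shorter than the real report; the rule it applies (a header ends at the dashed line and starts again at a blank line or a line beginning with `gltbrp`) is the one used in the answer's code.

```python
import io

sample = (
    "gltbrp.p   Trial Balance Summary\n"
    "Account    Description\n"
    "---------- -----------\n"
    "1010       Cash\n"
    "1010-1111  CFS RECEIVABLES\n"
    "\n"
    "gltbrp.p   Trial Balance Summary\n"
    "Account    Description\n"
    "---------- -----------\n"
    "1012       Cash-Disburse\n"
)

def data_lines(text):
    """Yield only the data lines, skipping every page header."""
    in_header = True                      # the single state variable
    for line in io.StringIO(text):
        stripped = line.strip()
        if in_header:
            if stripped.startswith('----------'):
                in_header = False         # the dashed line closes the header
            continue
        if not stripped or stripped.startswith('gltbrp'):
            in_header = True              # a new page header begins
            continue
        yield line.rstrip('\n')

kept = list(data_lines(sample))
print(kept)
```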
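Finally, a pandas DataFrame should be fed in a single pass: its underlying containers are NumPy arrays, which allow fast processing but make appending new values expensive.
So I would build a list, or better a dictionary, for each data line, store all those records in a list, and feed that list of records to a DataFrame at the end. The code (listed further down) assumes that `string` contains the sample data.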
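As a toy illustration of that record-list pattern, separate from the full code, the sketch below collects one dictionary per row and constructs the DataFrame only once. The two input rows and the name `small_df` are made up for the example.

```python
import pandas as pd

records = []
for acc, amount in [('1010-1111', 848377.90), ('1010-2611', -4453872.12)]:
    parent, sub = acc.split('-')
    # One plain dict per data line; appending to a Python list is cheap.
    records.append({'Parent_Account': parent,
                    'Sub_Account': sub,
                    'Beginning_Balance': amount})

# The DataFrame is built in one pass from the finished list of records.
small_df = pd.DataFrame(records,
                        columns=['Parent_Account', 'Sub_Account',
                                 'Beginning_Balance'])
print(small_df)
```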
For `df`, it gives:
Parent_Account Parent_Description Sub_Account Center Sub_Description Beginning_Balance Period_Activity Ending
0 1010 Cash-Deposit-0 Bal., FC 1111 CFS RECEIVABLES CASH BOO 848377.90 646932.39 1495310.29
1 1010 Cash-Deposit-0 Bal., FC 2611 INTER ACCOUNT TRANSFERS -4453872.12 15804424.27 20258296.39
2 1010 Cash-Deposit-0 Bal., FC 9122 DEFAULT SUB-ACCOUNT CODE 1088346.84 -1423931.41 2512278.25
3 1012 Cash-Disburse-0 Bal.,FC 1114 QUEENSMAIN ACCOUNT 9193838.58 -3141528.70 6052309.88
4 1014 EURO CONTROL ACCT 9122 DEFAULT SUB-ACCOUNT CODE EUR 2789.21 -11403.07 -8613.86
5 1016 USD CONTROL ACCT 9122 DEFAULT SUB-ACCOUNT CODE USD 0.00 78484.56 78484.56
6 1022 EURO BANK ACCOUNTS 9122 DEFAULT SUB-ACCOUNT CODE EUR 5055924.60 -1342240.47 3713684.13
7 1023 USD BANK ACCOUNTS 9122 DEFAULT SUB-ACCOUNT CODE USD 4744992.89 -1680118.33 3064874.56
8 1042 Cash-Disb.-Non.0 Bal,NFC 1162 CURR HK$ & CHINESE RMB 330.76 -330.76 0.00
9 1100 Accounts Rec.-Trade:FC 1311 CFS RECEIVABLES TRADE 23103558.73 -4369946.25 18733612.48
10 1100 Accounts Rec.-Trade:FC WBAB CFS ACCRUED RECEIVABLES -101096.06 4251.26 -96844.80
11 1100 Accounts Rec.-Trade:FC WBAB 1501 CFS ACCRUED RECEIVABLES -310266.12 44420.84 -265845.28
12 1100 Accounts Rec.-Trade:FC 1315 SALES REBATES -1150318.67 -35024.14 -1185342.81
13 1100 Accounts Rec.-Trade:FC 1315 1093 SALES REBATES -46439.08 -15999.96 -62439.04
14 1100 Accounts Rec.-Trade:FC 1315 1102 SALES REBATES -654359.47 -156251.52 -810610.99
15 1100 Accounts Rec.-Trade:FC 1315 1501 SALES REBATES -351099.82 -63893.90 -414993.72
16 1100 Accounts Rec.-Trade:FC 1316 CONTACTOR REBATES -3804172.43 2073515.44 -1730656.99
17 1100 Accounts Rec.-Trade:FC 1316 1093 CONTACTOR REBATES -382263.81 -19739.11 -402002.92
18 1100 Accounts Rec.-Trade:FC 1316 1102 CONTACTOR REBATES -1827536.88 486674.25 -1340862.63
19 1100 Accounts Rec.-Trade:FC 1316 1501 CONTACTOR REBATES -865491.17 -610548.87 -1476040.04
20 1100 Accounts Rec.-Trade:FC 1316 1502 CONTACTOR REBATES -321028.94 73990.76 -247038.18
21 1100 Accounts Rec.-Trade:FC 1317 SALES REBATE CONTROL ACC -2879225.96 682081.36 -2197144.60
22 1100 Accounts Rec.-Trade:FC 1317 1093 SALES REBATE CONTROL ACC -18955.18 12405.87 -6549.31
23 1100 Accounts Rec.-Trade:FC 1317 1102 SALES REBATE CONTROL ACC -1499613.14 377041.56 -1122571.58
24 1100 Accounts Rec.-Trade:FC 1318 Hattersley Rebates -22470.58 -4449.48 -26920.06
25 1100 Accounts Rec.-Trade:FC 1318 1102 Hattersley Rebates -48921.90 -10152.79 -59074.69
26 1100 Accounts Rec.-Trade:FC 1319 VAT TRANSFER CONTRA -2981496.28 1243140.41 -1738355.87
27 1100 Accounts Rec.-Trade:FC 1319 1501 VAT TRANSFER CONTRA -1627262.29 134977.01 -1492285.28
28 1100 Accounts Rec.-Trade:FC 2315 AR/AP CREDIT BAL SWITCH -11810820.05 869957.47 -10940862.58
29 1100 Accounts Rec.-Trade:FC 2315 1501 AR/AP CREDIT BAL SWITCH 11810820.05 -869957.47 10940862.58
30 1100 Accounts Rec.-Trade:FC 9122 DEFAULT SUB-ACCOUNT CODE 485594.24 -254072.72 231521.52
31 1100 Accounts Rec.-Trade:FC 9122 1501 DEFAULT SUB-ACCOUNT CODE 294857.51 -5354.08 289503.43
32 1102 EURO ACCOUNTS RECEIVABLE 9122 DEFAULT SUB-ACCOUNT CODE EUR 2433435.33 -1022867.13 1410568.20
33 1103 USD ACCOUNTS RECEIVABLE 9122 DEFAULT SUB-ACCOUNT CODE USD 1801882.57 250490.33 2052372.90
34 1124 V.A.T. Receivable 9122 DEFAULT SUB-ACCOUNT CODE 2981496.28 -1243140.41 1738355.87
35 1124 V.A.T. Receivable 9122 1501 DEFAULT SUB-ACCOUNT CODE 1627262.29 -134977.01 1492285.28
36 1132 Other Rec.-Charges Rebil 1355 CLAIMS - RECOVERABLE 1044.43 -5029.58 -3985.15
37 1138 Other Rec.-Employee Rec. 1321 ADVANCES TO EMPLOYEES 100.00 -100.00 0.00
38 1138 Other Rec.-Employee Rec. 1323 TRAVEL ADV ALL EMPLOYEES 2357.42 -1219.98 1137.44
39 1156 Other Rec.-Pension Rec. 9122 DEFAULT SUB-ACCOUNT CODE 8008.59 -1914.69 6093.90
40 1160 Other Rec.-Rent Rec. 9122 DEFAULT SUB-ACCOUNT CODE 3150.00 0.00 3150.00
41 1172 Other Rec.-Miscellaneous 1333 COMP FUND 0.00 6618.31 6618.31
42 1172 Other Rec.-Miscellaneous 9122 DEFAULT SUB-ACCOUNT CODE 26242.01 -115117.97 -88875.96
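Note: as you requested, the description of the center is omitted in my code, but changing it to add that column to the DataFrame is straightforward.

Comments:

Regex is indeed a powerful tool, but you should read: :-) By the way, your text shows a second sub-account level at 1100-WBAB-1501 or 1100-1315-1093. What should happen there?

@Serge Ballesta: Thank you! I updated it. Yes, there is the parent account, the sub-account, and then an optional "center" number. Thanks for taking the time here.

What if a sub-account line has no monetary values? When I run it in Python it throws "ValueError: could not convert string to float", which I think is because there is no value to convert. Does this need an if statement that sets it to 0.0 when it is empty? begin = line[54:74].strip() period = line[75:95].strip() end = line[96:].strip() Side question: does this assume that the whitespace spacing is very consistent?

I think this is fixed. I added or '0': begin = line[54:74].strip() or '0' period = line[75:95].strip() or '0' end = line[96:].strip() or '0'

I found a needle-in-a-haystack exception. There is a parent account that has monetary values. How would I account for that in the code when it happens? Thanks so much. Example: 2900 Retained Earnings 45415424.56cr .00 45415424.56cr

@Shmelky: I can help you build the DataFrame according to your requirements. But I know neither what the data actually means nor what you want to do with the DataFrame, so I cannot really advise you on how to model it.

The data is transaction data for different accounts. A unique account is based on the Parent-Sub_Account-Center combination (where the values exist). We have the beginning balance, which is the starting amount. Then we have the period activity, which is how money was gained or lost in each account. Then we have the ending balance, which is the beginning balance plus or minus the amount gained or lost. The data comes in a report format (a txt file), and I am trying to put it into a DataFrame so that I can validate each value against external data and compare them easily by joining tables.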
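The two edge cases raised in the comments can be sketched as follows. The blank-field handling mirrors the `or '0'` fix reported above. The treatment of a parent line that carries amounts (emitting it as a row with empty Sub_Account and Center) is only one possible modeling choice and an assumption of this sketch, not something the answer prescribes; `parent_with_amounts` is a hypothetical helper name.

```python
def money(s):
    """Like the answer's helper, except that a blank field becomes 0.0."""
    s = s.strip()
    if not s:
        return 0.0
    neg = s.endswith('cr')
    if neg:
        s = s[:-2]                        # drop the trailing 'cr' marker
    value = float(s.replace(',', ''))
    return -value if neg else value

def parent_with_amounts(acc, desc, begin, period, end):
    """Assumed handling: keep the parent line as a row of its own."""
    return {'Parent_Account': acc, 'Parent_Description': desc,
            'Sub_Account': '', 'Center': '', 'Sub_Description': '',
            'Beginning_Balance': money(begin),
            'Period_Activity': money(period),
            'Ending': money(end)}

row = parent_with_amounts('2900', 'Retained Earnings',
                          '45415424.56cr', '.00', '45415424.56cr')
print(row)
```

In the main loop, such a row would be appended whenever the account has a single level and at least one of the three amount fields is non-empty.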
import pandas as pd
import io

def money(s):
    """Convert a number having ',' as thousand separator and a trailing cr
    as negative sign to an ordinary float"""
    neg = s.endswith('cr')
    if neg:
        s = s[:-2]  # drop the trailing 'cr' marker
    s = s.replace(',', '')
    return -float(s) if neg else float(s)

data = []
colnames = ['Parent_Account', 'Parent_Description', 'Sub_Account',
            'Center', 'Sub_Description', 'Beginning_Balance',
            'Period_Activity', 'Ending']
header = True  # state variable: True while we are inside a page header
for line in io.StringIO(string):
    if header:
        # the dashed line is the last line of every page header
        if line.strip().startswith('----------'):
            header = False
        continue
    else:
        # a blank line or a new report banner starts the next header
        if line.strip().startswith('gltbrp') or len(line.strip()) == 0:
            header = True
            continue
    # extract the fields from the line
    acc = line[:23].strip()
    desc = line[24:53].strip()
    begin = line[54:74].strip()
    period = line[75:95].strip()
    end = line[96:].strip()
    acc_details = len(acc.split('-', 2))
    if acc_details == 1:  # a parent record: only store parent values
        parent_row = {'Parent_Account': acc, 'Parent_Description': desc}
    else:
        row = parent_row.copy()  # initialize parent values
        if acc_details == 2:
            row['Sub_Account'] = acc[5:]
            row['Sub_Description'] = desc
            row['Center'] = ''
        else:
            # a center record keeps the sub-account fields of the previous row
            row['Center'] = acc.split('-', 2)[2]
        row['Beginning_Balance'] = money(begin)
        row['Period_Activity'] = money(period)
        row['Ending'] = money(end)
        parent_row = row  # keep relevant fields for following record
        data.append(row)

df = pd.DataFrame(data, columns=colnames)