How to regex multiple data blocks in Python

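At university we got a new test rig that saves its results to a .txt file. The file has the following layout:

header

data (with data header)

closing information

What I want to do is separate these blocks with a regex and then run further analysis on the data blocks with pandas in Python.

Example dataset: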

MTM Test Data File
Output file name:   C:\2021\CRUDE\Testrun\Testrun.mtmd
Profile file name:  D:\2021\CRUDE\CRUDE.mtmp
Profile description:    CRUDE Testrun
Lubricant name: 
Comments:

Number of steps in profile: 12
Number of steps completed:  12
Test started at 09.02.2021 12:06:13

Step 1 started at 09.02.2021 12:16:22
Step type   Traction    Step description    SRR 1 -100%. 30N. 40C
Zero traction force (N) 0.9055  (measured at the start of this step)
Disc track radius used for this step (mm)   21.078
SRR (%) Traction Coeff (-)                          Step Time (s)   Pot RTD Temp (degC) Lube RTD Temp (degC)    Ball Load (N)   Pin Load (N)    Wear (um)   ECR (%)     Ball Speed 1    Ball Speed 2    Ball Speed 3    Ball Speed 4    Disc Speed 1    Disc Speed 2    Disc Speed 3    Disc Speed 4    Disc Frequency (Hz) Rolling Speed (mm/s)    Sliding Speed (mm/s)    SRR (%)     TF1 (N) TF2 (N) TF3 (N) TF4 (N) Traction Force (N)  Traction Coeff (-)  Rolling Force (N)   Rolling Coeff (-)   Traction Trace (-)
1.003   0.0022                          6   39.7    40.1    29.868      341 0       1990.03 2009.93         2009.74 1989.53             1999.81 20.05   1.003       1.0718  0.9433          0.0643  0.0022          
1.993   0.0041                          12  39.6    40.1    29.989      337 0       1979.91 2019.36         2020.27 1980.02             1999.89 39.85   1.993       1.1262  0.8794          0.1234  0.0041          
3.011   0.0060                          18  39.6    40.0    29.866      349 0       1969.89 2030.23         2030.18 1970.07             2000.09 60.22   3.011       1.1759  0.8182          0.1789  0.0060          
3.997   0.0075                          24  39.5    39.9    29.825      340 0       1959.39 2039.68         2039.81 1960.23             1999.78 79.94   3.997       1.2227  0.7727          0.2250  0.0075      

Step 2 started at 09.02.2021 12:18:53
Step type   Stribeck    Step description    50% SRR. 30N. 0-3.2m/s. 40C
Zero traction force (N) 0.7063  (measured at the start of this step)
Disc track radius used for this step (mm)   21.078
Rolling Speed (mm/s)    Traction Coeff (-)                          Step Time (s)   Pot RTD Temp (degC) Lube RTD Temp (degC)    Ball Load (N)   Pin Load (N)    Wear (um)   ECR (%)     Ball Speed 1    Ball Speed 2    Ball Speed 3    Ball Speed 4    Disc Speed 1    Disc Speed 2    Disc Speed 3    Disc Speed 4    Disc Frequency (Hz) Rolling Speed (mm/s)    Sliding Speed (mm/s)    SRR (%)     TF1 (N) TF2 (N) TF3 (N) TF4 (N) Traction Force (N)  Traction Coeff (-)  Rolling Force (N)   Rolling Coeff (-)   Traction Trace (-)
3199.615    0.0231                          6   39.4    40.1    29.986      341 0       2400.32 3998.23         3999.27 2400.64             3199.62 1598.27 49.952      1.6923  0.3072          0.6925  0.0231          
2999.596    0.0231                          12  39.4    40.1    29.980      335 0       2250.96 3748.61         3748.57 2250.24             2999.60 1497.99 49.940      1.6982  0.3154          0.6914  0.0231          
2799.767    0.0233                          19  39.5    40.1    29.971      341 0       2101.02 3498.66         3498.66 2100.73             2799.77 1397.79 49.925      1.7011  0.3028          0.6991  0.0233          
2599.883    0.0236                          25  39.5    40.1    30.035      338 0       1950.80 3249.04         3249.38 1950.31             2599.88 1298.66 49.951      1.7147  0.2943          0.7102  0.0236          

[keep on going with Step 3, Step 4, Step 5, etc. till Step 12]

All steps completed. test completed normally at 09.02.2021 13:10:23

The code I have so far:

import re

crodaFile = "C:\\2021\\CRUDE\\Testrun\\Testrun.txt"

with open(crodaFile) as f:
    fileread = f.read()

regex = r".+?(?=Step \d+)"
p = re.compile(regex, re.MULTILINE | re.DOTALL) 

m = p.finditer(fileread)
for match in m:
    print(match)

Output:

<re.Match object; span=(0, 321), match='MTM278 Test Data File\nOutput file name:\tC:\\202>
<re.Match object; span=(321, 4199), match='Step 1 started at 09.02.2021 12:16:22\nStep type\>
<re.Match object; span=(4199, 9379), match='Step 2 started at 09.02.2021 12:18:53\nStep type\>
<re.Match object; span=(9379, 13277), match='Step 3 started at 09.02.2021 12:22:31\nStep type\>
<re.Match object; span=(13277, 18484), match='Step 4 started at 09.02.2021 12:25:09\nStep type\>
<re.Match object; span=(18484, 22361), match='Step 5 started at 09.02.2021 12:36:11\nStep type\>
<re.Match object; span=(22361, 27544), match='Step 6 started at 09.02.2021 12:38:45\nStep type\>
<re.Match object; span=(27544, 31433), match='Step 7 started at 09.02.2021 12:42:21\nStep type\>
<re.Match object; span=(31433, 36638), match='Step 8 started at 09.02.2021 12:44:57\nStep type\>
<re.Match object; span=(36638, 40561), match='Step 9 started at 09.02.2021 12:55:58\nStep type\>
<re.Match object; span=(40561, 45816), match='Step 10 started at 09.02.2021 12:58:28\nStep type>
<re.Match object; span=(45816, 49744), match='Step 11 started at 09.02.2021 13:04:22\nStep type>

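I am missing the last step (Step 12) and the closing information:

All steps completed. test completed normally at 09.02.2021 13:10:23

Any suggestions on how I can capture the last step and the closing information?

You miss the last step because the lookahead assertion, which does not consume any characters, requires another Step to its right. Nothing of that form follows Step 12, so neither the last step nor the closing information can match.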
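The effect can be reproduced on a toy string. The sample text below is made up for illustration and is not taken from the real file; the pattern is the one from the question. The lazy match can only end where the lookahead finds another "Step <number>", so the final step and everything after it are dropped.

```python
import re

# Made-up miniature of the file layout, for illustration only.
sample = (
    "header\n"
    "Step 1 started\nrow a\n"
    "Step 2 started\nrow b\n"
    "All steps completed.\n"
)

# Pattern from the question: lazily match up to the next "Step <number>".
pattern = re.compile(r".+?(?=Step \d+)", re.DOTALL)

blocks = [m.group() for m in pattern.finditer(sample)]
for block in blocks:
    print(repr(block))
# Only the header and the Step 1 block are printed; Step 2 and the
# closing line have no following "Step <number>" and never match.
```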
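Instead of using a lookahead, you can match Step followed by a number and then all following lines that do not start with Step and a digit.

If the layout of the file is always the same and you need all the separate parts, you can stop the step match at a line that starts with Step and a digit or at an empty line, and capture the header and the closing information by matching their specific opening text: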
^(?:Step \d+.*(?:\r?\n(?!Step \d|$).*)*|(?:MTM|All steps completed.).*(?:\r?\n(?!Step \d).*)*)


Example output, shortened to 3 steps:
<re.Match object; span=(0, 283), match='MTM Test Data File\nOutput file name:   C:\\2021\>
<re.Match object; span=(284, 1862), match='Step 1 started at 09.02.2021 12:16:22\nStep type >
<re.Match object; span=(1864, 3484), match='Step 2 started at 09.02.2021 12:18:53\nStep type >
<re.Match object; span=(3487, 5107), match='Step 3 started at 09.02.2021 12:18:53\nStep type >
<re.Match object; span=(5110, 5177), match='All steps completed. test completed normally at 0>
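A minimal sketch of how the pattern might be applied in Python. The pattern uses ^ and $ as line anchors, so it has to be compiled with re.MULTILINE. The sample string is a made-up, shortened stand-in for the real file, and the names sample, header, steps and closing are chosen here for illustration only.

```python
import re

# Shortened, made-up stand-in for the real file (two steps, few columns).
sample = (
    "MTM Test Data File\n"
    "Number of steps in profile: 2\n"
    "\n"
    "Step 1 started at 09.02.2021 12:16:22\n"
    "Step type\tTraction\n"
    "1.003\t0.0022\n"
    "\n"
    "Step 2 started at 09.02.2021 12:18:53\n"
    "Step type\tStribeck\n"
    "3199.615\t0.0231\n"
    "\n"
    "All steps completed. test completed normally at 09.02.2021 13:10:23\n"
)

# Pattern from the answer; re.MULTILINE makes ^ and $ match per line.
pattern = re.compile(
    r"^(?:Step \d+.*(?:\r?\n(?!Step \d|$).*)*"
    r"|(?:MTM|All steps completed.).*(?:\r?\n(?!Step \d).*)*)",
    re.MULTILINE,
)

blocks = [m.group() for m in pattern.finditer(sample)]

# Assumption: the first block is the header and the last is the closing text.
header, steps, closing = blocks[0], blocks[1:-1], blocks[-1]

print(len(steps), "step blocks")
print(closing.strip())
```

Splitting by position (first block, last block, everything in between) only holds as long as the file layout stays the same.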


Great answer, thank you! The layout of the file "should" always be the same. I have not found any outliers yet.
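Given a fixed layout, one step block could then be split into its metadata lines and its data table. This is only a sketch under assumptions that are not confirmed by the thread: fields are tab-separated (the match output in the question shows \t), and each step block has four metadata lines before the column header, as in the example dataset. The step_block text is a made-up, shortened example.

```python
import csv
import io

# Made-up, shortened step block; the real file has many more columns.
step_block = (
    "Step 1 started at 09.02.2021 12:16:22\n"
    "Step type\tTraction\tStep description\tSRR 1 -100%. 30N. 40C\n"
    "Zero traction force (N)\t0.9055\t(measured at the start of this step)\n"
    "Disc track radius used for this step (mm)\t21.078\n"
    "SRR (%)\tTraction Coeff (-)\tStep Time (s)\n"
    "1.003\t0.0022\t6\n"
    "1.993\t0.0041\t12\n"
)

N_META_LINES = 4  # assumption: four metadata lines precede the column header

lines = step_block.splitlines()
meta_lines = lines[:N_META_LINES]
table_text = "\n".join(lines[N_META_LINES:])

reader = csv.reader(io.StringIO(table_text), delimiter="\t")
columns = next(reader)  # first table line holds the column names
# Empty fields (common in the real file) become None instead of failing.
rows = [[float(v) if v.strip() else None for v in row] for row in reader]

print(columns)
print(rows)
```

With pandas installed, the same table_text can be passed to pandas.read_csv(io.StringIO(table_text), sep="\t") to get a DataFrame for the analysis mentioned in the question.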