Python: how to split multiple data blocks with a regex
We got a new test unit at uni which saves its test results to a .txt file. The file has the following format:
header
data (with data header)
closing information
What I want to do is separate these blocks with a regex and then run further data analysis on the data blocks with pandas in Python.
Example dataset:
MTM Test Data File
Output file name: C:\2021\CRUDE\Testrun\Testrun.mtmd
Profile file name: D:\2021\CRUDE\CRUDE.mtmp
Profile description: CRUDE Testrun
Lubricant name:
Comments:
Number of steps in profile: 12
Number of steps completed: 12
Test started at 09.02.2021 12:06:13
Step 1 started at 09.02.2021 12:16:22
Step type Traction Step description SRR 1 -100%. 30N. 40C
Zero traction force (N) 0.9055 (measured at the start of this step)
Disc track radius used for this step (mm) 21.078
SRR (%) Traction Coeff (-) Step Time (s) Pot RTD Temp (degC) Lube RTD Temp (degC) Ball Load (N) Pin Load (N) Wear (um) ECR (%) Ball Speed 1 Ball Speed 2 Ball Speed 3 Ball Speed 4 Disc Speed 1 Disc Speed 2 Disc Speed 3 Disc Speed 4 Disc Frequency (Hz) Rolling Speed (mm/s) Sliding Speed (mm/s) SRR (%) TF1 (N) TF2 (N) TF3 (N) TF4 (N) Traction Force (N) Traction Coeff (-) Rolling Force (N) Rolling Coeff (-) Traction Trace (-)
1.003 0.0022 6 39.7 40.1 29.868 341 0 1990.03 2009.93 2009.74 1989.53 1999.81 20.05 1.003 1.0718 0.9433 0.0643 0.0022
1.993 0.0041 12 39.6 40.1 29.989 337 0 1979.91 2019.36 2020.27 1980.02 1999.89 39.85 1.993 1.1262 0.8794 0.1234 0.0041
3.011 0.0060 18 39.6 40.0 29.866 349 0 1969.89 2030.23 2030.18 1970.07 2000.09 60.22 3.011 1.1759 0.8182 0.1789 0.0060
3.997 0.0075 24 39.5 39.9 29.825 340 0 1959.39 2039.68 2039.81 1960.23 1999.78 79.94 3.997 1.2227 0.7727 0.2250 0.0075
Step 2 started at 09.02.2021 12:18:53
Step type Stribeck Step description 50% SRR. 30N. 0-3.2m/s. 40C
Zero traction force (N) 0.7063 (measured at the start of this step)
Disc track radius used for this step (mm) 21.078
Rolling Speed (mm/s) Traction Coeff (-) Step Time (s) Pot RTD Temp (degC) Lube RTD Temp (degC) Ball Load (N) Pin Load (N) Wear (um) ECR (%) Ball Speed 1 Ball Speed 2 Ball Speed 3 Ball Speed 4 Disc Speed 1 Disc Speed 2 Disc Speed 3 Disc Speed 4 Disc Frequency (Hz) Rolling Speed (mm/s) Sliding Speed (mm/s) SRR (%) TF1 (N) TF2 (N) TF3 (N) TF4 (N) Traction Force (N) Traction Coeff (-) Rolling Force (N) Rolling Coeff (-) Traction Trace (-)
3199.615 0.0231 6 39.4 40.1 29.986 341 0 2400.32 3998.23 3999.27 2400.64 3199.62 1598.27 49.952 1.6923 0.3072 0.6925 0.0231
2999.596 0.0231 12 39.4 40.1 29.980 335 0 2250.96 3748.61 3748.57 2250.24 2999.60 1497.99 49.940 1.6982 0.3154 0.6914 0.0231
2799.767 0.0233 19 39.5 40.1 29.971 341 0 2101.02 3498.66 3498.66 2100.73 2799.77 1397.79 49.925 1.7011 0.3028 0.6991 0.0233
2599.883 0.0236 25 39.5 40.1 30.035 338 0 1950.80 3249.04 3249.38 1950.31 2599.88 1298.66 49.951 1.7147 0.2943 0.7102 0.0236
[continues with Step 3, Step 4, Step 5, etc. through Step 12]
All steps completed. test completed normally at 09.02.2021 13:10:23
The code I have so far is:
import re

crodaFile = "C:\\2021\\CRUDE\\Testrun\\Testrun.txt"

with open(crodaFile) as f:
    fileread = f.read()

# Lazily match everything up to (but not including) the next "Step <n>"
regex = r".+?(?=Step \d+)"
p = re.compile(regex, re.MULTILINE | re.DOTALL)
m = p.finditer(fileread)
for match in m:
    print(match)
Output:
<re.Match object; span=(0, 321), match='MTM278 Test Data File\nOutput file name:\tC:\\202>
<re.Match object; span=(321, 4199), match='Step 1 started at 09.02.2021 12:16:22\nStep type\>
<re.Match object; span=(4199, 9379), match='Step 2 started at 09.02.2021 12:18:53\nStep type\>
<re.Match object; span=(9379, 13277), match='Step 3 started at 09.02.2021 12:22:31\nStep type\>
<re.Match object; span=(13277, 18484), match='Step 4 started at 09.02.2021 12:25:09\nStep type\>
<re.Match object; span=(18484, 22361), match='Step 5 started at 09.02.2021 12:36:11\nStep type\>
<re.Match object; span=(22361, 27544), match='Step 6 started at 09.02.2021 12:38:45\nStep type\>
<re.Match object; span=(27544, 31433), match='Step 7 started at 09.02.2021 12:42:21\nStep type\>
<re.Match object; span=(31433, 36638), match='Step 8 started at 09.02.2021 12:44:57\nStep type\>
<re.Match object; span=(36638, 40561), match='Step 9 started at 09.02.2021 12:55:58\nStep type\>
<re.Match object; span=(40561, 45816), match='Step 10 started at 09.02.2021 12:58:28\nStep type>
<re.Match object; span=(45816, 49744), match='Step 11 started at 09.02.2021 13:04:22\nStep type>
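I am missing the last step (Step 12) and the closing information "All steps completed. test completed normally at 09.02.2021 13:10:23".
Any suggestions on how I can reach the last step and the closing information?
You miss the last step because the lookahead assertion (which does not consume any text) requires another "Step <number>" to its right, and nothing of that form follows Step 12.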
Instead of using a lookahead, you can match a Step line and all following lines that do not start with Step and a digit.
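A minimal sketch of the difference, run on a made-up miniature of the file (the `sample` string and the variable names are illustrative assumptions, not part of the real data). It shows the lookahead version dropping the last block, and the "step line plus following non-step lines" version keeping it:

```python
import re

# Hypothetical miniature of the file: header, two steps, closing line.
sample = (
    "MTM Test Data File\n"
    "Number of steps in profile: 2\n"
    "Step 1 started at 09.02.2021 12:16:22\n"
    "1.003\t0.0022\n"
    "Step 2 started at 09.02.2021 12:18:53\n"
    "3199.615\t0.0231\n"
    "All steps completed. test completed normally at 09.02.2021 13:10:23\n"
)

# Original approach: every match must be followed by another "Step <n>",
# so the text starting at the last step header is never matched.
lookahead = re.compile(r".+?(?=Step \d+)", re.DOTALL)
lookahead_blocks = [m.group() for m in lookahead.finditer(sample)]

# Alternative: a step header line, then every following line that does
# not itself start with "Step <digit>". re.MULTILINE lets ^ match at the
# start of each line.
step_pattern = re.compile(r"^Step \d+.*(?:\r?\n(?!Step \d).*)*", re.MULTILINE)
step_blocks = [m.group() for m in step_pattern.finditer(sample)]

print(len(lookahead_blocks), "blocks with the lookahead")
print(len(step_blocks), "step blocks with the line-based pattern")
```

In this simple form the closing line stays attached to the last step block, because nothing tells the pattern where the step ends.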
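If the layout of the file is always the same and you want all the separate parts, you can end each Step match at the next line that starts with Step followed by a digit or at an empty line, and get the header and the closing information by matching those two parts explicitly. The pattern uses ^ and $ as line anchors, so compile it with re.MULTILINE.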
^(?:Step \d+.*(?:\r?\n(?!Step \d|$).*)*|(?:MTM|All steps completed.).*(?:\r?\n(?!Step \d).*)*)
Example output, shortened to 3 steps:
<re.Match object; span=(0, 283), match='MTM Test Data File\nOutput file name: C:\\2021\>
<re.Match object; span=(284, 1862), match='Step 1 started at 09.02.2021 12:16:22\nStep type >
<re.Match object; span=(1864, 3484), match='Step 2 started at 09.02.2021 12:18:53\nStep type >
<re.Match object; span=(3487, 5107), match='Step 3 started at 09.02.2021 12:18:53\nStep type >
<re.Match object; span=(5110, 5177), match='All steps completed. test completed normally at 0>
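As a rough sketch of how the matches could then be sorted into header, steps and closing information, and how a step block might be turned into rows for further analysis: the helper names `split_sections` and `parse_step` and the `sample` text are hypothetical, and the parsing assumes tab-separated fields with the column-header line directly above the first numeric row.

```python
import re

# The pattern from the answer; re.MULTILINE makes ^ and $ work per line.
SECTION = re.compile(
    r"^(?:Step \d+.*(?:\r?\n(?!Step \d|$).*)*"
    r"|(?:MTM|All steps completed.).*(?:\r?\n(?!Step \d).*)*)",
    re.MULTILINE,
)


def split_sections(text):
    """Return (header, steps, closing); steps is a list of step blocks."""
    header, closing, steps = None, None, []
    for m in SECTION.finditer(text):
        block = m.group().strip()
        if block.startswith("Step "):
            steps.append(block)
        elif block.startswith("MTM"):
            header = block
        else:
            closing = block
    return header, steps, closing


def parse_step(block):
    """Split one step block into (column names, numeric rows).

    Assumption: fields are tab-separated, and the last non-numeric line
    before the first numeric row holds the column names. Empty cells
    become None.
    """
    columns, rows = [], []
    for line in block.splitlines():
        fields = line.split("\t")
        try:
            rows.append([float(f) if f.strip() else None for f in fields])
        except ValueError:
            if not rows:
                columns = fields  # keep the latest non-numeric line
    return columns, rows


# Hypothetical miniature of the file, with blank lines between sections.
sample = (
    "MTM Test Data File\n"
    "Number of steps completed: 2\n"
    "Step 1 started at 09.02.2021 12:16:22\n"
    "Step type\tTraction\n"
    "SRR (%)\tTraction Coeff (-)\tStep Time (s)\n"
    "1.003\t0.0022\t6\n"
    "1.993\t0.0041\t12\n"
    "\n"
    "Step 2 started at 09.02.2021 12:18:53\n"
    "Step type\tStribeck\n"
    "Rolling Speed (mm/s)\tTraction Coeff (-)\tStep Time (s)\n"
    "3199.615\t0.0231\t6\n"
    "\n"
    "All steps completed. test completed normally at 09.02.2021 13:10:23\n"
)

header, steps, closing = split_sections(sample)
columns, rows = parse_step(steps[0])
print(columns)
print(rows)
```

From here, something like `pandas.DataFrame(rows, columns=columns)` would be the natural next step, provided the number of column names matches the width of the rows in the real file.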
Great answer, thank you! The layout of the file "should" always be the same. I haven't found any outliers yet.