Python function to read variable-length blocks of data from a file while it is open
My data file contains data for many timesteps. Each timestep is formatted as follows:
TIMESTEP PARTICLES
0.00500103 1262
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....
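I call the reader (defined below) like this:
import time

with open(fileOpenPath, 'r') as file:
    # time.clock() was removed in Python 3.8; perf_counter() is the replacement
    startWallTime = time.perf_counter()
    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)
    ## Do processing stuff here
    print("Timestep Processed")
    endWallTime = time.perf_counter()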
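Each block consists of 3 header lines followed by a number of data lines for that timestep (the count is the int on header line 2). The number of data lines in a block can vary from 0 to 10 million. There may be a blank line between blocks, but sometimes it is missing.
I want to read the file block by block and process each block's data right after reading it. The files are large (often over 200 GB), but a single timestep fits comfortably in memory.
Given the file format, I thought it would be easy to write a function that reads the 3 header lines, reads the actual data, and returns a nice numpy array for processing.
I am used to MATLAB, where you can simply keep reading blocks while not at the end of the file. I am not sure how to do this in Python.
I created the following function to read a block of data: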
import numpy as np

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0
    line = f.readline().strip()
    if line.startswith('TIMESTEP'):
        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split()
        parts = varData.split()
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        # Read exactly numParticles data lines for this timestep
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split()))
            linesProcessed += 1
        mydt = np.dtype([('ID', int),
                         ('GROUP', int),
                         ('Vol', float),
                         ('Mass', float),
                         ('Px', float),
                         ('Py', float),
                         ('Pz', float),
                         ('Vx', float),
                         ('Vy', float),
                         ('Vz', float),
                         ])
        particleData = np.array(particleData, dtype=mydt)
    return Timestep, numParticles, particleData
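The problem is that it reads only the first block from the file and stops there. I don't know how to make it loop through the file until it reaches the end.
Any suggestions on how to make this work would be great. I suppose I could write something that processes the file line by line, with lots of if checks to see whether I am at the end of a timestep, but a simple function seems simpler and cleaner.
with does not loop; it only ensures that the file is closed properly afterwards. To loop, you need to add a while loop right after the with statement (see the code below). Before that, though, readBlock(f) has to check for end of file (EOF). Replace
line = f.readline().strip()
with: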
line = f.readline()
if not line:
    # EOF: returning None's.
    return None, None, None
# We do the strip after the check.
# Otherwise a blank line "\n" might be interpreted as EOF.
line = line.strip()
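To make the reasoning behind that ordering concrete, here is a minimal sketch showing that readline() returns an empty string only at EOF, while a blank line comes back as "\n". The helper name classify_lines is hypothetical and exists only for this illustration.

```python
import io

def classify_lines(f):
    """Label each readline() result as 'data', 'blank' or 'eof'."""
    labels = []
    while True:
        raw = f.readline()
        if raw == "":
            # An empty string is returned only at end of file.
            labels.append("eof")
            break
        # A blank line is "\n" before stripping, so it is not mistaken for EOF.
        labels.append("blank" if raw.strip() == "" else "data")
    return labels

demo = io.StringIO("TIMESTEP PARTICLES\n\n0.005 0\n")
print(classify_lines(demo))
```

If the strip happened before the check, the blank line in the middle would also become an empty string and the reader would stop early.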
Then add a while loop inside the with block and check whether None was returned, which signals EOF, so that we can break out of the while loop:
import time

with open('file1') as file_handle:
    while True:
        startWallTime = time.perf_counter()  # time.clock() was removed in Python 3.8
        Timestep, numParticles, particleData = readBlock(file_handle)
        if Timestep is None:
            # readBlock signals EOF by returning None's
            break
        print(Timestep)
        ## Do processing stuff here
        print("Timestep Processed")
        endWallTime = time.perf_counter()
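As an alternative to the None sentinel, the same loop can be sketched as a generator, so the caller just writes a for loop. This is only an illustration under stated assumptions: the name iter_blocks is hypothetical, every block is assumed to start with a TIMESTEP line followed by the two other header lines, and blank lines are assumed to occur only between blocks.

```python
import io

def iter_blocks(f):
    """Yield (timestep, n_particles, rows) for each block until EOF.

    rows is a list of tuples of strings, one tuple per particle line.
    """
    while True:
        line = f.readline()
        if not line:                  # '' means end of file
            return
        if not line.strip():          # optional blank separator line
            continue
        if not line.startswith("TIMESTEP"):
            raise ValueError(f"expected a TIMESTEP header, got {line!r}")
        timestep_str, count_str = f.readline().split()
        f.readline()                  # column-name line (ID GROUP ...)
        n = int(count_str)
        # range(0) is empty, so a zero-particle block yields an empty list
        rows = [tuple(f.readline().split()) for _ in range(n)]
        yield float(timestep_str), n, rows

sample = (
    "TIMESTEP PARTICLES\n0.005 2\nID GROUP\n1 0\n2 0\n"
    "\n"
    "TIMESTEP PARTICLES\n0.010 0\nID GROUP\n"
    "TIMESTEP PARTICLES\n0.015 1\nID GROUP\n3 0\n"
)
for timestep, n, rows in iter_blocks(io.StringIO(sample)):
    print(timestep, n, rows)
```

Inside the for loop, rows can be converted with np.array(rows, dtype=mydt) as in readBlock. Parsing every line in Python is simple but may be slow for blocks with millions of particles.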
Here is a quick-and-dirty test (it worked on the second try!).
Test run:
1458:~/mypy$ python3 stack41091659.py
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 3
4
(3,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 2
3
(2,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
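I make use of the fact that genfromtxt is happy with anything that feeds it a block of lines. Here I collect the next block in a list and pass it to genfromtxt.
Using genfromtxt's max_rows parameter, I can tell it to read the next n lines directly: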
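import numpy as np

with open('stack41091659.txt', 'rb') as f:
    while f.readline():  # consumes the "TIMESTEP PARTICLES" line; empty at EOF
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        # names=True takes the column names from the "ID GROUP ..." line
        data = np.genfromtxt(f, dtype=None, names=True, max_rows=n)
        print(data.shape, len(data.dtype.names))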
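I did not account for the optional blank line. It could probably be skipped at the start of the block read, that is, keep reading lines until I get one that holds a valid float int pair.
You can use the max_rows parameter of genfromtxt for this.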
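One way the blank-line skipping and max_rows could be combined is sketched below. It is not the code from either answer. The names read_blocks and PARTICLE_DTYPE are hypothetical, the dtype simply mirrors the column header in the file, and zero-particle blocks are handled separately because the comments report that genfromtxt does not accept a row count of zero.

```python
import io
import numpy as np

# Column layout assumed from the file's "ID GROUP VOLUME ..." header line.
PARTICLE_DTYPE = np.dtype([
    ("ID", int), ("GROUP", int), ("VOLUME", float), ("MASS", float),
    ("PX", float), ("PY", float), ("PZ", float),
    ("VX", float), ("VY", float), ("VZ", float),
])

def read_blocks(f):
    """Yield (timestep, data) per block; data is a 1-D structured array."""
    while True:
        line = f.readline()
        if not line:                  # end of file
            return
        if not line.strip():          # skip the optional blank line
            continue
        # line is the "TIMESTEP PARTICLES" header; the values are on the next line
        timestep, n = f.readline().split()
        n = int(n)
        f.readline()                  # discard the column-name line
        if n == 0:
            # Avoid calling genfromtxt with max_rows=0; build the empty array directly.
            data = np.empty(0, dtype=PARTICLE_DTYPE)
        else:
            # atleast_1d keeps a one-row block as shape (1,) rather than ()
            data = np.atleast_1d(
                np.genfromtxt(f, dtype=PARTICLE_DTYPE, max_rows=n))
        yield float(timestep), data
```

A caller would iterate with for timestep, data in read_blocks(file_handle): and process each array before the next block is read, so only one timestep is held in memory at a time.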
Here is a sample file:
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 3
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 2
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 5
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 3
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
Here is the output:
Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
(430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
(652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
(431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
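I still don't understand what your problem is. In this code you read only one block. What happens when you try to read the next one? Also, I think pandas.read_csv(f, nrows=X, sep=' ') would make this function better.
That is exactly the problem: it reads only one block. The file contains hundreds of timesteps, and I want it to keep returning a block until it reaches the end of the file.
What happens if you call readBlock again on the same file? It looks like you are using Python 3. Is that right?
@marat As long as I account for the possible blank line, it reads the next timestep, provided I have not closed the file.
Thanks. This now works almost flawlessly... It does not handle timesteps where the particle count, and so the number of lines, is zero. genfromtxt does not like max_rows = 0 (zero particles), so I wrapped it in an "if greater than zero" check, but that does not seem to help; it may have broken the blank-line check.
This seems to work fairly well, except for two things: returning None, None, None makes the output very messy when there is a blank line between every block, but my main problem is that once the 'while True' loop ends, the values returned by the function are gone.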
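On the last point: in the while True loop shown earlier, the final call returns None, None, None and is unpacked into the same names, so the last real block is overwritten before the break. A minimal sketch of one way around this follows. read_value is a hypothetical stand-in for readBlock, and the doubling is placeholder processing.

```python
import io

def read_value(f):
    """Hypothetical stand-in for readBlock: one number per line, None at EOF."""
    line = f.readline()
    if not line:
        return None
    return float(line)

def process_all(f):
    results = []      # per-block results that survive the loop
    last = None       # last real block, untouched by the EOF sentinel
    while True:
        value = read_value(f)      # read into a temporary name first
        if value is None:
            break                  # EOF: 'last' and 'results' are left intact
        last = value
        results.append(value * 2)  # placeholder for the real processing
    return last, results

print(process_all(io.StringIO("1\n2\n3\n")))
```

Only small per-block results should be accumulated this way; keeping every full block would defeat the purpose of reading a 200 GB file one timestep at a time.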