Unexpected behaviour of the read method with Python multiprocessing
I am reading the same file in binary mode from multiple processes. The file is opened in the parent process first, and the child processes are created afterwards. The actual code that reads a given part of the file is:
def _line(self, n: int) -> str:
    offset = self._lineOffsets[n]
    lineSize = self._lineSizes[n]
    self._criticalSectionLock.acquire()
    line = b""
    try:
        self._datasetFile.seek(offset)
        while len(line) < lineSize:
            # because read may return less than required number of bytes we must read in while loop
            block = self._datasetFile.read(lineSize-len(line))
            line += block
            if len(block) == 0:
                raise IOError(f"Failed to read whole sample on line {n} (indexed from 0).")
    finally:
        self._criticalSectionLock.release()
    return line.decode("utf-8")
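The problem is that the read method does not always return the whole block. This is not random: it always happens on the same block. When I edit the code in the following way: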
def _line(self, n: int) -> str:
    offset = self._lineOffsets[n]
    lineSize = self._lineSizes[n]
    self._criticalSectionLock.acquire()
    line = b""
    try:
        while len(line) < lineSize:
            # because read may return less than required number of bytes we must read in while loop
            self._datasetFile.seek(offset+len(line))
            block = self._datasetFile.read(lineSize-len(line))
            line += block
            if len(block) == 0:
                raise IOError(f"Failed to read whole sample on line {n} (indexed from 0).")
    finally:
        self._criticalSectionLock.release()
    return line.decode("utf-8")
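the problem disappears. What seems to happen is that the file offset is advanced by the size I asked to read, even though I actually received a smaller block.
So I would like to ask: how can this happen?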
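One possible contributing factor, offered here as an assumption rather than a confirmed diagnosis, is that a file opened with `open(path, "rb")` is a buffered reader: it may fetch more bytes from the operating system than the caller asked for, so the descriptor-level offset (which child processes inherit and share after a fork) can run ahead of the position the Python object reports. The minimal sketch below illustrates this in a single process; the temporary file and the variable names are hypothetical and not part of the original code.

```python
import os
import tempfile

# Hypothetical scratch file, large enough to exceed any default buffer size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1_000_000)
    path = tmp.name

try:
    with open(path, "rb") as f:      # buffered binary reader
        data = f.read(10)            # ask for 10 bytes only
        logical_pos = f.tell()       # position as reported by the Python object
        # Offset of the underlying descriptor; this is the offset that
        # processes sharing the same open file would also see.
        os_offset = os.lseek(f.fileno(), 0, os.SEEK_CUR)
finally:
    os.remove(path)

print(len(data), logical_pos, os_offset)
```

If the descriptor offset printed last is larger than the logical position, the reader has read ahead into its own private buffer, which other processes sharing the descriptor know nothing about.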
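To pre-empt some questions, here are the reasons why the file is not read in the standard line-by-line way:
- The file is large
- Holding the whole file in memory is not a desirable approach
- Lines need to be accessed non-sequentially
- Reading in binary mode is faster
- The size and offset of every line are known in advance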
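Given these requirements (random access by known offset and size), one alternative worth considering is a positional read that never touches the shared file offset. The sketch below uses `os.pread`, which is available on Unix-like systems only; the helper `read_exact`, the sample lines, and the temporary file are hypothetical names introduced for illustration, not the original class. Opening a separate file object in each child process would be another option.

```python
import os
import tempfile


def read_exact(fd: int, offset: int, size: int) -> bytes:
    """Read exactly `size` bytes starting at `offset` with positional reads."""
    chunks = []
    remaining = size
    while remaining > 0:
        # os.pread may return fewer bytes than requested, so keep looping.
        block = os.pread(fd, remaining, offset)
        if not block:
            raise IOError(f"Unexpected end of file at offset {offset}.")
        chunks.append(block)
        offset += len(block)
        remaining -= len(block)
    return b"".join(chunks)


# Hypothetical sample data with precomputed line offsets and sizes.
lines = [b"alpha\n", b"beta\n", b"gamma\n"]
sizes = [len(item) for item in lines]
offsets = [sum(sizes[:i]) for i in range(len(lines))]

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(lines))
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
try:
    second = read_exact(fd, offsets[1], sizes[1]).decode("utf-8")
finally:
    os.close(fd)
    os.remove(path)

print(repr(second))
```

Because each call passes its own offset, no seek is involved and no user-space buffer is kept between calls, so this sketch does not rely on a lock to protect the file position.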