Python: unable to iterate over a cStringIO object
In a script, I am writing lines to a file, but some of those lines may be duplicates. So I created a temporary cStringIO file-like object, which I call the "intermediate file". I first write the lines to the intermediate file, remove the duplicates, and then write to the real file.

To do that, I wrote a simple for loop that iterates over every line of my intermediate file and removes any duplicates:

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

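My problem is that the for loop never executes. I can verify this by setting a breakpoint in the debugger: that line of code is skipped and the function exits. I even read the answer linked in the code comment and inserted cStringIO.OutputType.getvalue(f_temp), but that did not solve my problem.

I don't understand why I can't read and iterate over the file-like object.

The answer you referenced is a little incomplete. It tells you how to get the cStringIO buffer as a string, but you then have to do something with that string. You can do it like this:
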
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    contents = f_temp.getvalue()  # simpler approach
    contents = contents.strip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')  # convert to iterable
    for line in lines:  # Iterate through the list of lines.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')
            lines_seen.add(line)
    f_out.close()

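To see why calling getvalue() and discarding its result changes nothing, the following minimal sketch may help. It assumes Python 3, where io.StringIO stands in for cStringIO (which exists only in Python 2), and the sample lines are made up for illustration.

```python
import io

buf = io.StringIO()
buf.write("alpha\nbeta\nalpha\n")

# getvalue() returns the whole buffer as a string. It does not move the
# read/write position, so calling it without using the result has no effect.
position_before = buf.tell()
contents = buf.getvalue()
position_after = buf.tell()

# splitlines() drops the line endings and yields no trailing empty item.
lines = contents.splitlines()

print(position_before == position_after)  # True
print(lines)  # ['alpha', 'beta', 'alpha']
```

Here splitlines() is used as a compact alternative to the strip('\n') and split('\n') pair; that is a choice made for this sketch, not part of the answer.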
But it is better to use regular I/O operations on the f_temp "file handle", like this:

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

Here is a test (using either version):
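The original test is not reproduced here. As a rough stand-in, this is a minimal sketch assuming Python 3, with io.StringIO in place of cStringIO. The name dedupe_lines is a hypothetical simplification of remove_duplicates that returns the unique lines instead of writing them to a file, and hashlib.md5 takes the place of the compute_md5 helper, which is not shown in the question.

```python
import hashlib
import io


def dedupe_lines(f_temp):
    """Return the unique lines of f_temp in first-seen order (hypothetical helper)."""
    hashes_seen = set()
    unique = []
    f_temp.seek(0)  # rewind: after writing, the position is at the end of the buffer
    for line in f_temp:
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest not in hashes_seen:
            hashes_seen.add(digest)
            unique.append(line)
    return unique


f_temp = io.StringIO()
f_temp.write("a\nb\na\nc\nb\n")

# Without rewinding, iteration starts at the end of the buffer and yields nothing.
lines_without_seek = list(f_temp)

lines_with_seek = dedupe_lines(f_temp)

print(lines_without_seek)  # []
print(lines_with_seek)     # ['a\n', 'b\n', 'c\n']
```

Unlike the functions above, this sketch keeps each original line and uses the hash only for the membership check, so the returned lines are the text itself rather than the digests.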

Is f_temp a file object? What is the purpose of cStringIO.OutputType.getvalue(f_temp) …?

@juanpa.arrivillaga Yes, it is a file-like object. Apparently, the purpose of cStringIO.OutputType.getvalue(f_temp) is to convert the cStringIO file-like object to the Output type so that it can be read. See the comments.

f_temp.seek(0) works! Thank you very much. I have one more quick question. Since f_temp (or any cStringIO object) is a "file-like" object, is it necessary to call f_temp.close() after I have finished reading all of its lines?

I would definitely close it once you are done with it. For a file or a StringIO, the garbage collector automatically frees the associated resources when the last reference goes out of scope, but relying on that is not considered good form. It is better to close the object explicitly when you are finished with it. This matters especially if you are rapidly creating and closing many of them. Usually, the easiest way to do this is to use a with clause at the open step.
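
As an illustration of the closing advice, here is a minimal sketch assuming Python 3, where io.StringIO supports the with statement directly. Under Python 2, cStringIO objects would, as far as I know, need to be wrapped in contextlib.closing instead. The sample lines are made up.

```python
import io

with io.StringIO() as f_temp:
    f_temp.write("first\nsecond\n")
    f_temp.seek(0)  # rewind before reading back what was written
    lines = f_temp.readlines()

# Leaving the with block closes the buffer, even if an exception was raised inside it.
print(f_temp.closed)  # True
print(lines)          # ['first\n', 'second\n']
```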