Python: cannot iterate over a cStringIO object

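In my script I am writing lines to a file, but some of those lines may be duplicates. So I created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove the duplicates, and then write to the real file.

So I wrote a simple for loop to iterate over every line in the intermediate file and remove any duplicates.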

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
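
My problem is that the for loop never executes. I can verify this by setting a breakpoint in the debugger: that line of code is simply skipped and the function exits. I even read the answer linked in the code above and inserted cStringIO.OutputType.getvalue(f_temp), but that did not solve my problem.

I don't understand why I can't read and iterate over the file-like object.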

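The answer you referenced is a bit incomplete. It tells you how to get the cStringIO buffer as a string, but then you have to do something with that string. You could do it like this: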

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    contents = f_temp.getvalue()     # simpler approach
    contents = contents.strip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')     # convert to iterable

    for line in lines:  # Iterate through the list of lines.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')
            lines_seen.add(line)
    f_out.close()

But it is better to use regular I/O operations on the f_temp "file handle", like this:

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

Here is a test (using either one):
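
The original test listing is not shown here, so the following is only a minimal sketch of what such a test could look like. It makes several assumptions: it uses io.StringIO (Python 3) in place of cStringIO, compute_md5 is a hypothetical stand-in built on hashlib, and remove_duplicates takes an already-open output object instead of calling define_outputs(dir_out). It also writes the original line rather than its hash, on the assumption that the output file is meant to hold the de-duplicated lines themselves.

```python
import hashlib
import io


def compute_md5(line):
    """Hypothetical stand-in: hex MD5 digest of one line of text."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def remove_duplicates(f_temp, f_out):
    """Copy each distinct line of f_temp to f_out, keeping the first occurrence."""
    lines_seen = set()  # digests of the lines already written
    f_temp.seek(0)      # rewind: after writing, the position sits at the end
    for line in f_temp:
        digest = compute_md5(line)
        if digest not in lines_seen:
            f_out.write(line)  # write the original line, not its digest
            lines_seen.add(digest)


# Fill the intermediate buffer the same way the script would: by writing to it.
f_temp = io.StringIO()
f_temp.write("alpha\nbeta\nalpha\ngamma\nbeta\n")

f_out = io.StringIO()  # stands in for the real output file
remove_duplicates(f_temp, f_out)
result = f_out.getvalue()
print(result)
```

Removing the f_temp.seek(0) line from this sketch should reproduce the behaviour described in the question: the loop body never runs and the output stays empty.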




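Is f_temp a file object? What is cStringIO.OutputType.getvalue(f_temp) meant to do?

@juanpa.arrivillaga Yes, it is a file-like object. Apparently, the purpose of cStringIO.OutputType.getvalue(f_temp) is to convert the cStringIO file-like object to the Output type so that it can be read. See the comments.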
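
As the answer notes, getvalue() only returns the buffer contents as a string; it does not change what a later loop over the object will see. The sketch below illustrates this under the assumption that io.StringIO (Python 3) behaves like cStringIO in this respect; the variable names are illustrative only.

```python
import io

buf = io.StringIO()
buf.write("first\nsecond\n")  # writing leaves the position at the end

contents = buf.getvalue()     # whole buffer as a str, regardless of position
skipped = list(buf)           # iterating from the end yields nothing

buf.seek(0)                   # move the position back to the start
lines = list(buf)             # now every line is read

print(repr(contents), skipped, lines)
```

So calling getvalue() and discarding the result, as in the question, has no effect on the for loop; either the returned string has to be split into lines, or the position has to be reset with seek(0).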
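f_temp.seek(0) works! Thank you very much. I have one more quick question. Since f_temp (or any cStringIO object) is a "file-like" object, is it necessary to call f_temp.close() after I have finished reading all of its lines?

I would certainly close it once you are done with it. For a file or a StringIO, the garbage collector automatically releases the associated resources when the last reference goes out of scope, but relying on that is not considered good form. It is better to close objects explicitly when you are finished with them, which matters especially if you are creating and closing a lot of them in quick succession. Usually the easiest way to do this is a with clause at the open step.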
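
A minimal sketch of the with approach, assuming io.StringIO (Python 3), which supports the with statement directly. As far as I know, Python 2's cStringIO objects do not implement the context-manager protocol, in which case contextlib.closing() is one way to get the same effect for any object that only provides close(). The variable names are illustrative.

```python
import contextlib
import io

# io.StringIO can be used in a with statement directly.
with io.StringIO() as f_temp:
    f_temp.write("alpha\nbeta\n")
    f_temp.seek(0)
    lines = f_temp.readlines()
closed_after_with = f_temp.closed  # the buffer is closed on leaving the block

# contextlib.closing() calls close() on exit for objects without with support.
with contextlib.closing(io.StringIO("gamma\n")) as f_other:
    first = f_other.readline()
closed_after_closing = f_other.closed

print(lines, closed_after_with, first, closed_after_closing)
```

Either form closes the buffer even if an exception is raised inside the block, which is the main advantage over a trailing close() call.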