Python: unable to iterate over a cStringIO object

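In a script, I am writing lines to a file, but some of those lines may be duplicates. So I created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove the duplicates, and then write to the real file.

To do that, I wrote a simple for loop that iterates over each line in my intermediate file and removes any duplicates.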

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

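My problem is that the for loop never executes. I can verify this by setting a breakpoint in my debugger: that line of code is skipped and the function exits. I even inserted the call cStringIO.OutputType.getvalue(f_temp) (from the answer linked in the code comment), but that did not solve my problem.

I don't understand why I can't read and iterate over my file-like object.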

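The answer you referenced is a bit incomplete. It tells you how to get the cStringIO buffer as a string, but then you have to do something with that string. You can do that like this: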

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    contents = f_temp.getvalue()     # simpler approach
    contents = contents.strip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')     # convert to iterable

    for line in lines:  # Iterate through the list of lines.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')
            lines_seen.add(line)
    f_out.close()

But it is probably better to use normal IO operations on the f_temp "file handle", like this:

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
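
The reason seek(0) matters is that writing leaves the buffer's position at the end, so a read or an iteration that starts from there finds nothing. The following is a minimal sketch of that behaviour, not the asker's code. It assumes Python 3's io.StringIO as a stand-in for Python 2's cStringIO.StringIO, and the sample strings are made up for illustration.

```python
import io

# Assumption: io.StringIO stands in for Python 2's cStringIO.StringIO.
f_temp = io.StringIO()
f_temp.write("alpha\n")
f_temp.write("beta\n")

# After the writes, the position sits at the end of the buffer,
# so iterating from here yields no lines at all.
position_after_write = f_temp.tell()
lines_without_seek = list(f_temp)

# Rewinding moves the position back to the start, so iteration works.
f_temp.seek(0)
lines_after_seek = list(f_temp)

print(lines_without_seek)  # []
print(lines_after_seek)    # ['alpha\n', 'beta\n']
```

getvalue() is not affected by the position, which is why the first version above also works without rewinding.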

Here is a test (using either one):
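
The test from the original post is not preserved here, so what follows is only a sketch of what such a test might look like. Several things in it are assumptions: io.StringIO replaces cStringIO so it runs on Python 3, compute_md5 is a hypothetical stand-in built on hashlib, and the function takes an already open f_out instead of calling define_outputs(dir_out). It also stores the hash in the set and writes the original line, which is a guess at the intent; the posted code writes whatever compute_md5 returns.

```python
import hashlib
import io


def compute_md5(line):
    # Hypothetical stand-in for the helper that is not shown in the question.
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def remove_duplicates(f_temp, f_out):
    """Copy unique lines from f_temp to f_out, keeping first-seen order."""
    hashes_seen = set()
    f_temp.seek(0)  # rewind before reading
    for line in f_temp:
        digest = compute_md5(line)
        if digest not in hashes_seen:
            f_out.write(line)
            hashes_seen.add(digest)


# Build an intermediate "file" that contains one duplicate line.
f_temp = io.StringIO()
for text in ["string 1\n", "string 2\n", "string 1\n", "string 3\n"]:
    f_temp.write(text)

f_out = io.StringIO()  # in-memory stand-in for the real output file
remove_duplicates(f_temp, f_out)
result = f_out.getvalue()
print(result)
```

With this input, the second "string 1" line is dropped and the other three lines come out in their original order.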

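Is f_temp a file object? What is the purpose of cStringIO.OutputType.getvalue(f_temp)?

@juanpa.arrivillaga Yes, it is a file-like object. Apparently, the purpose of cStringIO.OutputType.getvalue(f_temp) is to convert the cStringIO file-like object to an Output type so that it can be read. See the comments.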

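f_temp.seek(0) works! Thank you so much. I have one more quick question. Since f_temp (or any cStringIO object) is a "file-like" object, is it necessary to call f_temp.close() after I have finished reading all of its lines?

I would certainly close it when you are done with it. For a file or a StringIO, the garbage collector automatically frees the associated resources when the last reference goes out of scope, but relying on that is not considered good form. It is better to close the object explicitly once you have finished with it. This matters most if you are creating and closing a lot of them in quick succession. Usually the easiest way to do this is to use a with clause at the open step.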
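
As a rough illustration of the with suggestion, here is a sketch under a few assumptions. It uses Python 3's io.StringIO, which supports the with statement directly. As far as I recall, Python 2's cStringIO objects do not implement the context-manager protocol, so contextlib.closing is the usual workaround there; treat that part as an assumption to check against your interpreter. The sample strings are made up.

```python
import io
from contextlib import closing

# io.StringIO can be used in a with statement; it is closed on exit.
with io.StringIO() as f_temp:
    f_temp.write("alpha\n")
    f_temp.seek(0)
    first_pass = f_temp.read()
closed_after_with = f_temp.closed

# closing() wraps any object that has a close() method and calls it on exit,
# which helps for objects that cannot be used with "with" on their own.
with closing(io.StringIO()) as f_other:
    f_other.write("beta\n")
    second_pass = f_other.getvalue()
closed_after_closing = f_other.closed

print(closed_after_with, closed_after_closing)  # True True
```

Anything you need from the buffer has to be read before the with block ends, because getvalue() and read() raise ValueError on a closed object.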
