The most efficient way to modify the last line of a large text file in Python

python io

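I need to update the last line of several 2 GB+ files made up of lines of text that cannot be read with readlines(). Currently it works by looping through the file line by line. However, I am wondering whether there is a compiled library that can achieve this more efficiently. Thanks!

Current approach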

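If this really is line-based (so that a true XML parser, which would otherwise be the best solution, is not needed), mmap can help here.

mmap the file, then call .rfind('\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when you really want the non-empty line before it rather than the empty "line" after it). You can then slice out the last line on its own. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the sliced line and the new line, then write the new line back. This avoids reading or writing any more of the file than you need.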
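To make the newline adjustment concrete, here is a minimal sketch on plain bytes objects (bytes.rfind accepts the same start and end arguments as mmap.rfind). The helper names last_line_start and last_line are hypothetical and only serve the illustration.

```python
def last_line_start(buf):
    """Offset of the first byte of the last non-empty line in buf.

    Searching only up to len(buf) - 1 skips a trailing newline, and the
    + 1 steps past the newline that was found. When no newline is found,
    rfind returns -1, so the result is 0 (start of the buffer).
    """
    return buf.rfind(b"\n", 0, len(buf) - 1) + 1


def last_line(buf):
    """The last line of buf with any trailing CR/LF removed."""
    return buf[last_line_start(buf):].rstrip(b"\r\n")


if __name__ == "__main__":
    for sample in (b"one\ntwo\n", b"one\ntwo", b"only", b"a\r\nb\r\n"):
        print(sample, last_line_start(sample), last_line(sample))
```

The same expression works whether or not the file ends with a newline, which is why no separate "not found" branch is needed.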

Example code (please comment if I made a mistake):
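The following is a minimal sketch of the resize-based approach described above, not necessarily the answerer's exact code. It assumes a platform where mmap.resize is available (for example Linux), a UTF-8 encoded file, and a non-empty file; replace_last_line_resize and transform are hypothetical names.

```python
import mmap


def replace_last_line_resize(path, transform):
    """Replace the last line of the file at path in place.

    transform is a caller-supplied str -> str function (an assumption of
    this sketch). The original line ending, if any, is preserved.
    """
    with open(path, "r+b") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        # Skip a trailing newline, then step past the previous newline.
        start = mm.rfind(b"\n", 0, len(mm) - 1) + 1
        tail = mm[start:]
        old = tail.rstrip(b"\r\n")
        ending = tail[len(old):]  # b"", b"\n" or b"\r\n"
        new = transform(old.decode("utf-8")).encode("utf-8") + ending
        # Grow or shrink the mapping and the underlying file to fit.
        mm.resize(start + len(new))
        # Slice assignment requires equal lengths, which resize guarantees.
        mm[start:] = new
```

A call such as replace_last_line_resize("large.XML", lambda s: s.upper()) would rewrite only the final line; on systems without mremap the resize call fails, which is what the next paragraph addresses.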

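Apparently, on some systems without mremap (e.g. OSX), mm.resize does not work, so to support those systems you would probably want to separate the two with statements (so that the mmap is closed before the file object) and use file-object-based seek, write and truncate to fix up the file. The following example includes the Python 3.1 and earlier adjustment I mentioned, using contextlib.closing, for completeness: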
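import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        # len(mm) - 1 skips a trailing newline; + 1 steps past the previous newline
        # (and gives 0 for a single-line file, where rfind returns -1)
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        # do_something is a placeholder for your own str -> str transformation
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # The mmap is closed at this point; fix up the file through the file object
    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line (add b'\n' to new_line if needed)
    myfile.truncate()  # If existing line longer than new line, get rid of the excess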

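The advantages of mmap over any other approach are:
No need to read any more of the file than the line itself (meaning 1-2 pages of the file; the rest is never read or written).
Using rfind lets Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads on a file object could match the "only read a page or so" behaviour, but you would have to implement the newline search by hand.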
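Caveat: this approach will not work if you are on a 32-bit system and the file is too large to map into memory (at least, not without modifications to avoid mapping more than 2 GB and to handle resizing when the whole file might not be mapped). On most 32-bit systems, even in a newly spawned process, only 1-2 GB of contiguous address space is available; in certain special cases you might have as much as 3-3.5 GB of user virtual addresses (though you will lose some of the contiguous space to the heap, stack, executable mappings, and so on). mmap does not need much physical RAM, but it does need contiguous address space. One of the huge benefits of a 64-bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it could not handle on a 32-bit OS without added complexity. Most modern computers are 64-bit at this point, but it is definitely something to keep in mind if you are targeting 32-bit systems (and on Windows, even if the OS is 64-bit, a 32-bit version of Python may have been installed by mistake, in which case the same problem applies).

Here is one more example that works on 32-bit Python even for huge files, assuming the last line is not 100+ MB long (closing and the imports are omitted for brevity):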
with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
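The offset arithmetic and the "does startofline really follow a newline" check mentioned in the comments above can be sketched as two small helpers. This is an illustration under the assumption that mmap.ALLOCATIONGRANULARITY is a power of two (which the bit mask relies on); the names tail_window_offset and line_start_is_reliable are hypothetical.

```python
import mmap


def tail_window_offset(filesize, window=100 * 1024 ** 2):
    """Aligned offset at which to start mapping the tail of a file.

    mmap requires offset to be a multiple of ALLOCATIONGRANULARITY, so the
    desired start (filesize - window) is rounded down to that multiple.
    The mapped region may therefore be slightly larger than window.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    return max(0, filesize - window) & ~(granularity - 1)


def line_start_is_reliable(startofline, offset):
    """Whether startofline (from rfind(...) + 1) marks a real line start.

    A value of 0 means no newline was found inside the window. That is
    only a genuine line start when the window begins at the start of the
    file; otherwise the last line began somewhere before the window.
    """
    return startofline > 0 or offset == 0
```

When line_start_is_reliable returns False, one option is to retry with a larger window before writing anything.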

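Update: use the mmap-based answer. It is much shorter and more robust.
For posterity:
Read the last N bytes of the file and search backwards for the newline.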
#!/usr/bin/env python

# Create a small test file; binary mode means everything below uses bytes
with open("test.txt", "wb") as testfile:
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline); rindex raises
    # ValueError if the block contains no other newline
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()

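If it is XML, why not use an XML parser? You should be able to get something more efficient that way; I have used ElementTree and like it a lot.

Maybe the OP does not need to parse the XML? Also, isn't reading a big chunk of the file into memory exactly what should be avoided?

Related, though not exactly a duplicate, since this person wants to rewrite the end of the file rather than load it into RAM.

@Two-Bitalchest: As you suggested. Closing this question. Thanks.

Just use mm.rfind(b'\n', 0, len(mm) - 1). If the last byte is a newline, it will be skipped. If it is anything else, including a one-character line or a zero-character line, the code still works.

Bummer, on OSX: "SystemError: mmap: resizing not available--no mremap()". It looks like the solution is to close the file, reopen it, seek to startofline, and then write.

It should be startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1 (note the + 1) to preserve the preceding newline. It also has the happy side effect of removing the need to test for "not found".

@Harvey: Thanks! I have provided an alternate bit of code for systems without mmap.resize support, and fixed the startofline calculation. Off-by-one errors get you every time!

Honestly, I do not know whether this is better, worse, or the same as the answers on the duplicate, but I think you should consider adapting this answer to that question (in addition to here). That is, assuming it is not there already. That question has 19 answers and I have not read through them all.

You probably want to use rfind rather than rindex, or else handle single-line files by raising an exception in cases where a single line could be rewritten. I suppose it depends on whether the file is known to contain multiple lines.

@ShadowRanger: I started to do that, but then you do not know whether you have actually found the start of the line or just the start of the block. I recommend your answer, while leaving mine for everyone to see.

Ah, right. I forgot about reading in blocks. The big advantage of mmap is that you do not have to worry about that sort of thing. :-)

This has a pitfall: you need to establish an n that is guaranteed to be large enough to always include the final newline, or arrange a fallback to another approach (repeating with larger and larger blocks).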
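The fallback mentioned in the last comment (repeating with larger and larger blocks) could look roughly like the sketch below. It is an illustration rather than anyone's posted code: find_last_line_start and replace_last_line are hypothetical names, and the file is assumed to be opened in binary mode.

```python
import os


def find_last_line_start(f, blocksize=1024):
    """Return the byte offset at which the last line of binary file f begins.

    Reads the tail of the file in blocks that double in size until a
    newline (other than a trailing one) is found or the whole file has
    been searched.
    """
    f.seek(0, os.SEEK_END)
    filesize = f.tell()
    while True:
        blocksize = min(blocksize, filesize)
        f.seek(filesize - blocksize)
        block = f.read(blocksize)
        # Exclude the very last byte so a trailing newline is skipped.
        index = block.rfind(b"\n", 0, blocksize - 1)
        if index != -1:
            return filesize - blocksize + index + 1
        if blocksize == filesize:
            return 0  # No earlier newline anywhere: the file is one line.
        blocksize *= 2


def replace_last_line(path, new_line):
    """Overwrite the last line of the file at path with the bytes new_line."""
    with open(path, "r+b") as f:
        start = find_last_line_start(f)
        f.seek(start)
        f.write(new_line)
        f.truncate()  # Drop leftovers if the old line was longer.
```

Because the search only stops on a real newline or at the start of the file, the ambiguity between "start of line" and "start of block" does not arise, at the cost of re-reading the tail when the first block is too small.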