Python: Can I add a UTF-8 BOM to a file without opening it?

python unicode utf-8

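How can I add a UTF-8 BOM to a text file without opening it?

In theory, we only need to add the UTF-8 BOM at the beginning of the file, without reading in "all" of the content.

You need to read the data in, because you need to shift all of it to make room for the BOM. Files can't just have arbitrary data prepended. Doing this in place is harder than simply writing a new file containing the BOM followed by the original data and then replacing the original file, so the simplest solution is usually: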

import os
import shutil

from os.path import dirname, realpath
from tempfile import NamedTemporaryFile

infile = ...

# Open original file as UTF-8 and tempfile in same directory to add sig.
# delete=False: on success the temp file is renamed over the original, so
# automatic deletion on close would fail; it is removed manually on error.
# newline='' on both files passes line endings through unchanged.
indir = dirname(realpath(infile))
with NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig',
                        newline='', delete=False) as tf:
    try:
        with open(infile, encoding='utf-8', newline='') as f:
            # Copy from one file to the other by blocks
            # (avoids memory use of slurping whole file at once)
            shutil.copyfileobj(f, tf)

        # Optional: Replicate metadata of original file
        tf.flush()
        shutil.copystat(infile, tf.name)  # Replicate permissions of original file

        # Close the temp file first (required on Windows), then
        # atomically replace original file with BOM marked file
        tf.close()
        os.replace(tf.name, infile)
    except BaseException:
        # Remove the partial temp file if anything went wrong
        tf.close()
        os.remove(tf.name)
        raise

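This has the side effect of verifying that the input file really is UTF-8, and the original file is never left in an inconsistent state; it holds either the old data or the new data, never an intermediate working copy.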
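One caveat with the code above: it reads the input as plain 'utf-8', so a file that already starts with a BOM would keep it as a U+FEFF character and end up with a second BOM after conversion. A minimal sketch of a guard, assuming a hypothetical helper name `has_utf8_bom` (not part of the original answer):

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM bytes."""
    with open(path, 'rb') as f:
        # Only the first three bytes are read, so this is cheap even for huge files
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Calling it before the conversion and skipping files for which it returns True keeps the operation idempotent.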

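If your files are large and disk space is limited, so that you can't keep two copies on disk at once, mutating the file in place may be acceptable. The easiest way to do that is the mmap module, which makes moving the data around much simpler than in-place file object operations: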

import codecs
import mmap

# Open file for read and write and then immediately map the whole file for write
with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    origsize = mm.size()
    bomlen = len(codecs.BOM_UTF8)
    # Allocate additional space for BOM
    mm.resize(origsize+bomlen)

    # Copy file contents down to make room for BOM
    # This reads and writes the whole file, and is unavoidable
    mm.move(bomlen, 0, origsize)

    # Insert the BOM before the shifted data
    mm[:bomlen] = codecs.BOM_UTF8
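The mmap snippet above can be wrapped in a function that also handles two edge cases: mmap cannot map an empty file, and a file that already starts with a BOM should be left alone. This is only a sketch; the name `add_bom_mmap` and its boolean return value are assumptions made for illustration, and `mm.resize()` is not available on every platform.

```python
import codecs
import mmap


def add_bom_mmap(path):
    """Prepend the UTF-8 BOM in place; return False if it was already there."""
    bom = codecs.BOM_UTF8
    with open(path, 'r+b') as f:
        head = f.read(len(bom))
        if head == bom:
            return False  # already marked, nothing to do
        if not head:
            # Empty file: mmap refuses zero-length mappings, so just write the BOM
            f.write(bom)
            return True
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            origsize = mm.size()
            mm.resize(origsize + len(bom))    # grow the file by the BOM length
            mm.move(len(bom), 0, origsize)    # shift the existing data down
            mm[:len(bom)] = bom               # write the BOM into the gap
    return True
```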

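If you need an in-place update, something like

import codecs
import os
import resource

BOM = codecs.BOM_UTF8

def add_bom(fname, bom=None, buf_size=None):
    bom = bom or BOM
    # The buffer must be at least as long as the BOM so reads stay ahead of writes
    buf_size = max(buf_size or resource.getpagesize(), len(bom))
    with open(fname, 'rb', 0) as in_fd, open(fname, 'rb+', 0) as out_fd:
        # we cannot just read until eof, because we
        # will be writing to that very same file, extending it.
        remaining = os.fstat(in_fd.fileno()).st_size
        pending = in_fd.read(min(buf_size, remaining))
        if pending.startswith(bom):
            # don't write the BOM if it's already there
            return
        remaining -= len(pending)
        out_fd.write(bom)
        while pending:
            # read the next block *before* writing the pending one: the write
            # lands len(bom) bytes further along and would otherwise clobber
            # the start of a block that has not been read yet
            next_block = in_fd.read(min(buf_size, remaining))
            remaining -= len(next_block)
            out_fd.write(pending)
            pending = next_block

should do the trick.

As you can see, in-place updating is a bit fiddly, for two reasons. First, you are writing data to positions you have only just read from; reads must always stay ahead of writes, otherwise you overwrite data that has not been processed yet. Second, you are extending the very file you are reading, so reading until EOF does not work.

In addition, this can leave the file in an inconsistent state. If possible, prefer the copy to temp -> move temp to original approach.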

Adding content at the beginning of a file requires rewriting the whole file; you can only append to the end of a file, not insert somewhere in the middle. You also can't modify a file without opening it. So no: what you want is not possible.

@dhke "Without opening it" was indeed inaccurate. I have many large files, around 1 GB each. What is the best way to add the UTF-8 BOM?

@minion: You can't avoid reading and writing the full 1 GB. Your only choice is between a temp file, which gives you atomicity and safety but needs more temporary disk space, and an in-place modification, which is usually slower and can corrupt the data if interrupted midway but needs minimal extra disk space.

I added an alternative in-place solution that uses mmap to simplify the work. I find it much easier than trying to use file object operations.

@ShadowRanger Nice, I thought about that too, but if you look at the source of cp, you'll see that it also uses chunked mmap to avoid VM thrashing on large files. Not much of a problem on a 64-bit OS, but with a 3+ GB file you won't be able to mmap it on a 32-bit machine.

I assume it uses chunked mmap to stay within VM address space limits rather than to avoid thrashing, but yes, on 32-bit machines this doesn't scale past 1.5 GB or so. Solution: it's 2016, run a 64-bit OS and Python install. :-)
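For machines where mapping the whole file is not an option, the chunked idea raised in the comments can be sketched with plain file operations instead of mmap. The sketch below shifts the data back to front, so each block is read before anything is written over it. The function name `prepend_bom_chunked` and the default chunk size are assumptions for illustration, and like any in-place rewrite it can leave the file corrupted if interrupted midway.

```python
import codecs
import os


def prepend_bom_chunked(path, chunk_size=1 << 20):
    """Shift the file contents down by len(BOM) in fixed-size chunks, then write the BOM."""
    bom = codecs.BOM_UTF8
    with open(path, 'r+b') as f:
        pos = os.fstat(f.fileno()).st_size
        # Walk from the end of the file towards the start
        while pos > 0:
            n = min(chunk_size, pos)
            pos -= n
            f.seek(pos)
            data = f.read(n)
            # Writing len(bom) bytes further along only overwrites bytes
            # that were already moved in an earlier iteration
            f.seek(pos + len(bom))
            f.write(data)
        f.seek(0)
        f.write(bom)
```

Memory use is bounded by `chunk_size` regardless of file size, at the cost of one seek pair per chunk.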