Can Python 3 pickle handle bytes objects larger than 4 GB?

python python-3.x


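Based on this and the reference documentation, pickle protocol 4.0+ in Python 3.4+ should be able to pickle bytes objects larger than 4 GB.

However, using Python 3.4.3 or Python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I try to pickle a large byte array: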

>>> import pickle
>>> x = bytearray(8 * 1000 * 1000 * 1000)
>>> fp = open("x.dat", "wb")
>>> pickle.dump(x, fp, protocol = 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument


Is there a bug in my code, or am I misunderstanding the documentation?

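To summarize the answers given in the comments:

Yes, Python can pickle bytes objects larger than 4 GB. The observed error is caused by a bug in the implementation.

Here is a simple workaround. Use pickle.loads or pickle.dumps and split the bytes object into chunks of size 2**31 - 1 to get it into or out of the file.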

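import os.path
import pickle

file_path = "pkl.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1  # largest single read/write that avoids the error
data = bytearray(n_bytes)

## write: pickle in memory, then write the result in chunks of at most max_bytes
## (protocol 4 is the protocol that supports objects larger than 4 GB)
bytes_out = pickle.dumps(data, protocol=4)
with open(file_path, 'wb') as f_out:
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])

## read: read the file back in chunks, then unpickle from memory
bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
data2 = pickle.loads(bytes_in)

assert data == data2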

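Reading a file in 2 GB chunks takes twice as much memory as needed if bytes concatenation is performed. My approach to loading pickles is based on bytearray: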

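class MacOSFile(object):
    """File wrapper that splits large reads into chunks smaller than 2 GiB."""

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        # Delegate everything that is not overridden to the wrapped file.
        return getattr(self.f, item)

    def read(self, n):
        if n >= (1 << 31):
            buffer = bytearray(n)
            pos = 0
            while pos < n:
                # Parentheses matter: 1 << 31 - 1 parses as 1 << 30.
                size = min(n - pos, (1 << 31) - 1)
                chunk = self.f.read(size)
                if not chunk:
                    # End of file reached early: drop the unused tail.
                    del buffer[pos:]
                    break
                buffer[pos:pos + len(chunk)] = chunk
                pos += len(chunk)
            return buffer
        return self.f.read(n)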
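The same idea can be sketched without the temporary chunk object by reading straight into a preallocated bytearray through a memoryview. This is only an illustration: the helper name read_exact and its max_chunk parameter are made up for this example, and the chunk size is configurable so the logic can be tried on small inputs.

```python
import io


def read_exact(f, n, max_chunk=(1 << 31) - 1):
    """Read up to n bytes from binary file f, at most max_chunk bytes per call.

    Hypothetical helper: returns a bytearray that is shorter than n
    if the file ends early.
    """
    buffer = bytearray(n)
    pos = 0
    with memoryview(buffer) as view:
        while pos < n:
            end = min(n, pos + max_chunk)
            # readinto fills the slice in place, so no per-chunk copy is made.
            with view[pos:end] as window:
                got = f.readinto(window)
            if not got:  # 0 or None: nothing more to read
                break
            pos += got
    # The views are released here, so the buffer may be resized.
    del buffer[pos:]
    return buffer


source = io.BytesIO(bytes(range(256)) * 4)
result = read_exact(source, 1000, max_chunk=64)
```

With the default max_chunk, each underlying read stays below 2**31 bytes, which is the limit discussed in this thread.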

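Here is the complete workaround, although pickle.load no longer seems to try to read a huge file in one call (I am on Python 3.5.2), so strictly speaking only pickle.dump needs this to work properly.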

import pickle


class MacOSFile(object):
    """File wrapper that splits large reads and writes into chunks below 2 GiB."""

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        # Delegate everything that is not overridden to the wrapped file.
        return getattr(self.f, item)

    def read(self, n):
        # print("reading total_bytes=%s" % n, flush=True)
        if n >= (1 << 31):
            buffer = bytearray(n)
            idx = 0
            while idx < n:
                # Parentheses matter: 1 << 31 - 1 parses as 1 << 30.
                batch_size = min(n - idx, (1 << 31) - 1)
                # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                chunk = self.f.read(batch_size)
                if not chunk:
                    # End of file reached early: drop the unused tail.
                    del buffer[idx:]
                    break
                buffer[idx:idx + len(chunk)] = chunk
                # print("done.", flush=True)
                idx += len(chunk)
            return buffer
        return self.f.read(n)

    def write(self, buffer):
        n = len(buffer)
        print("writing total_bytes=%s..." % n, flush=True)
        idx = 0
        while idx < n:
            batch_size = min(n - idx, (1 << 31) - 1)
            print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
            self.f.write(buffer[idx:idx + batch_size])
            print("done.", flush=True)
            idx += batch_size


def pickle_dump(obj, file_path):
    with open(file_path, "wb") as f:
        return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(MacOSFile(f))
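To check the chunking logic without allocating gigabytes, the write side can be sketched with a configurable chunk size. The class name ChunkedWriter, its max_chunk parameter, and the chunk_sizes list are assumptions made for this illustration; they are not part of the answer above.

```python
import os
import pickle
import tempfile


class ChunkedWriter:
    """Hypothetical write-only wrapper: splits each write into pieces
    of at most max_chunk bytes."""

    def __init__(self, f, max_chunk=(1 << 31) - 1):
        self.f = f
        self.max_chunk = max_chunk
        self.chunk_sizes = []  # recorded only so the demo can inspect them

    def write(self, data):
        view = memoryview(data)  # slicing a memoryview does not copy the data
        total = len(view)
        for start in range(0, total, self.max_chunk):
            piece = view[start:start + self.max_chunk]
            self.f.write(piece)
            self.chunk_sizes.append(len(piece))
        return total


payload = bytes(1000)
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as f:
    writer = ChunkedWriter(f, max_chunk=64)
    pickle.dump(payload, writer, protocol=4)
with open(path, "rb") as f:
    restored = pickle.load(f)
```

pickle.dump only needs an object with a write method, so a small chunk size such as 64 bytes is enough to see that every underlying write stays within the limit.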

I ran into this problem too. To work around it, I split my code into several iterations. In my case I have 50,000 documents for which I have to compute TF-IDF and run kNN classification. When I ran it and iterated over all 50,000 directly, it gave me "that error". So, to solve the problem, I process the data in chunks.
# Fragment of a method body; `documents` is assumed to be defined earlier.
tokenized_documents = self.load_tokenized_preprocessing_documents()
idf = self.load_idf_41227()
doc_length = len(documents)
for iteration in range(0, 9):
    # Build the TF-IDF rows for this pass, then save them to their own file.
    tfidf_documents = []
    for index in range(iteration, 4000):
        doc_tfidf = []
        for term in idf.keys():
            tf = self.term_frequency(term, tokenized_documents[index])
            doc_tfidf.append(tf * idf[term])
        doc = documents[index]
        tfidf = [doc_tfidf, doc[0], doc[1]]
        tfidf_documents.append(tfidf)
        print("{} from {} document {}".format(index, doc_length, doc[0]))

    self.save_tfidf_41227(tfidf_documents, iteration)
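The general pattern of saving results batch by batch, so that no single pickle file grows too large, might look like the following sketch. The helper names (iter_batches, save_in_batches, load_batches), the file naming scheme, and the sample records are all hypothetical and are not taken from the answer above.

```python
import os
import pickle
import tempfile


def iter_batches(items, batch_size):
    """Yield consecutive, non-overlapping slices of items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def save_in_batches(items, directory, batch_size):
    """Pickle each batch to its own file and return the file paths in order."""
    paths = []
    for number, batch in enumerate(iter_batches(items, batch_size)):
        path = os.path.join(directory, "batch_{}.pkl".format(number))
        with open(path, "wb") as f:
            pickle.dump(batch, f, protocol=4)
        paths.append(path)
    return paths


def load_batches(paths):
    """Load the batch files back and join them into one list."""
    items = []
    for path in paths:
        with open(path, "rb") as f:
            items.extend(pickle.load(f))
    return items


records = [[i, "doc{}".format(i)] for i in range(10)]
out_dir = tempfile.mkdtemp()
batch_paths = save_in_batches(records, out_dir, batch_size=4)
```

In practice the batch size would be chosen so that each pickled batch stays well below 2 GB.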

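I had the same problem and fixed it by upgrading to Python 3.6.8.

This seems to be the PR that did it:

You can specify the protocol for the dump. If you do

pickle.dump(obj, file, protocol=4)

it should work.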
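As a small illustration of what the protocol argument changes, the sketch below pickles a tiny bytearray with protocol 4 and reads the protocol number back from the stream header. The variable names are illustrative. It says nothing about the OS X write limit, only about the pickle format itself.

```python
import pickle

data = bytearray(b"example")

# Protocol 4 was added in Python 3.4 and supports very large objects.
blob = pickle.dumps(data, protocol=4)

# A stream written with protocol 2 or later starts with the PROTO opcode
# (0x80) followed by one byte holding the protocol number.
stream_protocol = blob[1]

restored = pickle.loads(blob)
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL, stream_protocol)
```

On Python 3.4 through 3.7 the default protocol is 3, so protocol 4 has to be requested explicitly, for example with protocol=4 or protocol=pickle.HIGHEST_PROTOCOL.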

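Works for me without problems. Python 3.4.1 on Windows.

Breaks on OS X. This actually has nothing to do with pickle.

open('/dev/null', 'wb').write(bytearray(2**31 - 1))

works, but

open('/dev/null', 'wb').write(bytearray(2**31))

throws that error. Python 2 does not have this problem.

@Blender: What throws an error for you works for me with both Python 2.7.10 and Python 3.4.3 (on OS X, MacPorts versions).

@Blender, @EOL:

open('/dev/null', 'wb').write(bytearray(2**31))

fails for me too, with MacPorts' Python 3.4.3.

Bug report:

Does the above code work on any platform? If so, the above code is more like "FileThatAlsoCanBeLoadedByPickleOnOSX", right? Just trying to understand... It's not as if using pickle.load(MacOSFile(fin)) on Linux would break, correct?

@markhor Also, would you implement a write method?

Thanks. This was very helpful. One thing: for the write, for idx in range(0, n_bytes, max_bytes): should be for idx in range(0, len(bytes_out), max_bytes):

@lunguini, for the write block, instead of range(0, n_bytes, max_bytes), should it be range(0, len(bytes_out), max_bytes)? The reason I suggest this is that (on my machine, anyway) n_bytes = 1024 but len(bytes_out) = 1062, and for others using this solution, you are only using the length of the example data, which is not necessarily useful in real-world scenarios.

How has this still not been fixed? It's 2018 and the bug is still there. Does anyone know why?

It was fixed in October 2018; the issue is still open because someone wants to backport the fix to 2.7. Six weeks from now, when Python 2.x reaches EOL, that will be moot.

What I did was: pickle.dump(data, w, protocol=pickle.HIGHEST_PROTOCOL). It worked!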