Python pickle: slow dict deserialization


Let's assume I have a "fairly big" dictionary whose keys are objects with a heavy `__eq__` method:

class MyObject():

     def __eq__(self, other):
         return <very heavy function call>

     def __hash__(self):
         return <not so heavy hash calculation>

mydict = {<MyObject>:<Int>}
Result:

    0: \x80 PROTO      3
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: X        BINUNICODE 'a'
   12: q        BINPUT     1
   14: K        BININT1    0
   16: X        BINUNICODE 'b'
   22: q        BINPUT     2
   24: ]        EMPTY_LIST
   25: q        BINPUT     3
   27: (        MARK
   28: K            BININT1    1
   30: K            BININT1    2
   32: K            BININT1    3
   34: e            APPENDS    (MARK at 27)
   35: u        SETITEMS   (MARK at 5)
   36: .    STOP
highest protocol among opcodes = 2
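A disassembly like the one above can be reproduced with `pickletools.dis`. The sketch below uses an ordinary literal dict as the payload (an assumption, since the exact object pickled in the question is not shown); note that the stream only carries keys and values followed by `SETITEMS`, with no hash information:

```python
import pickle
import pickletools

# An ordinary dict standing in for the question's data.
payload = pickle.dumps({'a': 0, 'b': [1, 2, 3]}, protocol=3)

# dis() prints one line per opcode; there is no opcode that carries
# precomputed hashes -- only the keys and values themselves.
pickletools.dis(payload)
```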

There is no sign of the hash table. This means the Pickle Machine has to recompute the hash table after deserialization. But is that really the case? Why can't pickle save the dictionary's internal state, and how can I work around it?
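That is indeed what happens: the pickle stream only contains keys and values, so `pickle.load` rebuilds the dict by inserting each pair, which calls `__hash__` once per key. A small instrumented sketch (illustrative class name, not from the question) makes this visible:

```python
import pickle

hash_calls = 0

class Key:
    """Illustrative stand-in for MyObject with an instrumented hash."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        global hash_calls
        hash_calls += 1
        return hash(self.value)
    def __eq__(self, other):
        return self.value == other.value

d = {Key(i): i for i in range(100)}
blob = pickle.dumps(d)

hash_calls = 0                 # count only what unpickling does
restored = pickle.loads(blob)
print(hash_calls)              # one __hash__ call per key (100 here)
```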

The pickle machine has to work across all kinds of Python objects, and some objects' hash implementations would not survive being stored, so storing them could break things.

If all you need is to save JSON-compatible data, use JSON instead, as it will be very efficient.

Use pickled objects as dictionary keys. By using the pickled bytes of each object as the key, we can bypass the `MyObject.__hash__` and `MyObject.__eq__` methods and potentially speed up the unpickling process.

Once you have the unpickled dictionary, you can lazily convert the pickled keys back into actual MyObject instances as needed.
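The lazy conversion described above could be sketched like this (hypothetical helper `decode_key` is not from the original post; tuples stand in for MyObject instances):

```python
import pickle

def pickle_key_dict(d):
    """Replace object keys with their pickled bytes; bytes hash cheaply."""
    return {pickle.dumps(k): v for k, v in d.items()}

def decode_key(pickled_dict, raw_key):
    """Lazily turn one pickled key back into a real object when needed."""
    obj = pickle.loads(raw_key)
    return obj, pickled_dict[raw_key]

# Stand-in data: tuples instead of heavy MyObject instances.
d = {('a', 1): 10, ('b', 2): 20}
pd = pickle_key_dict(d)
obj, val = decode_key(pd, pickle.dumps(('a', 1)))
```

This works because pickling the same simple value twice yields the same bytes, so the pickled form can be recomputed to look a key up again.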

p1.py: Create a dictionary whose keys are MyObject instances and pickle it to a file. Then create a dictionary whose keys are the pickled MyObject bytes. Compare the time it takes to unpickle each dictionary.

#!/usr/bin/env python3

import pickle
import random
import string
import time

random.seed(19891225)


class MyObject(object):

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        exit(1)  # fail loudly: proves __eq__ is never reached in this test
        return self.value == other.value

    def __hash__(self):
        time.sleep(10)  # simulate a slow (but cheaper than __eq__) hash
        return hash(tuple(self.value))


def key_value():
    value_len = random.randint(0, len(string.ascii_letters))
    value = random.sample(string.ascii_letters, value_len)
    return (MyObject(value), value_len)

def object_key_dict(size=10):
    return dict([key_value() for i in range(0, size)])

def pickle_key_dict(d):
    return {pickle.dumps(k): v for k, v in d.items()}

def pickle_unpickle(obj, filename):
    with open(filename, "wb") as pout:
        pickle.dump(obj, pout)

    # %s (seconds since the epoch) is a platform-specific strftime extension
    print(f"Unpickle Start: {time.strftime('%H:%M:%s')}")
    with open(filename, "rb") as pin:
        newobj = pickle.load(pin)
    print(f"Unpickle complete: {time.strftime('%H:%M:%s')}")
    print("-"*20)

    return newobj


if __name__ == "__main__":
    d = object_key_dict(size=10)

    print("Use MyObject instances as dictionary keys")
    unpickled_d = pickle_unpickle(d, 'doc.p')

    print("\nUse pickled objects as dictionary keys")
    pd = pickle_key_dict(unpickled_d)
    upd = pickle_unpickle(pd, 'docp.p')
  • The `__eq__` method is only called, when inserting into the `dict`, if a hash collision occurs (the `__hash__` methods of different objects return the same answer). If this happens often enough to cause a noticeable slowdown, it will also slow down every other operation considerably. In the extreme case, if the `__hash__` method always returns the same answer regardless of the object's value, the `__eq__` method will be called ½n² times just for unpickling.

    If that is what is happening, you need to improve the `__hash__` method so that it returns better answers (fewer duplicates across different values of the object). You can check this by manually collecting the results of the `__hash__` method on a representative sample of objects and making sure they are mostly distinct.

  • If the above does not solve the problem, I would suggest profiling the code to confirm where the bottleneck is; then you can see whether you can speed up that bottleneck function.
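The ½n² figure can be checked directly: with a constant `__hash__`, every insertion has to compare the new key against all previously inserted ones, so building a dict of n distinct keys triggers n(n-1)/2 calls to `__eq__`. A small instrumented sketch (not from the answer; class name is illustrative):

```python
eq_calls = 0

class Collider:
    """Worst case: every instance hashes to the same bucket."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 1                     # constant hash: all keys collide
    def __eq__(self, other):
        global eq_calls
        eq_calls += 1
        return self.value == other.value

n = 50
d = {Collider(i): i for i in range(n)}
print(eq_calls)   # n*(n-1)//2 == 1225 comparisons just to build the dict
```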


The pickle format is standardized across Python implementations; it cannot store enough information to rebuild a `dict` in a way that would only work in one version of one interpreter.

Do you need to save it as a pickle? Saving it as JSON seems more efficient.

@Gust, that does not work for me, because after loading I need a Python `dict`. If I use `json` and then build the Python `dict` after loading, it will take the same amount of time. In that case I would need a way to initialize a Python `dict` with precomputed hash values.

If it calls `__eq__` many times during loading, it will also call `__eq__` many times during use, and the rest of the program will be very slow too. A large number of calls to `__eq__` during unpickling indicates a poor `__hash__`; you can test this by returning a constant from the hash.

However, if you use pickled objects as keys, it will still bypass the object's hash and eq methods. Also, I tried the code with duplicate objects, and the eq method still was not called during unpickling. That is the crux of this question.

I just double-checked, and `__eq__` actually is called during unpickling; this applies to objects that are distinct (not equal, `__eq__` returns `False`) but have the same hash value. Use `def __hash__(self): return 1` to confirm this. (If the objects were equal, your `object_key_dict` function would return a dict with only one element.)

I am not saying eq is never called. But by building a hash function that always returns 1, aren't you forcing a worst-case scenario?

You and I fully agree: it is the hash function that causes the problem. Without more information about the business rules, the implementation, and the data volume, there is not much more we can do.

Yes, I am forcing a worst-case scenario to verify that `__eq__` can be called during unpickling. However, if the `__eq__` method slows things down that much during unpickling, I suspect the real scenario is not far from it...
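The experiment described in the comments (a constant `__hash__` forcing `__eq__` calls during unpickling) can be reproduced in a few lines. This instrumented sketch (class name is illustrative) counts only the calls made by `pickle.loads`:

```python
import pickle

eq_calls = 0

class ConstHashKey:
    """Distinct values, but every instance hashes to the same bucket."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 1
    def __eq__(self, other):
        global eq_calls
        eq_calls += 1
        return self.value == other.value

blob = pickle.dumps({ConstHashKey(i): i for i in range(20)})

eq_calls = 0                     # count only what unpickling does
restored = pickle.loads(blob)
print(eq_calls)                  # 20*19//2 == 190: __eq__ runs during load
```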
Output:
Use MyObject instances as dictionary keys
Unpickle Start: 09:25:1589808319
Unpickle complete: 09:26:1589808419
--------------------

Use pickled objects as dictionary keys
Unpickle Start: 09:26:1589808419
Unpickle complete: 09:26:1589808419
--------------------