Python pickle: slow dict deserialization
Let's assume I have a "fairly large" dictionary whose keys are objects with a heavy `__eq__` function:
class MyObject():
    def __eq__(self, other):
        return <very heavy function call>

    def __hash__(self):
        return <not so heavy hash calculation>

mydict = {<MyObject>: <Int>}
Result (disassembled with `pickletools`):
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'a'
12: q BINPUT 1
14: K BININT1 0
16: X BINUNICODE 'b'
22: q BINPUT 2
24: ] EMPTY_LIST
25: q BINPUT 3
27: ( MARK
28: K BININT1 1
30: K BININT1 2
32: K BININT1 3
34: e APPENDS (MARK at 27)
35: u SETITEMS (MARK at 5)
36: . STOP
highest protocol among opcodes = 2
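A disassembly like the one above can be reproduced with `pickletools.dis`. The exact object pickled in the question is not shown; `{'a': 0, 'b': [1, 2, 3]}` is an assumed example that yields a very similar opcode sequence under protocol 3:

```python
import pickle
import pickletools

# Pickle a small dict and disassemble the byte stream. Note there are
# only opcodes for keys and values (BINUNICODE, BININT1, SETITEMS) --
# no hash values are stored anywhere in the stream.
data = pickle.dumps({'a': 0, 'b': [1, 2, 3]}, protocol=3)
pickletools.dis(data)
```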
No sign of the hash table. That means the Pickle Machine has to rebuild the dict's hash table after deserialization, calling `__hash__` for every key. But is that really unavoidable? Why can't pickle save the dictionary's internal state, and how can I work around it?

(Comment: the pickle format has to work across different Python versions and implementations, and objects' hash values are not stable between them, so a stored hash table could not be reused portably.)
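The portability point in the comment can be illustrated with a small experiment: `str` hash values depend on the interpreter's hash seed, so the same key hashes differently across runs. This sketch (my own illustration, not from the question) spawns two interpreters with different `PYTHONHASHSEED` values:

```python
import os
import subprocess
import sys

def hash_of_a(seed):
    # Run a fresh interpreter with a fixed hash seed and report hash('a').
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run([sys.executable, "-c", "print(hash('a'))"],
                         capture_output=True, text=True, env=env)
    return int(out.stdout)

# Different seeds produce different hashes for the same string, so a
# dict's internal hash table cannot be stored portably in a pickle.
print(hash_of_a(1), hash_of_a(2))
```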
If all you need is to save a JSON-compatible object, use JSON instead; it will be far more efficient.

Use pickled objects as dictionary keys. By using the pickled bytes as keys we bypass `MyObject.__hash__` and `MyObject.__eq__` entirely (bytes hash cheaply), which can speed up the unpickling process. Once you have the unpickled dictionary, you can slowly convert the pickled keys back into real `MyObject` instances as needed.
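The lazy conversion mentioned above could look something like this sketch (`restore_key` is a hypothetical helper, not part of the answer's code):

```python
import pickle

def restore_key(d, key_bytes):
    # Unpickle one key's bytes and re-insert the value under the real
    # object; call this lazily, as each original key is actually needed.
    obj = pickle.loads(key_bytes)
    d[obj] = d.pop(key_bytes)
    return obj
```

Each call pays the `__hash__` cost for exactly one key instead of for the whole dict at load time.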
p1.py:
1. Create a dictionary whose keys are `MyObject` instances, then pickle it to a file.
2. Create a dictionary whose keys are the pickled `MyObject` bytes.
3. Compare the time needed to unpickle each dictionary.
#!/usr/bin/env python3
import pickle
import pickletools
from pprint import pprint
import random
import string
import time

random.seed(19891225)


class MyObject(object):
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        exit(1)  # sentinel: abort immediately if __eq__ is ever called
        return self.value == other.value

    def __hash__(self):
        time.sleep(10)  # make every __hash__ call visibly expensive
        return hash(tuple(self.value))


def key_value():
    value_len = random.randint(0, len(string.ascii_letters))
    value = random.sample(string.ascii_letters, value_len)
    return (MyObject(value), value_len)


def object_key_dict(size=10):
    return dict([key_value() for i in range(0, size)])


def pickle_key_dict(d):
    # Replace each MyObject key with its pickled bytes; bytes hash cheaply.
    return {pickle.dumps(k): v for k, v in d.items()}


def pickle_unpickle(obj, filename):
    with open(filename, "wb") as pout:
        pickle.dump(obj, pout)
    print(f"Unpickle Start: {time.strftime('%H:%M:%s')}")
    with open(filename, "rb") as pin:
        newobj = pickle.load(pin)
    print(f"Unpickle complete: {time.strftime('%H:%M:%s')}")
    print("-" * 20)
    return newobj


if __name__ == "__main__":
    d = object_key_dict(size=10)
    print("Use MyObject instances as dictionary keys")
    unpickled_d = pickle_unpickle(d, 'doc.p')
    print("\nUse pickled objects as dictionary keys")
    pd = pickle_key_dict(unpickled_d)
    upd = pickle_unpickle(pd, 'docp.p')
`__eq__` is only called when a hash collision occurs while inserting into the dict (i.e. the `__hash__` method returns the same answer for different objects). If that happens often enough to cause a noticeable slowdown, it will also slow down all other dict operations considerably. In the extreme case, if `__hash__` always returns the same answer regardless of the object's value, `__eq__` will be called about ½n² times just for unpickling. If that is what's happening, you need to improve the `__hash__` method so that it returns better answers (fewer duplicates across different object values). You can check this by manually collecting the results of `__hash__` on a representative sample of your objects and making sure they are mostly distinct.
- If the above does not solve the problem, I suggest profiling the code to confirm where the bottleneck is; then you can see whether you can speed up that bottleneck function.
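That sanity check might look like the following sketch, where random letter tuples stand in for real key objects:

```python
import random
import string

# Collect hash() results over a representative sample and count how many
# are distinct; mostly-distinct hashes mean collisions will be rare.
sample = [tuple(random.sample(string.ascii_letters, 10)) for _ in range(1000)]
distinct = len({hash(o) for o in sample})
print(f"{distinct} distinct hashes out of {len(sample)} objects")
```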
- Do you need to save the dict as a pickle? If you could save it as JSON, that seems more efficient.
- @Gust, that doesn't work for me, because after loading I need a Python `dict`. If I use JSON, building the Python `dict` after loading will take the same amount of time. In that case I would need a way to initialize a Python `dict` with precomputed hash values.
- If it calls `__eq__` many times during loading, then it will also call `__eq__` many times during use, and the rest of the program will be very slow as well. A large number of `__eq__` calls during unpickling points at `__hash__`; you can test this by returning a constant from `__hash__`.
- However, if you use pickled objects as keys, that still bypasses the object's `__hash__` and `__eq__` methods. Also, I tried the code with duplicate objects, and `__eq__` was still not called during unpickling. That is the crux of this question.
- I just double-checked: `__eq__` actually is called during unpickling. This happens for distinct objects (not equal, so `__eq__` returns `False`) that have the same hash value; use `def __hash__(self): return 1` to confirm it. (If the objects are equal, your `object_key_dict` function returns a dict with only one element.)
- I'm not saying `__eq__` is never called. But by building a hash function that always returns 1, aren't you forcing a worst-case scenario?
- You and I fully agree: it is the hash function that causes the problem. Without more information about the business rules, the implementation, and the data volume, there is not much more we can do.
- Yes, I am forcing a worst-case scenario, to verify that `__eq__` can be called during unpickling at all. However, if the `__eq__` method slows unpickling down that much, I suspect the real scenario is not far from it...
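The worst case discussed in the comments can be reproduced with a small experiment (`Collider` is a made-up class for this sketch): give every object the same hash and count how often `__eq__` fires during `pickle.loads`.

```python
import pickle

class Collider:
    calls = 0  # counts __eq__ invocations across all instances

    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return 1  # force every key into the same hash bucket

    def __eq__(self, other):
        Collider.calls += 1
        return self.v == other.v

d = {Collider(i): i for i in range(50)}
Collider.calls = 0
pickle.loads(pickle.dumps(d))
print(Collider.calls)  # roughly n*(n-1)/2 comparisons for n = 50
```

With a well-distributed `__hash__`, the same experiment reports zero `__eq__` calls, which matches both sides of the discussion above.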
Output:

Use MyObject instances as dictionary keys
Unpickle Start: 09:25:1589808319
Unpickle complete: 09:26:1589808419
--------------------
Use pickled objects as dictionary keys
Unpickle Start: 09:26:1589808419
Unpickle complete: 09:26:1589808419
--------------------