Performance difference in serialization between Python 2.x and 3.x
While serializing pandas DataFrames to CSV, I ran into some performance differences between Python 2.7 and 3.5. A quick Google search turned up a benchmark, which I modified a bit for my needs:
import pandas as pd
from time import time
import platform
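
def timeit(func, n=5):
    # Average wall-clock time of n calls to func, in seconds.
    start = time()
    for i in range(n):
        func()
    end = time()
    return (end - start) / n

def csvdumps(s):
    # Write the Series to the file 'foo' as CSV and return the file name.
    s.to_csv('foo')
    return 'foo'

def csvloads(fn):
    return pd.read_csv(fn)

def hdfdumps(s):
    # Write the Series to the HDF5 file 'foo' under the key 'bar'.
    s.to_hdf('foo', key='bar', mode='w')
    return ('foo', 'bar')

def hdfloads(path):
    # path is the (file name, key) tuple returned by hdfdumps.
    return pd.read_hdf(*path)
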
df = pd.DataFrame({'text': [str(i % 1000) for i in range(1000000)],
                   'numbers': range(1000000)})

keys = ['csv', 'hdfstore']
d = {'csv': [csvloads, csvdumps],
     'hdfstore': [hdfloads, hdfdumps]}

result = dict()
for name, (loads, dumps) in d.items():
    text = dumps(df.text)
    numbers = dumps(df.numbers)
    result[name] = {'text': {'dumps': timeit(lambda: dumps(df.text)),
                             'loads': timeit(lambda: loads(text))},
                    'numbers': {'dumps': timeit(lambda: dumps(df.numbers)),
                                'loads': timeit(lambda: loads(numbers))}}

########
# Plot #
########
# Much of this was taken from
# http://nbviewer.ipython.org/gist/mwaskom/886b4e5cb55fed35213d
# by Michael Waskom
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", font_scale=1.3)
w, h = 7, 7
f, (left, right) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(w*2, h), squeeze=True)
df = pd.DataFrame({'loads': [result[key]['text']['loads'] for key in keys],
                   'dumps': [result[key]['text']['dumps'] for key in keys],
                   'storage': keys})
df = pd.melt(df, "storage", value_name="duration", var_name="operation")
# Keyword arguments work on both old and recent seaborn releases.
sns.barplot(x="duration", y="storage", hue="operation", data=df, ax=left)
left.set(xlabel="Duration (s)", ylabel="")
sns.despine(bottom=True)
left.set_title('Cost to Serialize Text')
left.legend(loc="lower center", ncol=2, frameon=True, title="operation")
df = pd.DataFrame({'loads': [result[key]['numbers']['loads'] for key in keys],
                   'dumps': [result[key]['numbers']['dumps'] for key in keys],
                   'storage': keys})
df = pd.melt(df, "storage", value_name="duration", var_name="operation")
# Keyword arguments work on both old and recent seaborn releases.
sns.barplot(x="duration", y="storage", hue="operation", data=df, ax=right)
right.set(xlabel="Duration (s)", ylabel="")
sns.despine(bottom=True)
right.set_title('Cost to Serialize Numerical Data')
right.legend(loc="lower center", ncol=2, frameon=True, title="operation")
plt.savefig('serialize_py'+'.'.join(platform.python_version_tuple())+'.png')
As the results show, serialization (dump) is much slower on Python 3:
        python 2.7    python 3.5    diff
load    0.3504s       0.329005s     +06.50%
dump    1.2784s       3.333152s     -61.65%
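
For reference, the diff column appears to be the Python 2.7 time relative to the Python 3.5 time, that is (t_py27 / t_py35 - 1) * 100, so a negative value means Python 3.5 took longer. A minimal sketch of that arithmetic follows. The helper name relative_diff and the timings dictionary are only illustrative and are not part of the benchmark script above:

```python
def relative_diff(t_py27, t_py35):
    """Percent difference of the Python 2.7 time relative to the Python 3.5 time.

    Assumed formula, inferred from the numbers in the table.
    """
    return (t_py27 / t_py35 - 1.0) * 100.0


# Timings in seconds, copied from the table: (python 2.7, python 3.5).
timings = {
    'load': (0.3504, 0.329005),
    'dump': (1.2784, 3.333152),
}

for op in ('load', 'dump'):
    t27, t35 = timings[op]
    # Also show how many times longer Python 3.5 took than Python 2.7.
    print('%s  diff=%+.2f%%  py3.5/py2.7=%.2fx'
          % (op, relative_diff(t27, t35), t35 / t27))
```

Read this way, the dump on Python 3.5 takes roughly 2.6 times as long as on Python 2.7, while load times are nearly the same.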
Does anyone know why?