Python memory usage: populating a DataFrame from a dict vs. lists of keys and values

python performance list pandas dictionary

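I am writing a package that reads a binary file and returns data that can be used to initialize a DataFrame. I am now wondering whether it is better to return a dict or two lists (one holding the keys, the other holding the values).

The package should not depend entirely on the DataFrame object, which is why it currently outputs the data as a dict (for easy access). If it would save memory and time (both are critical for my application, since I am handling millions of data points), I would rather output the key and value lists. These iterables would then be used to initialize a DataFrame.

Here is a simple example: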
In [1]: d = {(1,1,1): '111',
   ...: (2,2,2): '222',
   ...: (3,3,3): '333',
   ...: (4,4,4): '444'}

In [2]: keyslist=[(1,1,1),(2,2,2),(3,3,3),(4,4,4)]

In [3]: valslist=['111','222','333','444']

In [4]: import pandas as pd

In [5]: dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

In [6]: dfdict
Out[6]: 
         0
a b c     
3 3 3  333
2 2 2  222
1 1 1  111
4 4 4  444

In [7]: dflist=pd.DataFrame(valslist,  index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))

In [8]: dflist
Out[8]: 
         0
a b c     
1 1 1  111
2 2 2  222
3 3 3  333
4 4 4  444

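As far as I know, d.values() and d.keys() create new copies of the data. If we ignore the fact that a dict takes more memory than a list, does using d.values() and d.keys() take more memory than the list-pair implementation?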
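The copying concern can be checked directly. The following is a minimal sketch, not part of the original post. It assumes Python 3, where d.keys() and d.values() return views; in Python 2 (which the xrange calls in the benchmarks suggest was used) they return new lists. In both versions a materialized list holds references to the existing key and value objects, not copies of them.

```python
import sys

d = {(1, 1, 1): "111", (2, 2, 2): "222", (3, 3, 3): "333"}

keys_view = d.keys()        # Python 3: a view with no per-element storage
keys_list = list(d.keys())  # a new list of references to the same key objects

# The list holds pointers to the existing tuples, not copies of them.
same_objects = all(k_list is k_dict for k_list, k_dict in zip(keys_list, d))

# A view reflects later changes to the dict; the materialized list does not.
d[(4, 4, 4)] = "444"
view_len, list_len = len(keys_view), len(keys_list)

print(same_objects, view_len, list_len)
print(sys.getsizeof(keys_view), sys.getsizeof(keys_list))
```

So the extra cost of materializing the keys or values is one pointer per element, on top of the dict itself.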
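I profiled memory usage with 1M rows. The winning structure uses an array.array for each numeric index level and a list for the strings (147MB for the data and 310MB after the conversion to pandas).
According to the Python manual:
Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained.
They even have an append method, and most likely append very quickly.
Second place goes to two separate lists (308MB and 460MB).
The other two options, a dict and a list of four-tuples, were the worst. Dict: 339MB, 524MB. List of four-tuples: 308MB, 514MB.
Here is the array.array usage: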
In [1]: from array import array
In [2]: import gc
In [3]: import pandas as pd
In [4]: %load_ext memory_profiler
In [5]: a1=array("l",range(1000000))
In [6]: a2=array("l",range(1000000))
In [7]: a3=array("l",range(1000000))
In [8]: b=[str(x*111) for x in list(range(1000000))]
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit a1,a2,a3,b
peak memory: 147.64 MiB, increment: 0.32 MiB
In [11]: %memit dfpair=pd.DataFrame(b,  index=pd.MultiIndex.from_arrays([a1,a2,a3], names=['a','b','c']))
peak memory: 310.60 MiB, increment: 162.91 MiB
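To illustrate why the typed arrays come out ahead, here is a small standard-library sketch (not from the original answer). It compares an array.array grown by append with a plain list of the same integers. The size n and the variable names are assumptions chosen for illustration, and sys.getsizeof only approximates the real footprint.

```python
import sys
from array import array

n = 100_000  # smaller than the 1M rows profiled above, to keep the sketch quick

# Grow a typed array one element at a time, as a reader of unknown length would.
index_a = array("l")
for i in range(n):
    index_a.append(i)

# The same values as a plain list of int objects.
index_list = list(range(n))

# For an array, getsizeof includes the buffer of raw C values.
array_bytes = sys.getsizeof(index_a)
# For a list, getsizeof counts only the pointer table, so add the int objects.
list_bytes = sys.getsizeof(index_list) + sum(sys.getsizeof(x) for x in index_list)

print(index_a.itemsize, array_bytes, list_bytes)
```

The array stores raw C integers of itemsize bytes each, while the list stores one pointer per element plus a full Python int object for each value.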

Here is the rest of the code (it is long):
List of four-tuples:
In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d2=[x+(b[i],) for i,x in enumerate(a)]
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d2
peak memory: 308.40 MiB, increment: 0.28 MiB
In [11]: %memit df = pd.DataFrame(d2, columns=['a','b','c','d']).set_index(['a','b','c'])
peak memory: 514.21 MiB, increment: 205.80 MiB

Dictionary:
In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d = dict(zip(a, b))
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d
peak memory: 339.14 MiB, increment: 0.23 MiB
In [11]: %memit dfdict=pd.DataFrame(list(d.values()),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
peak memory: 524.10 MiB, increment: 184.95 MiB

Two lists:
In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: gc.collect()
Out[6]: 0
In [7]: %memit a,b
peak memory: 307.75 MiB, increment: 0.19 MiB
In [8]: %memit dfpair=pd.DataFrame(b,  index=pd.MultiIndex.from_tuples(a, names=['a','b','c']))
peak memory: 459.94 MiB, increment: 152.19 MiB

Here is the benchmark using memory_profiler:
Filename: testdict.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.2 MiB      0.0 MiB   @profile
     5                             def testdict():
     6
     7     66.2 MiB      0.0 MiB        d = {}
     8
     9    260.6 MiB    194.3 MiB        for i in xrange(0,1000000):
    10    260.6 MiB      0.0 MiB                d[(i,i,i)]=str(i)*3
    11
    12    400.2 MiB    139.6 MiB        dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

Filename: testlist.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.5 MiB      0.0 MiB   @profile
     5                             def testlist():
     6
     7     66.5 MiB      0.0 MiB        keyslist=[]
     8     66.5 MiB      0.0 MiB        valslist=[]
     9
    10    229.3 MiB    162.8 MiB        for i in xrange(0,1000000):
    11    229.3 MiB      0.0 MiB                keyslist.append((i,i,i))
    12    229.3 MiB      0.0 MiB                valslist.append(str(i)*3)
    13
    14    273.6 MiB     44.3 MiB        dflist=pd.DataFrame(valslist,  index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))

For the same task and the same data types, the dictionary implementation appears to be less memory-efficient.
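Part of that gap can be attributed to the containers themselves. The sketch below is an illustration, not the author's benchmark. It builds the same keys and values into a dict and into two lists and compares only the container sizes reported by sys.getsizeof. The key and value objects are shared between both structures, so they are left out of the comparison. The size n is an assumption chosen to keep the run short.

```python
import sys

n = 100_000  # assumed size, smaller than the 1M rows in the benchmark

d = {}
keyslist, valslist = [], []
for i in range(n):
    key, val = (i, i, i), str(i) * 3
    d[key] = val          # dict stores a hash plus key and value pointers
    keyslist.append(key)  # each list stores a single pointer per element
    valslist.append(val)

# Container overhead only; the tuples and strings are shared by both layouts.
dict_container = sys.getsizeof(d)
lists_container = sys.getsizeof(keyslist) + sys.getsizeof(valslist)

print(dict_container, lists_container)
```

On CPython the hash table is expected to be noticeably larger than the two pointer arrays combined, which is consistent with the measurements above.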
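Edit
For some reason, when I change the values to arrays of numbers (which is more representative of my data), I get very similar performance from both approaches. Does anyone know why this happens?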
Filename: testdict.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.9 MiB      0.0 MiB   @profile
     5                             def testdict():
     6
     7     66.9 MiB      0.0 MiB        d = {}
     8
     9    345.6 MiB    278.7 MiB        for i in xrange(0,1000000):
    10    345.6 MiB      0.0 MiB                d[(i,i,i)]=[0]*9
    11
    12    546.2 MiB    200.6 MiB        dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

Filename: testlist.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.3 MiB      0.0 MiB   @profile
     5                             def testlist():
     6
     7     66.3 MiB      0.0 MiB        keyslist=[]
     8     66.3 MiB      0.0 MiB        valslist=[]
     9
    10    314.7 MiB    248.4 MiB        for i in xrange(0,1000000):
    11    314.7 MiB      0.0 MiB                keyslist.append((i,i,i))
    12    314.7 MiB      0.0 MiB                valslist.append([0]*9)
    13
    14    515.2 MiB    200.6 MiB        dflist=pd.DataFrame(valslist,  index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))

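Why not use numpy arrays instead? Their memory footprint is much lower than that of lists and dictionaries.

I am not using numpy because I do not know the size of the data in advance, so I have to fill a list or dict first and then initialize a numpy array or pandas DataFrame. I will write up a memory usage benchmark of list vs. dict.

Doesn't this also depend on the data types: str, int and float?

You can use dfdict=pd.DataFrame.from_dict(d, orient='index') to convert the dict directly into a DataFrame.

Thanks! I also found that the dict does not perform as well as the lists. I will look more deeply into arrays. Thanks for all the awesome pointers; I will post my findings too.

Strings in Python remember previous strings, so "000" takes up the same memory as another "000". Possibly "1000" is actually one byte linked to the "000" string. Numbers cannot do this.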
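The string-sharing idea in the last comment can be tested directly. The sketch below is an illustration under the assumption of CPython. As far as the documented behavior goes, equal strings are guaranteed to be one object only after interning (for example through sys.intern), and each string object stores its own character data, so "1000" would not be stored as a link to "000".

```python
import sys

# Two equal strings built at run time.
a = str(111)
b = str(100 + 11)

equal = a == b  # same contents

# sys.intern returns one canonical object for equal strings.
interned_same = sys.intern(a) is sys.intern(b)

# Whether a and b were already the same object is an implementation detail.
print(equal, a is b, interned_same)
```

If many repeated string values are expected, interning them while reading the file is one way to make them share memory explicitly.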