Python list/dict vs. numpy array: performance vs. memory control


python performance memory-management


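I have to iteratively read data files and store the data in (numpy) arrays. I chose to store the data in a dictionary of "data fields": {'field1': array1, 'field2': array2, ...}

Case 1 (lists): using lists (or collections.deque()) to "append" new data arrays, the code is efficient. However, when I concatenate the arrays stored in the lists, the memory grows and I have not managed to free it again. Example: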

    import numpy as np

    filename = 'test'  # data file with a matrix of shape (98, 56)
    nFields = 56

    # Initialize data dictionary and list of fields
    dataDict = {}  # data dictionary: each entry contains a list
    field_names = []
    for i in range(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = []

    # Read a data file N times (it represents N files reading)
    # file contains 56 fields of arbitrary length in the example
    # Append each time the data fields to the lists (in the data dictionary)
    N = 10000
    for j in range(N):
        xy = np.loadtxt(filename)
        for i, field in enumerate(field_names):
            dataDict[field].append(xy[:, i])

    # Concatenate list members (arrays) to a numpy array
    for key, value in dataDict.items():
        dataDict[key] = np.concatenate(value, axis=0)

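Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python

Case 2 (numpy arrays): concatenating the numpy arrays directly each time they are read is inefficient, but the memory remains under control. Example: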


    import numpy as np

    filename = 'test'  # data file with a matrix of shape (98, 56)
    nFields = 56

    dataDict = {}  # data dictionary: each entry contains an (initially empty) array
    field_names = []
    for i in range(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = np.array([])

    # Read a data file N times (it represents N files reading)
    # Concatenate data fields to numpy arrays (in the data dictionary)
    N = 10000
    for j in range(N):
        xy = np.loadtxt(filename)
        for i, field in enumerate(field_names):
            dataDict[field] = np.concatenate((dataDict[field], xy[:, i]))
Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python
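Question(s):

1. Is there any way to get the performance of Case 1 while keeping the memory under control as in Case 2?

2. In Case 1, the memory seems to grow when concatenating the list members (np.concatenate(value, axis=0)). Is there a better way of doing it?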

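Here is what is going on, based on what I have observed. There is not actually a memory leak. Instead, Python's memory management code (possibly in combination with the memory management of whatever OS you are on) decides to keep the space used by the original dictionary (the one without the concatenated arrays) inside the program. That space is, however, free to be reused. I verified this as follows:

1. Turning the code you posted as an answer into a function that returns dataDict.

2. Calling the function twice and assigning the results to two different variables.

When I did this, the amount of memory used only increased from ~900 MB to ~1.3 GB. Without the extra dictionary memory, the Numpy data itself should take up about 427 MB by my calculations, so this adds up. The second initial, unconcatenated dictionary that the function created simply reused the memory that had already been allocated.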
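As a rough cross-check of that estimate, the raw data size can be worked out from the numbers given in the question. This sketch assumes that np.loadtxt returned float64 values (its default dtype) and that every read yields the full (98, 56) matrix; the variable names are illustrative only. The result lands in the same range as the ~427 MB quoted above, with the exact figure depending on what is counted and on the unit used.

```python
# Shape of the test matrix and number of reads, as stated in the question.
rows, n_fields, n_reads = 98, 56, 10000
bytes_per_value = 8  # assumption: float64, the default dtype of np.loadtxt

total_bytes = rows * n_fields * n_reads * bytes_per_value

print(total_bytes)                    # 439040000
print(round(total_bytes / 1e6, 1))    # 439.0 (MB, decimal)
print(round(total_bytes / 2**20, 1))  # 418.7 (MiB, binary)
```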

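If you really cannot use more than ~600 MB of memory, then I would recommend doing with your Numpy arrays roughly what Python lists do internally: allocate an array with a certain number of columns, and when you have used those up, create an enlarged array with more columns and copy the data over. This reduces the number of concatenations, so it will be faster (though still not as fast as lists), while keeping memory usage down. Of course, it is also more of a pain to implement.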
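The grow-and-copy idea described above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a tuned implementation: GrowableArray and read_fields are hypothetical names (they are not part of NumPy), the buffers are one-dimensional, one per field, and the doubling factor and initial capacity are arbitrary choices.

```python
import numpy as np


class GrowableArray:
    """1-D buffer that doubles its capacity when full (hypothetical helper)."""

    def __init__(self, initial_capacity=1024, dtype=np.float64):
        self._buf = np.empty(max(1, initial_capacity), dtype=dtype)
        self._size = 0  # number of valid elements stored in self._buf

    def extend(self, values):
        values = np.asarray(values, dtype=self._buf.dtype).ravel()
        needed = self._size + values.size
        if needed > self._buf.size:
            # Grow geometrically so the total copying cost stays linear.
            new_capacity = max(needed, 2 * self._buf.size)
            new_buf = np.empty(new_capacity, dtype=self._buf.dtype)
            new_buf[:self._size] = self._buf[:self._size]
            self._buf = new_buf
        self._buf[self._size:needed] = values
        self._size = needed

    def to_array(self):
        # Copy, so the result does not keep the oversized buffer alive.
        return self._buf[:self._size].copy()


def read_fields(blocks, n_fields):
    """Append column i of every 2-D block to the field named repr(i)."""
    data = {repr(i): GrowableArray() for i in range(n_fields)}
    for xy in blocks:
        for i in range(n_fields):
            data[repr(i)].extend(xy[:, i])
    return {key: buf.to_array() for key, buf in data.items()}
```

In the setting of the question, blocks would be the arrays returned by successive np.loadtxt calls. Note that to_array briefly holds both the buffer and its trimmed copy for one field at a time, so peak memory is somewhat above the final data size.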
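Numpy's concatenate creates an entirely new Numpy array every time it is used. Numpy arrays are meant to have their memory preallocated. If you are not doing that, you are not using Numpy very wisely. That is the reason for the slowness in the Numpy example.

@Justin: I cannot preallocate the Numpy arrays because I do not know their lengths beforehand. That is why I prefer to use lists. The problem arises when I convert the lists to numpy arrays: the memory usage grows irreversibly.

I came to a similar conclusion by testing the code in a different way: using an intermediate dictionary. "Python's memory management code (possibly in combination with the memory management of whatever OS you are on) decides to keep the space used by the original dictionary (the one without the concatenated arrays) inside the program." I tried the code on Linux and MacOS and got the same result. I still wonder why python decides to keep the space of a deleted dictionary. Maybe it is not a memory leak, but I still cannot see what it is good for. Thanks for your recommendation!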