Python: why does reading a numpy ndarray from a file consume so much memory?

The file contains 2,000,000 rows, and each row contains 208 comma-separated columns, like this:

0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0

The program reads this file into a numpy ndarray. I expected it to consume about (2,000,000 × 208 × 8 B) ≈ 3.2 GB of memory, but while the program is actually reading the file I see it consume roughly 20 GB.


I don't understand why my program uses so much more memory than expected.
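(As noted in the comments below, the data is read with np.loadtxt(); a minimal sketch of that call, with data.csv as a placeholder file name:)

import numpy as np

# Read the whole 2,000,000 x 208 file in one call; this is the step
# that peaks at ~20 GB rather than the expected ~3.2 GB.
data = np.loadtxt('data.csv', delimiter=',')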

I think you should try pandas for handling big data (text files). pandas is like an Excel for Python, and it uses numpy internally to represent the data.

Saving the data to a binary HDF5 file is another option for large datasets.


This question gives some insight into how to handle large files -
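A sketch of the pandas route under the assumptions that the file is named data.csv and has no header row (both placeholders; to_hdf additionally requires PyTables):

import numpy as np
import pandas as pd

# read_csv parses in C and lets us fix the dtype up front.
df = pd.read_csv('data.csv', header=None, dtype=np.float64)
data = df.values  # the underlying numpy array

# Optionally persist to a binary HDF5 file for faster reloading later:
df.to_hdf('data.h5', key='data', mode='w')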

I am using Numpy 1.9.0, and the memory inefficiency of
np.loadtxt()
and
np.genfromtxt()
seems to be directly related to the fact that they buffer the data in temporary lists:

  • here for
    np.loadtxt()
  • here for
    np.genfromtxt()
Knowing the
shape
of your array beforehand, you can write a file reader that stores the data with the corresponding
dtype, consuming an amount of memory very close to the theoretical one (3.2 GB in this case):

import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        # First pass: count the rows so the array can be pre-allocated.
        nrows = sum(1 for line in f)
        f.seek(0)
        # Infer the number of columns from the first line.
        ncols = len(next(f).split(delimiter))
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        # Second pass: fill the pre-allocated array row by row.
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out
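A usage sketch for the file in the question (the file name is a placeholder):

data = read_large_txt('data.csv', delimiter=',', dtype=np.float64)
# data.nbytes == 2,000,000 * 208 * 8 = 3,328,000,000 bytes,
# i.e. the ~3.2 GB back-of-the-envelope estimate from the question.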

Can you show the exact lines of code that read the data from the file? It is hard to answer if we have to guess.

@BasSwinckels Thanks; I use np.loadtxt() to read the data. Saullo Castro pointed out the problem and roughly explained it.

Having seen the sample rows: using a sparse matrix would probably save a lot of memory, wouldn't it?

@user3666197 Definitely yes, but that would require a more complex reader function... In any case the OP's problem seems to be memory-bound, so this is a direction that trades a potentially blocking memory-bound problem for CPU-bound work, making both the input itself and the further processing feasible on larger datasets (my gut feeling is that the OP is not looking for a one-liner or a few SLOCs, but for a viable and comfortable way to load and process batches like this, and is therefore willing to pay the cost of a smarter input pre-processor).

@user3666197 I tested the problem with
np.loadtxt()
and
np.genfromtxt()
here: not knowing the shape, they are forced to use temporary lists and
list.append()
(see the two links in the answer above).

That is not in question, Saullo - as you mention in your answer, it is an input-processor issue. Forgive my comment; it only touched on the right (more memory-efficient) matrix representation of the dataset.

I have not used pandas; thanks for the suggestion, I will look into it.
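A minimal sketch of the sparse-matrix idea from the comments above, assuming scipy is available (the function read_sparse_txt and its defaults are illustrative, not part of the original answers):

import numpy as np
from scipy import sparse

def read_sparse_txt(path, delimiter=',', dtype=np.float64):
    # Collect only the non-zero entries; for rows that are mostly
    # zeros (like the sample above) this takes far less memory than
    # a dense 2,000,000 x 208 array.
    rows, cols, vals = [], [], []
    nrows, ncols = 0, 0
    with open(path) as f:
        for i, line in enumerate(f):
            fields = line.split(delimiter)
            ncols = len(fields)
            for j, field in enumerate(fields):
                v = float(field)
                if v != 0.0:
                    rows.append(i)
                    cols.append(j)
                    vals.append(v)
            nrows = i + 1
    # Note: the coordinate lists above are themselves temporary Python
    # lists - the complexity/CPU cost traded for memory, as discussed
    # in the comments.
    return sparse.csr_matrix((vals, (rows, cols)),
                             shape=(nrows, ncols), dtype=dtype)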