Python: How to build (or precompute) a histogram from a file too large for memory?

python numpy matplotlib

Is there a Python graphing library that can draw a histogram without storing all of the raw data points as a numpy array or list?

I have a dataset that is too large for memory, and I do not want to use subsampling to reduce the data size.
What I am looking for is a library that can take the output of a generator (each data point yielded from the file, as a float) and build the histogram on the fly. This includes computing the bin size as the generator yields each data point from the file.

If no such library exists, I would like to know whether numpy can precompute a counter of {bin_1:count_1, bin_2:count_2 ... bin_x:count_x} from the yielded data points.

The data points are stored as a vertical matrix in a tab-delimited file, arranged as node, node, score, like this:

node node 5.55555

More information: the data has 104301133 lines (so far); I don't know the minimum or maximum values; the bin widths should be equal; the number of bins could be 1000.

Attempted answer:

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

low = np.inf
high = -np.inf

# find the overall min/max
chunksize = 1000
loop = 0
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    loop += 1
lines = loop*chunksize

nbins = math.ceil(math.sqrt(lines))

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # np.ndarray filled with np.uint32 zeros, CHANGED TO int64

# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=2, delimiter='\t'):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

    # accumulate bin counts over chunks
    total += subtotal

plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')
```

You can iterate over chunks of your dataset and use np.histogram to accumulate the bin counts into a single vector. You need to define the bin edges beforehand and pass them to np.histogram using the bins= parameter.

If you want to make sure that your bins cover all of the values in your array but you don't yet know the minimum and maximum, you will need to loop over the data once beforehand to compute them (e.g. using np.min/np.max), for example:

```python
import numpy as np
import pandas as pd

low = np.inf
high = -np.inf

# find the overall min/max
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
```

Once you have the array of bin counts, you can generate a bar plot from it directly (for example with plt.bar). It is also possible to use the weights= parameter of plt.hist to generate a histogram from a vector of counts rather than from samples, e.g.:

```python
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total, ...)
```

@AlexHall I updated the post and answered the questions in your comment.

@ali_m Computing the bin size as the generator yields each new data point from the file object.

Is 0 a fair minimum? What do you mean by "(and arbitrarily)" at the end? Do you care about the first two columns?

@ThomasMatthew You might be interested in a very robust solution that uses only pure numpy (i.e. without any additional memory-consuming import(s)) >> (a wave of beauty-pageant voting hysteria got that post hidden within minutes, even while its content was still being edited, so this link is a last resort for reaching a solution for large-scale datasets). Regards, Thomas.

This makes sense for one distribution, but suppose I want to plot overlaid histograms of two distributions. If I nest your example inside a loop over a list of two files, can I merge the precomputed bins into a single hist? By a single hist I mean one figure in which the two distributions are plotted as histograms of the different files, e.g. in different colors.

This solution produces TypeError: Cannot cast ufunc add output from dtype('uint32') to dtype('int64') with casting rule 'same_kind' on the line total += subtotal. I changed the totals array to contain zeros of type int64, and now the solution hangs.

The reason for the error is that np.histogram returns an array of signed integer bin counts (I don't know why, since they should never be negative...), and there is no safe way to cast signed integers to unsigned integers in order to add them to total. You can either make total a signed integer array (as you did), or cast subtotal to unsigned integers before adding it to total (as I did in my edit). However, I don't know why it should hang. Perhaps you ran out of memory because you chose too large a chunksize?

chunk[2] should be chunk.iloc[:, 2] (you were indexing a row of each chunk rather than a column).
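Since the question asks for something that consumes a generator of floats rather than pandas chunks, here is a minimal sketch of the same two-pass idea in pure numpy. The names `read_scores`, `iter_chunks`, `streaming_histogram` and the `make_values` callable are hypothetical helpers introduced only for illustration. The sketch assumes the stream is non-empty and can be recreated for a second pass, for example by reopening the file.

```python
from itertools import islice

import numpy as np


def read_scores(path):
    """Yield the third tab-separated column of each line as a float."""
    with open(path) as handle:
        for line in handle:
            yield float(line.rstrip("\n").split("\t")[2])


def iter_chunks(values, chunk_size):
    """Group an iterable of floats into arrays of at most chunk_size items."""
    iterator = iter(values)
    while True:
        chunk = np.fromiter(islice(iterator, chunk_size), dtype=np.float64)
        if chunk.size == 0:
            return
        yield chunk


def streaming_histogram(make_values, nbins, chunk_size=100_000):
    """Two-pass, equal-width histogram over a re-creatable stream of floats.

    make_values is a hypothetical zero-argument callable that returns a
    fresh iterator on every call (assumption: the stream is not empty).
    """
    # Pass 1: find the overall min/max without keeping the data in memory.
    low, high = np.inf, -np.inf
    for chunk in iter_chunks(make_values(), chunk_size):
        low = min(low, chunk.min())
        high = max(high, chunk.max())

    # Equal-width bins spanning the full range of the data.
    bin_edges = np.linspace(low, high, nbins + 1)
    # Signed integers, so np.histogram's counts can be added in place.
    total = np.zeros(nbins, dtype=np.int64)

    # Pass 2: accumulate per-chunk bin counts into a single vector.
    for chunk in iter_chunks(make_values(), chunk_size):
        subtotal, _ = np.histogram(chunk, bins=bin_edges)
        total += subtotal

    return total, bin_edges
```

A call such as `streaming_histogram(lambda: read_scores('gsl_test_1.txt'), nbins=1000)` would return the count vector and bin edges, which can then be passed to `plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)` as in the answer. Only one chunk of `chunk_size` floats is held in memory at a time.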
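For the follow-up about overlaying two distributions, one possible approach (a sketch, not something the answer itself confirms) is to compute a single set of bin edges covering both datasets and then accumulate a separate count vector per dataset against those shared edges. The helper names `value_range`, `shared_bin_edges` and `chunked_counts` are hypothetical, and each dataset is assumed to be available as a non-empty iterable of numeric chunks.

```python
import numpy as np


def value_range(chunks):
    """Return (min, max) over an iterable of numeric chunks."""
    low, high = np.inf, -np.inf
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        low = min(low, chunk.min())
        high = max(high, chunk.max())
    return low, high


def shared_bin_edges(ranges, nbins):
    """Build equal-width bin edges spanning every (min, max) pair in ranges."""
    low = min(r[0] for r in ranges)
    high = max(r[1] for r in ranges)
    return np.linspace(low, high, nbins + 1)


def chunked_counts(chunks, bin_edges):
    """Accumulate np.histogram counts over chunks for fixed bin edges."""
    total = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for chunk in chunks:
        subtotal, _ = np.histogram(chunk, bins=bin_edges)
        total += subtotal
    return total
```

Because both count vectors refer to the same edges, each one could be drawn onto the same axes with a separate `plt.hist(bin_edges[:-1], bins=bin_edges, weights=counts, alpha=0.5)` call, one per file, so the bars line up and the colors distinguish the distributions.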