Python: How to efficiently transpose a 67 GB file/Dask dataframe without loading it entirely into memory?

python dataframe dask

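I have 3 rather large files (67 GB, 36 GB, 30 GB) that I need to train models on. However, the features are rows and the samples are columns. Since Dask does not implement transpose and stores DataFrames split by row, I need to write something to do this myself. Is there a way to transpose efficiently without loading everything into memory?
I have 16 GB of RAM at my disposal and am using a Jupyter notebook. I have written some rather slow code, but I would really appreciate a faster solution. At its current speed, the code below would take a month to finish all the files. The slowest step, by a few orders of magnitude, is awk.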

    import os
    import subprocess

    import dask.dataframe as dd
    import pandas as pd
    from IPython.display import clear_output, display

    # open() and the shell redirect need explicit, expanded paths
    src = os.path.expanduser('~/VeryLarge.tsv')
    tmp = os.path.expanduser('~/file.temp')

    # only the column names are needed here; Dask reads lazily
    df = dd.read_csv(src, sep='\t')
    n_cols = len(df.columns)

    with open('output.csv', 'wb') as fout:
        for i in range(1, n_cols + 1):
            print('AWKing')
            # read a column from the original data and store it elsewhere
            x = "awk '{print $" + str(i) + "}' " + src + " > " + tmp
            subprocess.check_call(x, shell=True)

            print('Reading')
            # load and transpose the column
            col = pd.read_csv(tmp)
            row = col.T
            display(row)

            print('Deleting')
            # remove the temporary file created
            os.remove(tmp)

            print('Storing')
            # store the row in its own csv just to be safe. not entirely necessary
            row_path = os.path.expanduser('~/columns/col_{:09d}'.format(i))
            row.to_csv(row_path, header=False)

            print('Appending')
            # append the row (transposed column) to the new file
            with open(row_path, 'rb') as fin:
                for line in fin:
                    fout.write(line)

            clear_output()
            # just a measure of progress
            print(i / n_cols)

The data itself has 10 million rows (features) and 2000 columns (samples). It just needs to be transposed. Currently, it looks like this:
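I would create an intermediate file and use fp.seek to write the values into it in binary format in their new order, then convert that file back into a new CSV. Each element at (row, column) moves to (column, row). Python floats are C doubles (sys.float_info describes the type; a double is 8 bytes on typical platforms), so the position of each element in the intermediate file is (column * old_row_length + row) * float_size, where old_row_length is the number of rows in the original file.

Then recombine them into a CSV by converting the values back to text and reading old_row_length values for each output line.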
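The two passes described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's actual code: the function name `transpose_via_binary` and its parameters are hypothetical, the input is assumed to be purely numeric (no header row, no row labels, no blank lines), and the row and column counts are assumed to be known in advance.

```python
import struct

FLOAT_SIZE = struct.calcsize("d")  # size of a C double in bytes


def transpose_via_binary(src_path, dst_path, tmp_path, n_rows, n_cols, sep="\t"):
    """Transpose a numeric delimited file through a binary scratch file.

    Hypothetical helper: assumes every field parses as a float and that
    n_rows and n_cols match the input exactly.
    """
    # Pass 1: scatter each value to its transposed offset in the scratch file.
    with open(src_path) as src, open(tmp_path, "wb") as tmp:
        for row, line in enumerate(src):
            for col, field in enumerate(line.rstrip("\n").split(sep)):
                tmp.seek((col * n_rows + row) * FLOAT_SIZE)
                tmp.write(struct.pack("d", float(field)))

    # Pass 2: read the scratch file sequentially, n_rows values per output line.
    with open(tmp_path, "rb") as tmp, open(dst_path, "w") as dst:
        for _ in range(n_cols):
            chunk = tmp.read(n_rows * FLOAT_SIZE)
            values = struct.unpack("{}d".format(n_rows), chunk)
            dst.write(sep.join(repr(v) for v in values) + "\n")
```

Only one input line and one output line are held in memory at a time. The scratch file takes 8 bytes per value, and seeking once per element makes many small writes, so buffering several input rows before scattering them would likely be needed for files of this size.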
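I have modified my original script so that it can be deployed on any number of CPUs. It ran much faster because I could use multiple workers and deploy it on AWS. I used a 96-core machine, which completed the task in about 8 hours. I was quite surprised, since that is nearly linear scaling! The idea is to turn the repetitive work (extracting one column) into a task that can be distributed, and then to assign those tasks to the CPUs. Here the parallelization is done with pool.map().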
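In isolation, the pool.map() pattern looks roughly like this. It is a minimal sketch: `process_column` is a hypothetical stand-in for the real per-column work, and the worker count of 2 is arbitrary.

```python
from multiprocessing import Pool


def process_column(index):
    """Hypothetical stand-in for the per-column work: returns the output name."""
    return "col_{:09d}".format(index)


def run_all(n_columns, processes=2):
    """Hand one task per column to a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # pool.map blocks until every task is done and keeps the input order
        return pool.map(process_column, range(1, n_columns + 1))


if __name__ == "__main__":
    print(run_all(5))
```

Because each column is handled independently of the others, the tasks need no coordination, which fits the near-linear scaling reported above.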
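Using this script from the command line is straightforward:

python3 transposer.py -i largeFile.tsv
You can also specify the other arguments if needed, such as -s/--sep for the input separator.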

    import argparse
    import os
    import subprocess
    from multiprocessing import Pool

    import pandas as pd

    parser = argparse.ArgumentParser(description='Transpose csv')
    parser.add_argument('-i', '--infile', help='Path to input file', default=None)
    parser.add_argument('-s', '--sep', help='input separator', default='\t')
    args = parser.parse_args()
    infile = args.infile
    sep = args.sep

    # read a few rows only, to learn how many columns there are
    df = pd.read_csv(infile, sep=sep, nrows=3)
    n_cols = len(df.columns)


    def READ_COL(item):
        """Extract column `item` (1-based) with awk and save it as one CSV row."""
        print(item)
        outfile = 'outfile{}.temp'.format(item)
        # awk splits on whitespace by default, which covers tab-separated input
        x = "awk '{print $" + str(item) + "}' " + infile + " > " + outfile
        subprocess.check_call(x, shell=True)
        col = pd.read_csv(outfile)
        row = col.T
        row.to_csv('col_{:09d}.csv'.format(item), header=False)
        os.remove(outfile)
        # just a measure of progress
        print(item / n_cols)


    if __name__ == '__main__':
        # one task per column, spread over all available cores
        with Pool(processes=os.cpu_count()) as pool:
            pool.map(READ_COL, range(1, n_cols + 1))

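After this, you should have a number of transposed columns, one file per column. You just need to join them together using cat or some other command-line tool. I just ran:
cat col_* > full_file_transposed.csv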
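If a shell is not available, the same concatenation can be sketched in Python. The helper name `concatenate_rows` is hypothetical. It relies on the zero-padded file names (col_{:09d}) so that sorting the names alphabetically gives the numeric column order, which is the same order the shell glob would produce.

```python
import glob
import shutil


def concatenate_rows(pattern, out_path):
    """Append every file matching `pattern`, in sorted order, to `out_path`.

    Hypothetical helper: `out_path` must not itself match `pattern`.
    """
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                # stream in chunks so no file is held in memory as a whole
                shutil.copyfileobj(part, out)
    return paths
```

For example, `concatenate_rows('col_*', 'full_file_transposed.csv')` would mirror the cat command above.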
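Can you post a sample of the input together with the desired output?
I just edited the question to show a data sample. The data is real, but I changed the feature and sample names because it is company-proprietary.
At the risk of going off topic: if your dataset has a high proportion of zeros, you could consider using a sparse matrix representation. Many common matrix operations are more efficient that way.
So you want 10M columns and 2000 rows? Try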