Python Pandas: specifying dtypes with mixed column data using read_csv

python pandas import dask

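I'm trying to load several fairly large CSV files (about 30 million rows / 7 GB in total). Some columns contain a mix of ints and floats, and I'd like those columns to be np.float16. Ideally, the dtype parameter of read_csv would make the whole import more efficient. However, these mixed-data columns raise an error.

Here is the code that raises the error: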

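import numpy as np
import pandas as pd

def import_processing(filepath, cols, null_cols):
    # Parse with explicit per-column dtypes, then drop the unused columns.
    result = pd.read_csv(filepath, header = None, names = cols.keys(), dtype = cols)
    result.drop(null_cols, axis = 1, inplace = True)
    return result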

data_cols = { 'feature_0' : np.float32,
              'feature_1' : np.float32,
              'feature_2' : np.uint32,
              'feature_3' : np.uint64,
              'feature_4' : np.uint64,
              'feature_5' : np.float16,
              'feature_6' : np.float16,
              'feature_7' : np.float16,
              'feature_8' : np.float16,
              'feature_9' : np.float16,
              'feature_10' : np.float16,
              'feature_11' : np.float16,
              'feature_12' : np.float16,
              'feature_13' : np.float16,
              'feature_14' : np.float16,
              'feature_15' : np.float16,
              'feature_16' : np.float16,
              'feature_17' : np.float16,
              'feature_18' : np.float16,
              'feature_19' : np.float16,
              'feature_20' : np.float16,
              'feature_21' : np.float16,
              'feature_22' : np.float16,
              'feature_23' : np.float16,
              'feature_24' : np.float16,
              'feature_25' : 'M8[ns]',
              'feature_26' : 'M8[ns]',
              'feature_27' : np.uint64,
              'feature_28' : np.uint32,
              'feature_29' : np.uint64,
              'feature_30' : np.uint32}

files = ['./file_0.csv', './file_1.csv', './file_2.csv']
all_data = [import_processing(f, data_cols, ['feature_0', 'feature_1']) for f in files]

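However, if I don't use the dtype parameter, the import slows down considerably, because all of the mixed-data columns are imported as dtype('O') rather than np.float16.

I have been working around this by first applying pd.to_numeric (I'm not sure why this doesn't raise the same error), which converts every column to np.float64, and then using astype() to convert each column to the type I want (including the mixed-data columns to np.float16).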

This process is very slow, so I'd like to know whether there is a better way. At the moment, my (very slow) working function looks like this:

def import_processing(filepath, cols, null_cols):
    result = pd.read_csv(filepath, header = None, names = cols.keys())
    result.drop(null_cols, axis = 1, inplace = True)

    # Build the dtype map for the kept columns without mutating the caller's
    # dict, so the same cols can be reused for the next file.
    kept = {c: t for c, t in cols.items() if c not in null_cols}

    result[result.columns] = result[result.columns].apply(pd.to_numeric, errors='coerce')
    result = result.astype(kept)
    return result
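
As for why pd.to_numeric does not raise the same error: with errors='coerce', any value that cannot be parsed as a number is replaced by NaN instead of stopping the conversion. The following is a minimal sketch of that behaviour; the sample values (including the 'n/a' token) are made up for illustration and are not taken from the original files.

```python
import numpy as np
import pandas as pd

# Hypothetical column as it might look after read_csv falls back to
# dtype object: every cell is a string, and one of them is not a number.
raw = pd.Series(["1", "2.5", "n/a", "4"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_float64 = pd.to_numeric(raw, errors="coerce")

# Downcast afterwards; NaN survives the conversion to float16.
as_float16 = as_float64.astype(np.float16)

print(as_float64.dtype, as_float16.dtype)
print("values coerced to NaN:", int(as_float64.isna().sum()))
```

Note that the coercion is silent, so any text values in the real data would end up as NaN without warning.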

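Edit: I've read that using Dask is (generally) a more efficient way to manage large datasets in Python. I've never used it before, and as far as I know it essentially relies on calls to Pandas for many of its operations, so I assume it would have the same dtype problem.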

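From the error, I'd guess that one of your columns isn't strictly numeric and that there is some text in your data, so the column is interpreted as an object-dtype column. That data can't be coerced to float16. This is just a guess.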
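
That guess can be checked on a small scale. The sketch below uses made-up in-memory CSV text (the 'oops' value is hypothetical) to show that a column mixing ints and floats parses fine with a float dtype, while a single stray text value makes read_csv raise. It then shows one possible way to locate the offending cells. It uses float32 rather than float16 purely to keep the example conservative.

```python
import io
import numpy as np
import pandas as pd

clean = "1\n2.5\n3\n"       # ints and floats only
dirty = "1\n2.5\noops\n"    # one stray text value (hypothetical)

def load(text):
    # Same idea as the question: force a float dtype at parse time.
    return pd.read_csv(io.StringIO(text), header=None,
                       names=["x"], dtype={"x": np.float32})

ok = load(clean)            # mixing 1 and 2.5 is not a problem

try:
    load(dirty)
    failed = False
except ValueError as exc:
    failed = True
    print("read_csv refused the column:", exc)

# To find the culprits, read without dtype and keep the cells that are
# present but do not survive numeric conversion.
col = pd.read_csv(io.StringIO(dirty), header=None, names=["x"])["x"]
bad = col[pd.to_numeric(col, errors="coerce").isna() & col.notna()]
print(bad)
```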

This is dangerous unless you use an ordered dictionary: names=cols.keys()
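
A plain dict preserves insertion order on Python 3.7 and later, but an explicit list of names makes the intended column order obvious on any version. The sketch below is one possible way to write it, using a hypothetical three-column schema and made-up CSV text. It also uses the usecols parameter of read_csv as a suggested alternative to dropping unwanted columns after loading.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical schema; the key order is assumed to match the file's columns.
cols = {"feature_0": np.float32, "feature_1": np.uint32, "feature_2": np.float32}

names = list(cols)                       # explicit, ordered column names
skip = {"feature_0"}                     # columns we do not want to load
keep = [n for n in names if n not in skip]

csv_text = "0.5,7,1\n1.5,8,2.25\n"       # made-up data

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=names,
    usecols=keep,                        # skipped columns are never loaded
    dtype={n: cols[n] for n in keep},    # dtypes only for the kept columns
)

print(df.dtypes)
```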

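We also can't help you unless you show us cols.

@IanS - I've added more details. Is there a better way to do this?

Honestly, I would read everything in as float64 and then convert to float16. Would that avoid the type error?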
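
The read-as-float64-then-downcast suggestion might look like the sketch below, which uses made-up data and two column names borrowed from the question. It only helps if the columns are truly numeric: a stray text value would still fail to parse as float64. The last line illustrates a separate caveat of float16, which cannot represent every integer above 2048 exactly.

```python
import io
import numpy as np
import pandas as pd

csv_text = "1,0.5\n2,1.25\n3,4\n"        # made-up mixed int/float data
names = ["feature_5", "feature_6"]

# Parse every column as float64 first, which the parser handles natively.
wide = pd.read_csv(io.StringIO(csv_text), header=None,
                   names=names, dtype=np.float64)

# Then shrink to float16 in one vectorised step.
narrow = wide.astype(np.float16)
print(narrow.dtypes)

# float16 keeps only 11 significant bits, so large integers get rounded.
print(np.float16(2049))                  # prints 2048.0
```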
