Python Dask workers run out of memory before finishing when writing parquet files


I am trying to convert a large number of .csv files to parquet format using python and dask.

This is the code I am using:

import os
from timeit import default_timer

import dask.dataframe as dd

# col_types, TRANS_PATH, PARQUET_PATH and the attribs dataframe are all
# defined earlier in the notebook.
trans = dd.read_csv(os.path.join(TRANS_PATH, "*.TXT"),
                    sep=";", dtype=col_types, parse_dates=['salesdate'])

trans = trans.drop('salestime', axis=1)
trans['month_year'] = trans['salesdate'].dt.strftime('M_%Y_%m')

trans['chainid'] = '41'
trans['key'] = trans['chainid'] + trans['barcode']
trans = trans.join(attribs[['catcode']], on=['key'])

start = default_timer()
trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
                 partition_on=['catcode', 'month_year'], append=False)
end = default_timer()
print("Done in {} secs.".format(end - start))
The code seems to work fine: all the parquet files are created under the correct directories, with almost no warnings until near the end. The graph execution goes on until it reaches this point:

At this point, the program gets stuck for a minute or so, and then the following warnings appear:

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 16996 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
After the warnings, hundreds of processes are restarted and executed again:

This happens three times, and the program finally crashes:

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-8-f644e4fa53ea> in <module>()
     27 start = default_timer()
     28 trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
---> 29                      partition_on=['catcode', 'month_year'], append=False)

Does anyone know why this is happening?
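
For context, the "95% memory budget" in the warning is checked by the nanny against each worker's memory_limit, and signal 15 is the SIGTERM it sends when that budget is exceeded. A minimal sketch of starting a local cluster with an explicit per-worker limit (the worker count and the "4GB" figure are placeholders, not the poster's actual settings):

from dask.distributed import Client, LocalCluster

# Placeholder sizing: single-threaded workers with a known memory budget,
# so the nanny's 95% threshold maps to a predictable amount of RAM.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)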

I have the same problem, any help?
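
One commonly suggested mitigation, offered here as a sketch rather than a confirmed fix: partition_on makes each task group its rows and write a separate file for every (catcode, month_year) combination it contains, so smaller input partitions mean less data buffered per worker during the write. Splitting the existing partitions before calling to_parquet (the factor of 4 is an arbitrary illustration):

# Split each partition into roughly four smaller ones before the
# partitioned write; the multiplier is a guess, not a tuned value.
trans = trans.repartition(npartitions=trans.npartitions * 4)
trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
                 partition_on=['catcode', 'month_year'], append=False)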