
Pyspark: losing executors when saving a parquet file


I load a dataset that is roughly 20 GB in size - the cluster has 1 TB available, so memory should not be the problem.

Saving the raw data, which consists only of strings, works fine for me:

df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
However, when I transform the data:

df_transformed = df_data.drop('bri').join(
    df_data[['docId', 'bri']].rdd \
        .map(lambda x: (x.docId, json.loads(x.bri))
             if x.bri is not None else (x.docId, dict())) \
        .toDF() \
        .withColumnRenamed('_1', 'docId') \
        .withColumnRenamed('_2', 'bri'),
    ['dokumentId']
)
and then save it:

df_transformed.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
the log output tells me that the memory limit has been exceeded:

18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm not quite sure what the problem is. Even setting the memory per executor to 60 GB of RAM does not solve the problem.
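For reference, the overhead named in the log message (spark.yarn.executor.memoryOverhead) is configured separately from the executor heap. A minimal sketch of how both settings could be passed when building the session - the values here are placeholders for illustration, not tuned recommendations:

from pyspark.sql import SparkSession

# spark.yarn.executor.memoryOverhead (the pre-2.3 property name, in MB) is the off-heap
# allowance YARN grants on top of spark.executor.memory; the killed containers in the
# log exceeded heap + overhead. Both values below are placeholders, not recommendations.
spark = (
    SparkSession.builder
    .appName("parquet-write")                               # hypothetical app name
    .config("spark.executor.memory", "12g")                 # executor heap size
    .config("spark.yarn.executor.memoryOverhead", "4096")   # off-heap overhead in MB
    .getOrCreate()
)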


So, obviously, the problem comes from the transformation. Do you have any idea what exactly causes this problem?

Did any of the help here work out for you?
@pault Well, actually I think the problem is with one of the columns in the DataFrame - there is a column called volltext, and that seems to be the issue. This column contains the largest amount of data (though still below 20 GB), but for some reason I run into this exception unless I drop it.
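To make the workaround described in that comment concrete, a minimal sketch of dropping the volltext column before writing - the column name is taken from the comment, and the path and base variable from the question:

# Sketch of the workaround reported above: drop the large 'volltext' column before
# writing, since the exception reportedly does not occur without that column.
df_transformed.drop('volltext') \
    .write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')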