Spark checkpoint causes a join problem

dataframe apache-spark pyspark

I have a piece of code that essentially does the following:

df = spark.read.parquet("very large dataset")
df = df.select(columns)  # DataFrames are immutable, so keep the result
df = df.filter("some rows I dont want")

df2 = df.groupBy('keys').agg("max of a column")
df = df.drop("columns that will be got from df2")
df = df.join(df2, on=["key cols"], how="left")

spark.sparkContext.setCheckpointDir("checkpoint/path")
df3=df.checkpoint()

df4=df3.filter("condition 1").groupBy('key').agg("perform aggregations")
df5=df3.filter("condition 2").select(certain columns).alias(rename them)

df6=df4.join(df5, on=["key cols"], how="outer") #perform full outer join to get all columns and rows

At this point, I get the following error:

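Resolved attribute(s) UL#28099 missing from ToolID#27908, LL#27913, LW#27915, UL#27236, UW#27914, SequenceID#27907, Result#27911, Timestamp#27909, Date#27910 in operator !Project [SequenceID#27907, ToolID#27908, Timestamp#27909, Date#27910, Result#27911, UL#28099, cast(CASE WHEN isnull(LL#27913) THEN -Infinity ELSE LL#27913 END as double) AS LL#27246, UW#27914, LW#27915]. Attribute(s) with the same name appear in the operation: UL. Please check if the right attribute(s) are used.\nJoin FullOuter\n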

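However, when I remove the checkpoint and run it as a normally cached DataFrame, it works fine. That would be acceptable if my dataset were small, but I need the checkpoint because my dataset is very large relative to the available EMR resources.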
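To make the comparison concrete, here is a minimal sketch of the same pipeline with a switch between `checkpoint()` and `cache()`. The helper name `build_joined_frame` and the column names `key`, `flag`, `value`, and `limit` are hypothetical stand-ins for my real ones, and the filter conditions and aggregations are simplified placeholders.

```python
def build_joined_frame(spark, path, use_checkpoint=True,
                       checkpoint_dir="checkpoint/path"):
    """Hypothetical helper: one pipeline, materialized by checkpoint() or cache()."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.sql import functions as F

    # Assumed columns: key, flag, value, limit (placeholders for the real schema).
    df = spark.read.parquet(path)
    df = df.select("key", "flag", "value", "limit")
    df = df.filter(F.col("value").isNotNull())

    # Per-key maximum, joined back onto the rows it was computed from.
    df2 = df.groupBy("key").agg(F.max("limit").alias("limit"))
    df = df.drop("limit").join(df2, on=["key"], how="left")

    if use_checkpoint:
        # Writes the data to checkpoint_dir and truncates the logical plan.
        spark.sparkContext.setCheckpointDir(checkpoint_dir)
        df3 = df.checkpoint()
    else:
        # Keeps the full lineage; data is cached when first computed.
        df3 = df.cache()

    # Two branches derived from the same materialized DataFrame.
    df4 = (df3.filter(F.col("flag") == 1)
              .groupBy("key")
              .agg(F.avg("value").alias("avg_value")))
    df5 = (df3.filter(F.col("flag") == 0)
              .select("key", F.col("limit").alias("limit_0")))

    # Full outer join to get all columns and rows.
    return df4.join(df5, on=["key"], how="outer")
```

Calling it once with `use_checkpoint=True` and once with `use_checkpoint=False` on the same input should show whether the checkpoint alone is what triggers the error.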

Has anyone run into a similar problem?