Apache Spark: How to partition a DataFrame by year and month

I want to partition my DataFrame by year/month/day. I also want to drop all empty partitions and save the output on my local machine under folders such as year/month/day.

I tried the following, but it still created more than 200 partitions:

val sqldf = spark.sql("SELECT year(EventDate) AS Year_EventDate, month(EventDate) as Month_EventDate FROM table CLUSTER BY Year_EventDate,Month_EventDate")


sqldf.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(destinationFolder)
The reason you are getting 200 partitions (I assume?) is that 200 is the default level of parallelism for shuffle tasks in Spark (spark.sql.shuffle.partitions). Depending on the size of your data, you can coalesce it into fewer partitions if needed:

sqldf.coalesce(10)
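
For example, a minimal sketch (assuming a Spark 2.x SparkSession named spark and the sqldf DataFrame from above) that either applies the coalesce right before the write, or lowers the shuffle parallelism up front so the CLUSTER BY query produces fewer partitions in the first place:

// Option A: coalesce the existing DataFrame just before writing
sqldf.coalesce(10)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .save(destinationFolder)

// Option B: lower the shuffle parallelism before running the query,
// so the CLUSTER BY itself yields fewer output partitions (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "10")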
To write into the folders you want, first repartition the data by the partitioning columns, then pass the corresponding hint to the writer via partitionBy:

sqldf.repartition($"year", $"month", $"day")
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .save(destinationFolder)
Make sure that year, month, and day (or whatever names you choose for them) actually exist as columns in your data.
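
Since the question only has an EventDate column, those columns can be derived first. Here is a minimal end-to-end sketch under that assumption (the DataFrame name events and the local output path are placeholders; adjust them to your schema):

import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

// Sketch only: `events` is assumed to be a DataFrame with an EventDate column,
// and destinationFolder a local path such as "file:///tmp/output".
val partitioned = events
  .withColumn("year", year(col("EventDate")))
  .withColumn("month", month(col("EventDate")))
  .withColumn("day", dayofmonth(col("EventDate")))
  .repartition(col("year"), col("month"), col("day"))

partitioned.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .save(destinationFolder)

With partitionBy, the writer creates nested year=/month=/day= folders under destinationFolder and only for combinations that actually contain rows, so empty partitions produce no directories.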