PySpark: writing two DataFrames to the same partition, separated by a folder

apache-spark pyspark

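I am using Spark to write two different DataFrames to the same partition, but I want them separated by a folder at the end of the partition path, i.e. the first DataFrame should be written to yyyy/mm/dd/ and the second to yyyy/mm/dd/rejected/.

Currently, I can write the first DataFrame to yyyy/mm/dd/ and the second to rejected/yyyy/mm/dd with the following code: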

  first_df.repartition('year', 'month', 'day').write \
    .partitionBy('year', 'month', 'day') \
    .mode("append") \
    .csv(f"{output_path}/")

  second_df.repartition('year', 'month', 'day').write \
    .partitionBy('year', 'month', 'day') \
    .mode("append") \
    .csv(f"{output_path}/rejected")

Any suggestions are appreciated.

Add rejected as a literal column on the second DataFrame and include it in partitionBy, i.e.

  from pyspark.sql.functions import lit

  second_df.withColumn("rej", lit("rejected")) \
    .repartition('year', 'month', 'day').write \
    .partitionBy('year', 'month', 'day', 'rej') \
    .mode("append") \
    .csv(f"{output_path}")

Another approach is to move the files into the respective directories after writing.

Update: to rename the directory:

  # sc is the active SparkContext
  URI = sc._gateway.jvm.java.net.URI
  Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
  FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
  Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

  fs = FileSystem.get(URI("hdfs://<name_node>:8020"), Configuration())

  # rename the directory; partitionBy creates rej=rejected under each
  # year=/month=/day= directory, so the source path has to point there
  fs.rename(Path(f'{output_path}/rej=rejected'), Path(f'{output_path}/rejected'))

Comments:

This works and outputs the files to a folder named rej=rejected. Is there a way to have the folder named just rejected?

@AaronZhong, you can rename the directory, please check the updated answer!
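
Since rej is the last partition column, Spark writes one rej=rejected directory per day rather than a single one at the root of output_path. The following is a minimal sketch of renaming each of them, not a definitive implementation. The helper names rejected_target and rename_rejected_dirs are hypothetical, spark is assumed to be an active SparkSession, and the glob assumes the year=/month=/day= layout produced by the partitionBy call above. What happens when a rejected directory already exists depends on the file system and is not handled here.

```python
def rejected_target(src: str) -> str:
    """Map '.../day=5/rej=rejected' to '.../day=5/rejected' (pure string logic)."""
    parent, _, leaf = src.rstrip("/").rpartition("/")
    if leaf != "rej=rejected":
        raise ValueError(f"not a rej=rejected directory: {src}")
    return f"{parent}/rejected"


def rename_rejected_dirs(spark, output_path: str) -> int:
    """Rename every nested rej=rejected directory; return how many were renamed."""
    jvm = spark.sparkContext._jvm
    Path = jvm.org.apache.hadoop.fs.Path
    # Resolve the file system from the path itself, so no name node URI is hard-coded.
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    fs = Path(output_path).getFileSystem(conf)

    # Assumed layout: <output_path>/year=*/month=*/day=*/rej=rejected
    pattern = f"{output_path.rstrip('/')}/year=*/month=*/day=*/rej=rejected"
    renamed = 0
    for status in fs.globStatus(Path(pattern)) or []:
        src = status.getPath().toString()
        # fs.rename returns False instead of raising when the rename fails.
        if fs.rename(Path(src), Path(rejected_target(src))):
            renamed += 1
    return renamed
```

After the second DataFrame has been written, calling rename_rejected_dirs(spark, output_path) would rename the matching directories and report how many succeeded; comparing that count with the number of day partitions written is a cheap sanity check.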