Scala parquet file is not partitioned


I am trying to save a partitioned Spark DataFrame as parquet into a temporary directory for unit testing, but for some reason the partitions are not being created. The data itself is saved into the directory and is usable for the tests. Here is the method I created for this:

def saveParquet(df: DataFrame, partitions: String*): String = {
    val path = createTempDir()
    df.repartition(1).parquet(path)(partitions: _*)
    path
  }

val feedPath: String = saveParquet(feedDF.select(feed.schema), "processing_time")
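
The createTempDir helper is not shown in the question; for context, a minimal stand-in backed by java.nio (purely an assumption, not the asker's code) could be:

import java.nio.file.Files

// Hypothetical helper: creates a throwaway directory for the test output.
def createTempDir(): String = {
  val dir = Files.createTempDirectory("testutils-samples")
  dir.toFile.deleteOnExit()
  dir.toString
}
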
This method works for various dataframes with various schemas, but for some reason no partitions are created for this particular dataframe. I logged the resulting paths, and they look like this:

/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515/part-some-random-number.snappy.parquet
But it should look like this:

/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515/processing_time=1591714800000/part-some-random-number.snappy.parquet
I verified that the data and all the columns read back fine before partitioning; the problem only appears once the partitioning call is added. I also ran a regex over the directory, but it fails to match on the test samples -
s".*processing_time=([0-9]+)/.*parquet".r
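
As a side note, a small sketch of the directory check that regex implies could look like the following; listFilesRecursively and the exact pattern are assumptions, not code from the question:

import java.io.File

// Walk the output directory and collect the processing_time values of any partitioned parquet files.
val partitionedParquet = """.*processing_time=([0-9]+)/.*\.parquet""".r

def listFilesRecursively(dir: File): Seq[File] = {
  val children = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
  children ++ children.filter(_.isDirectory).flatMap(listFilesRecursively)
}

val partitionValues = listFilesRecursively(new File(feedPath)).flatMap { f =>
  f.getAbsolutePath match {
    case partitionedParquet(ts) => Some(ts) // e.g. "1591714800000"
    case _                      => None
  }
}
// An empty result here means no processing_time=... directory was created, which matches the symptom.
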

So what could be causing this? How else could I partition the dataframe?

The DataFrame schema looks like this:

val schema: StructType = StructType(
    Seq(
      StructField("field1", StringType),
      StructField("field2", LongType),
      StructField("field3", StringType),
      StructField("field4Id", IntegerType, nullable = true),
      StructField("field4", FloatType, nullable = true),
      StructField("field5Id", IntegerType, nullable = true),
      StructField("field5", FloatType, nullable = true),
      StructField("field6Id", IntegerType, nullable = true),
      StructField("field6", FloatType, nullable = true),
      //partition keys
      StructField("processing_time", LongType)
    )
  )
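
For the unit-test setup, one illustrative way to build a feedDF matching this schema (the values and the local SparkSession are made up for the example, not taken from the question) would be:

import org.apache.spark.sql.{Row, SparkSession}

// Local session and a single hand-written row, only for demonstrating the partitioned write.
val spark = SparkSession.builder().master("local[*]").appName("parquet-partition-test").getOrCreate()

val sampleRows = Seq(
  Row("a", 1L, "b", 1, 1.0f, 2, 2.0f, 3, 3.0f, 1591714800000L)
)

val feedDF = spark.createDataFrame(spark.sparkContext.parallelize(sampleRows), schema)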

Try
df.repartition(1).write.partitionBy(partitions: _*).parquet(path)
@DusanVasiljevic For some reason with that approach I get
org.apache.spark.sql.AnalysisException: path already exists
on the .parquet call. If the directory is a temporary path that does not contain any other datasets, you can add
mode("overwrite")
and make the call like this:
df.repartition(1).write.mode("overwrite").partitionBy(partitions: _*).parquet(path)
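
Putting the comment thread together, the whole helper would then look roughly like this (still assuming the same hypothetical createTempDir shown earlier):

import org.apache.spark.sql.DataFrame

def saveParquet(df: DataFrame, partitions: String*): String = {
  val path = createTempDir()
  df.repartition(1)
    .write
    .mode("overwrite")            // safe only because the temp dir holds no other datasets
    .partitionBy(partitions: _*)  // this is the call that creates the processing_time=<value> directories
    .parquet(path)
  path
}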