Scala Spark parquet partitioning: can I partition by the value of a given map element?
I want to save my dataframe to parquet files in a Hive table, but I want to partition the dataframe by the value of a specific map element that is guaranteed to exist. For example:
case class Person(name: String, attributes: Map[String, String])
val people = Seq[Person](Person("John", Map("birthDate"->"2019-12-30", "favoriteColor"->"red")),
Person("Lucy", Map("birthDate"->"2019-12-31", "favoriteFood"->"pizza")),
Person("David", Map("birthDate"->"2020-01-01", "favoriteMusic"->"jazz"))).toDF
//pseudo-code, doesn't work
//people.write.format("parquet").partitionBy("attributes[birthDate]").saveAsTable("people")
I can work around this by promoting the value to a top-level field and joining, as shown below, but I'd prefer to avoid that. Beyond the join overhead, our users will be querying attributes[birthDate], so partitioning directly on that field is preferable to partitioning on a separate top-level field.
Is there a way to partition directly on that value without intermediate DataFrames/joins?
val justNameAndBirthDate = people.select($"name", $"attributes"("birthDate")).withColumnRenamed("attributes[birthDate]", "birthDate")
val newDfWithBirthDate = people.join(justNameAndBirthDate, Seq("name"), "left")
newDfWithBirthDate.write.format("parquet").partitionBy("birthDate").saveAsTable("people")
One way is to create a column to partition by, naming it however you like:
val df = people.withColumn("attributes[birthDate]", $"attributes"("birthDate"))
scala> df.show(false)
+------+------------------------------------------------+---------------------+
|name |attributes |attributes[birthDate]|
+------+------------------------------------------------+---------------------+
|John |[birthDate -> 2019-12-30, favoriteColor -> red] |2019-12-30 |
|Lucy |[birthDate -> 2019-12-31, favoriteFood -> pizza]|2019-12-31 |
|David |[birthDate -> 2020-01-01, favoriteMusic -> jazz]|2020-01-01 |
+------+------------------------------------------------+---------------------+
It does duplicate the data, but it works.
Then you can partition as desired:
df.write.format("parquet").partitionBy("attributes[birthDate]").saveAsTable("people")
Check the output table.
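As a quick sanity check (a sketch only, assuming a Hive-enabled SparkSession bound to `spark` and that the write above succeeded), you can list the partitions of the saved table and confirm one partition per distinct birthDate:

```scala
// Sketch: assumes a Hive-enabled SparkSession named `spark` and the
// partitioned `people` table written above.
// Each row of SHOW PARTITIONS is one partition directory,
// e.g. attributes[birthDate]=2019-12-30
spark.sql("SHOW PARTITIONS people").show(false)

// The same layout is visible on disk: one subdirectory per distinct
// value of the partition column under the table's location.
```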