Scala Spark: create column names based on other column values
I am new to Spark and need help transforming this data into the following format:
| id |creation date| final_v1_16_15_wk | final_v1_17_18_wk |final_v2_16_15_wk | final_v2_17_18_wk |
|id_1 | 2020-07-15 | 0.368 | 0.564 | 0.5 | 0.78 |
|id_2 | 2020-07-15 | 0.468 | 0.657 | 0.3 | 0.65 |
I have data in the following format:
+----------+-------------------------+-------------------+---------+------+
| id | values | creation date | leadTime| span |
+----------+-------------------------+-------------------+---------+------+
|id_1 |[[v1, 0.368], [v2, 0.5]] | 2020-07-15 | 16 | 15 |
|id_2 |[[v1, 0.368], [v2, 0.4]] | 2020-07-15 | 16 | 15 |
|id_3 |[[v1, 0.468], [v2, 0.3]] | 2020-07-15 | 16 | 15 |
|id_4 |[[v1, 0.368], [v2, 0.3]] | 2020-07-15 | 16 | 15 |
|id_5 |[[v1, 0.668], [v2, 0.1]] | 2020-07-15 | 16 | 15 |
|id_6 |[[v1, 0.168], [v2, 0.2]] | 2020-07-15 | 16 | 15 |
+----------+-------------------------+-------------------+---------+------+
Using the entries in the "values" column, I need the data in the format below, where the new column names are built from the "leadTime" and "span" column values:
+----------+--------------+--------------------+--------------------+
| id |creation date | final_v1_16_15_wk | final_v2_16_15_wk |
+----------+--------------+--------------------+--------------------+
|id_1 |2020-07-15 | 0.368 | 0.5 |
|id_2 |2020-07-15 | 0.368 | 0.4 |
|id_3 |2020-07-15 | 0.468 | 0.3 |
|id_4 |2020-07-15 | 0.368 | 0.3 |
|id_5 |2020-07-15 | 0.668 | 0.1 |
|id_6 |2020-07-15 | 0.168 | 0.2 |
+----------+--------------+--------------------+--------------------+
Another example of this DF:
val df = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18)
).toDF("id", "values", "creation date", "leadTime", "span")
The output should look like this:
| id |creation date| final_v1_16_15_wk | final_v1_17_18_wk |final_v2_16_15_wk | final_v2_17_18_wk |
|id_1 | 2020-07-15 | 0.368 | 0.564 | 0.5 | 0.78 |
|id_2 | 2020-07-15 | 0.468 | 0.657 | 0.3 | 0.65 |
I tried the following logic to generate the column names/values, but it did not work:
val modDF = finalDF.withColumn("final_" + newFinalDF("values").getItem(0).getItem("_1") + "_" + newFinalDF("leadTime") + "_" + newFinalDF("span") + "_wk", $"values".getItem(0).getItem("_2"));
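The attempt above cannot work because `withColumn` takes the column name as a plain `String` evaluated once on the driver, while `newFinalDF("leadTime")` is a `Column` object; concatenating them just stringifies the `Column` expression instead of producing one name per row combination. The names have to be materialized on the driver first. A minimal, Spark-free sketch of that name-generation step, using sample rows taken from the question:

```scala
// Hypothetical driver-side rows mirroring the question's data:
// (values map, leadTime, span) per row.
val rows = Seq(
  (Map("v1" -> 0.368, "v2" -> 0.5), 16, 15),
  (Map("v1" -> 0.564, "v2" -> 0.78), 17, 18)
)

// For each (values, leadTime, span) row, emit one column name per map key.
val colNames = rows.flatMap { case (values, leadTime, span) =>
  values.keys.map(k => s"final_${k}_${leadTime}_${span}_wk")
}

println(colNames.mkString(", "))
// final_v1_16_15_wk, final_v2_16_15_wk, final_v1_17_18_wk, final_v2_17_18_wk
```

The answer below follows the same idea: collect the `leadTime`, `span`, and `values` arrays to the driver, build the name strings there, and only then call `withColumn` once per generated name.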
Answer (in reply to the comments):
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
import spark.implicits._
val df30 = Seq(
("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),
("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18))
.toDF("id", "values", "creation date", "leadTime", "span")
val df31 = df30.groupBy("id", "creation date")
.agg(
collect_list(col("values")).alias("values"),
collect_list(col("leadTime")).alias("leadTime"),
collect_list(col("span")).alias("span")
).persist(MEMORY_AND_DISK)
val leadTimeArray = df31.select('leadTime).first.getAs[WrappedArray[Int]](0).toArray
val spanArray = df31.select('span).first.getAs[WrappedArray[Int]](0).toArray
val valuesArrayNew = df31.select('values).first.getAs[WrappedArray[Map[String, Float]]](0).toList
val newCols = valuesArrayNew
.zipWithIndex
.flatMap{case(v, i) => v.keys.map(k => s"final_${k}_${leadTimeArray(i)}_${spanArray(i)}_wk")}
val resDF = newCols.foldLeft(df31){(tempDF, colName) =>
tempDF.withColumn(colName,
col("values")(newCols.indexOf(colName) / 2)(if (colName.contains("v1")) "v1" else "v2"))
}.drop("values", "leadTime", "span")
resDF.show(false)
// +----+-------------+-----------------+-----------------+-----------------+-----------------+
// |id |creation date|final_v1_16_15_wk|final_v2_16_15_wk|final_v1_17_18_wk|final_v2_17_18_wk|
// +----+-------------+-----------------+-----------------+-----------------+-----------------+
// |id_1|2020-07-15 |0.368 |0.5 |0.564 |0.78 |
// |id_2|2020-07-15 |0.468 |0.3 |0.657 |0.65 |
// +----+-------------+-----------------+-----------------+-----------------+-----------------+
df31.unpersist()
So the problem is only creating the column names? @koiralo Yes, the main problem is creating the column names from a combination of other column values. I added another example to the question above. Thanks @mvasyliv, that helps. One more question: if I have multiple rows with the same "id" but different values in the "leadTime" and "span" columns, how can I group them into a single row with separate columns? Something like this:
val df = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.4), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.368, "v2" -> 0.3), "2020-07-15", 17, 18)
).toDF("id", "values", "creation date", "leadTime", "span")
Does your "values" column also contain different values? The "values" map always has the same keys "v1" and "v2", but the values can differ per combination of "span" and "leadTime". I updated the DF above: val df = Seq(("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15), ("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18), ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15), ("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18)).toDF("id", "values", "creation date", "leadTime", "span")
Thanks for the reply. I just updated the question with another example that has multiple rows with the same "id" and different values in the "leadTime" and "span" columns:
| id |creation date| final_v1_16_15_wk | final_v1_17_18_wk |final_v2_16_15_wk | final_v2_17_18_wk |
|id_1 | 2020-07-15 | 0.368 | 0.564 | 0.5 | 0.78 |
|id_2 | 2020-07-15 | 0.468 | 0.657 | 0.3 | 0.65 |
Please see the second answer. Thanks a lot @mvasyliv, it works as expected. I tried the above logic on a very large dataset, but these steps are very slow:
val leadTimeArray = df31.select('leadTime).first.getAs[WrappedArray[Int]](0).toArray
val spanArray = df31.select('span).first.getAs[WrappedArray[Int]](0).toArray
val valuesArrayNew = df31.select('values).first.getAs[WrappedArray[Map[String, Float]]](0).toList
Even df31.first is very slow.
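The slowness is expected: each `.first` call launches a separate Spark job over the grouped data, and `newCols.indexOf` inside the fold adds further overhead. A hedged alternative (a sketch against the `df30` defined in the answer above, assuming an active SparkSession; not benchmarked here) is to explode the map and let `pivot` build the combined column names, which keeps the whole reshape distributed and avoids collecting anything to the driver:

```scala
import org.apache.spark.sql.functions._

// Explode the map column into one (key, value) row per entry, then build the
// target column name from key + leadTime + span on the executors.
val exploded = df30
  .select(col("id"), col("creation date"), col("leadTime"), col("span"),
          explode(col("values")))  // yields `key` and `value` columns
  .withColumn("colName",
    concat_ws("_", lit("final"), col("key"),
              col("leadTime"), col("span"), lit("wk")))

// Pivot so each distinct generated name becomes its own column; `first` is a
// safe aggregate here because each (id, date, colName) group has one value.
val pivoted = exploded
  .groupBy("id", "creation date")
  .pivot("colName")
  .agg(first("value"))

pivoted.show(false)
```

If the set of (key, leadTime, span) combinations is known in advance, passing the precomputed name list to `pivot(column, values)` skips the extra distinct-values job Spark otherwise runs.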