
Scala Spark: creating column names based on other column values


I'm new to Spark and need help getting my data into the following format:

| id   | creation date | final_v1_16_15_wk | final_v1_17_18_wk | final_v2_16_15_wk | final_v2_17_18_wk |
| id_1 | 2020-07-15    | 0.368             | 0.564             | 0.5               | 0.78              |
| id_2 | 2020-07-15    | 0.468             | 0.657             | 0.3               | 0.65              |
The data I have is in this format:

+----------+-------------------------+-------------------+---------+------+
|   id     |       values            |     creation date | leadTime| span |
+----------+-------------------------+-------------------+---------+------+
|id_1      |[[v1, 0.368], [v2, 0.5]] |     2020-07-15    |      16 |  15  |
|id_2      |[[v1, 0.368], [v2, 0.4]] |     2020-07-15    |      16 |  15  |
|id_3      |[[v1, 0.468], [v2, 0.3]] |     2020-07-15    |      16 |  15  |
|id_4      |[[v1, 0.368], [v2, 0.3]] |     2020-07-15    |      16 |  15  |
|id_5      |[[v1, 0.668], [v2, 0.1]] |     2020-07-15    |      16 |  15  |
|id_6      |[[v1, 0.168], [v2, 0.2]] |     2020-07-15    |      16 |  15  |
+----------+-------------------------+-------------------+---------+------+
Using the values from those columns, I need the data in the format below, i.e. new columns whose names are built from the keys of the values column together with the leadTime and span column values:

+----------+--------------+--------------------+--------------------+
|   id     |creation date | final_v1_16_15_wk  |  final_v2_16_15_wk |
+----------+--------------+--------------------+--------------------+
|id_1      |2020-07-15    |       0.368        |         0.5        |
|id_2      |2020-07-15    |       0.368        |         0.4        |
|id_3      |2020-07-15    |       0.468        |         0.3        |
|id_4      |2020-07-15    |       0.368        |         0.3        |
|id_5      |2020-07-15    |       0.668        |         0.1        |
|id_6      |2020-07-15    |       0.168        |         0.2        |
+----------+--------------+--------------------+--------------------+
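In other words, each output column name is an ordinary string assembled from a key of the values column plus the leadTime and span values of that row (the locals below are purely illustrative):

// naming rule: final_<key>_<leadTime>_<span>_wk
val key = "v1"; val leadTime = 16; val span = 15
s"final_${key}_${leadTime}_${span}_wk"   // "final_v1_16_15_wk"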
Another example of this DF:

val df = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18)
).toDF("id", "values", "creation date", "leadTime", "span")

The expected output format is:

| id   | creation date | final_v1_16_15_wk | final_v1_17_18_wk | final_v2_16_15_wk | final_v2_17_18_wk |
| id_1 | 2020-07-15    | 0.368             | 0.564             | 0.5               | 0.78              |
| id_2 | 2020-07-15    | 0.468             | 0.657             | 0.3               | 0.65              |
I tried the following logic to generate the column name/value, but it doesn't work:

val modDF = finalDF.withColumn(
  "final_" + finalDF("values").getItem(0).getItem("_1") + "_" +
    finalDF("leadTime") + "_" + finalDF("span") + "_wk",
  $"values".getItem(0).getItem("_2"))
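This fails because concatenating a String with a Column only calls the Column's toString, so the "name" ends up being the expression text rather than the row's values; withColumn needs a plain driver-side String, which means the scalar values have to be collected first. A minimal sketch of that idea, assuming the Map-based df from the example above (row, lead and sp are illustrative locals):

val row = df.select("leadTime", "span").first
val lead = row.getInt(0)
val sp = row.getInt(1)
// the interpolated name is now an ordinary String, and the map value is looked up by key
val modDF = df.withColumn(s"final_v1_${lead}_${sp}_wk", col("values")("v1"))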
Answer, in reply to the comments:

// run in spark-shell, where `spark` and its implicits are available
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
import spark.implicits._

val df30 = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18))
  .toDF("id", "values", "creation date", "leadTime", "span")

// aggregate each (id, creation date) into parallel lists; the three
// collect_lists are built from the same row stream, so their element order lines up
val df31 = df30.groupBy("id", "creation date")
  .agg(
    collect_list(col("values")).alias("values"),
    collect_list(col("leadTime")).alias("leadTime"),
    collect_list(col("span")).alias("span")
  ).persist(MEMORY_AND_DISK)

// read the first aggregated row on the driver to derive the column names;
// this assumes every id carries the same (leadTime, span) combinations.
// The map values are Doubles, not Floats
val leadTimeArray = df31.select('leadTime).first.getAs[WrappedArray[Int]](0).toArray
val spanArray = df31.select('span).first.getAs[WrappedArray[Int]](0).toArray
val valuesArrayNew = df31.select('values).first.getAs[WrappedArray[Map[String, Double]]](0).toList

// one column name per (map key, leadTime, span) triple, e.g. final_v1_16_15_wk
val newCols = valuesArrayNew
  .zipWithIndex
  .flatMap{ case (v, i) => v.keys.map(k => s"final_${k}_${leadTimeArray(i)}_${spanArray(i)}_wk") }

// indexOf(colName) / 2 maps each name back to its list entry, which
// assumes every map holds exactly the two keys v1 and v2
val resDF = newCols.foldLeft(df31){ (tempDF, colName) =>
  tempDF.withColumn(colName,
    col("values")(newCols.indexOf(colName) / 2)(if (colName.contains("v1")) "v1" else "v2"))
}.drop("values", "leadTime", "span")


resDF.show(false)
//    +----+-------------+-----------------+-----------------+-----------------+-----------------+
//    |id  |creation date|final_v1_16_15_wk|final_v2_16_15_wk|final_v1_17_18_wk|final_v2_17_18_wk|
//    +----+-------------+-----------------+-----------------+-----------------+-----------------+
//    |id_1|2020-07-15   |0.368            |0.5              |0.564            |0.78             |
//    |id_2|2020-07-15   |0.468            |0.3              |0.657            |0.65             |
//    +----+-------------+-----------------+-----------------+-----------------+-----------------+
df31.unpersist()

So the problem is just creating the column names?
@koiralo Yes, the main problem is creating the column names from a combination of the other columns' values. I added another example to the question above.
Thanks @mvasyliv, that helps. One more question: if I have multiple rows with the same "id" but different values in the "leadTime" and "span" columns, how can I group them into a single row with separate columns? Something like this:
val df = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.4), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.368, "v2" -> 0.3), "2020-07-15", 17, 18)
).toDF("id", "values", "creation date", "leadTime", "span")
The "values" Map will have the same keys "v1" and "v2", but its values can differ for each combination of the span and leadTime columns. I updated the DF above:
val df = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18)
).toDF("id", "values", "creation date", "leadTime", "span")
Thanks for the reply. I just updated the question with another example that has multiple rows for the same "id" with different values in the "leadTime" and "span" columns:

| id   | creation date | final_v1_16_15_wk | final_v1_17_18_wk | final_v2_16_15_wk | final_v2_17_18_wk |
| id_1 | 2020-07-15    | 0.368             | 0.564             | 0.5               | 0.78              |
| id_2 | 2020-07-15    | 0.468             | 0.657             | 0.3               | 0.65              |

Please see the second answer.
Thanks a lot @mvasyliv, it works as expected. I tried the above logic on a very large dataset, though, and these steps are very slow:
val leadTimeArray = df31.select('leadTime).first.getAs[WrappedArray[Int]](0).toArray
val spanArray = df31.select('span).first.getAs[WrappedArray[Int]](0).toArray
val valuesArrayNew = df31.select('values).first.getAs[WrappedArray[Map[String, Double]]](0).toList
Even df31.first is very slow.
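Not from the thread, but a sketch of one way around this, assuming the df30 schema and imports above: the first/collect calls force the whole aggregation to run just to read one row on the driver, whereas the target column name can instead be computed per row on the executors and the frame pivoted, so only the small list of distinct names is ever collected (exploded, names and resDF2 are illustrative locals):

// one row per (id, map key), with the target column name built on the executors
val exploded = df30
  .select(col("id"), col("creation date"), col("leadTime"), col("span"),
          explode(col("values")).as(Seq("key", "value")))
  .withColumn("colName",
    format_string("final_%s_%d_%d_wk", col("key"), col("leadTime"), col("span")))

// only the distinct column names, a small list, reach the driver
val names = exploded.select("colName").distinct.as[String].collect.toSeq

// pivot back to one row per (id, creation date) with one column per name
val resDF2 = exploded
  .groupBy("id", "creation date")
  .pivot("colName", names)
  .agg(first("value"))

resDF2.show(false)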