Scala: pivoting a DataFrame column when the pivot column has multiple values


I have a dataframe like the one below:

+------+-------------+------+-----+
|NUM_ID|         TIME|SIGNAL|VALUE|
+------+-------------+------+-----+
|XXXX01|1571634079547|  SIG1|78860|
|XXXX01|1571634090000|  SIG1|25.73|
|XXXX01|1571634042000|  SIG1|25.73|
|XXXX01|1571634050000|  SIG1|25.73|
|XXXX01|1571634050000|  SIG2|25.73|
|XXXX01|1571634066000|  SIG2|25.73|
|XXXX01|1571634074000|  SIG2|25.73|
|XXXX01|1571634090000|  SIG3|25.73|
|XXXX02|1571634088000|  SIG1|25.73|
|XXXX02|1571634040000|  SIG1|25.73|
|XXXX02|1571634048000|  SIG1|25.73|
|XXXX02|1571634056000|  SIG1|25.73|
|XXXX02|1571634088000|  SIG2|25.73|
|XXXX02|1571634072000|  SIG2|25.73|
|XXXX02|1571634080000|  SIG2|25.73|
|XXXX02|1571634088000|  SIG3|25.73|
|XXXX02|1571634094000|  SIG3|25.73|
|XXXX02|1571634038000|  SIG3|25.73|
|XXXX03|1571634046000|  SIG1|25.73|
|XXXX03|1571634054000|  SIG1|25.73|
|XXXX03|1571634062000|  SIG1|25.73|
|XXXX03|1571634070000|  SIG1|25.73|
|XXXX03|1571634078000|  SIG2|25.73|
|XXXX03|1571634092000|  SIG2|25.73|
|XXXX03|1571634036000|  SIG2|25.73|
|XXXX03|1571634044000|  SIG3|25.73|
|XXXX03|1571634052000|  SIG3|25.73|
|XXXX03|1571634060000|  SIG3|25.73|
+------+-------------+------+-----+ 
I want each SIGx in the existing SIGNAL column to become a new column, with the corresponding VALUE as that column's value in each row.

The output should look like this:

+------+-------------+-----+-----+-----+
|NUM_ID|         TIME| SIG1| SIG2| SIG3|
+------+-------------+-----+-----+-----+
|XXXX01|1571634079547|78860| null| null|
|XXXX01|1571634090000|25.73| null|25.73|
|XXXX01|1571634042000|25.73| null| null|
|XXXX01|1571634050000|25.73|25.73| null|
|XXXX01|1571634066000| null|25.73| null|
|XXXX01|1571634074000| null|25.73| null|
|XXXX02|1571634088000|25.73|25.73|25.73|
|XXXX02|1571634040000|25.73| null| null|
|XXXX02|1571634048000|25.73| null| null|
|XXXX02|1571634056000|25.73| null| null|
|XXXX02|1571634072000| null|25.73| null|
|XXXX02|1571634080000| null|25.73| null|
|XXXX02|1571634094000| null| null|25.73|
|XXXX02|1571634038000| null| null|25.73|
|
|
|
+------+-------------+-----+-----+-----+
Values of different SIGx that share the same NUM_ID and TIME should end up in the same row.

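Is there a way to do this? I tried the pivot function, but it does not give the desired result when the pivot column has multiple values.

Any pointers are appreciated. Thanks in advance.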

You can group by "NUM_ID" and "TIME", pivot on "SIGNAL", and take the first value from "VALUE", as shown below:

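import org.apache.spark.sql.functions.first

// One output row per (NUM_ID, TIME); each distinct SIGNAL becomes a column
df.groupBy("NUM_ID", "TIME")
  .pivot("SIGNAL")
  .agg(first("VALUE"))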

Hope this helps.
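To make the answer easier to try out, here is a minimal, self-contained sketch. It assumes a local SparkSession and rebuilds only a few rows of the question's data by hand, with every column typed as string. The names `PivotSignals` and `pivotSignals` are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

object PivotSignals {
  // Group by (NUM_ID, TIME), turn each distinct SIGNAL into a column,
  // and keep the first VALUE seen for every (NUM_ID, TIME, SIGNAL) cell.
  def pivotSignals(df: DataFrame): DataFrame =
    df.groupBy("NUM_ID", "TIME")
      .pivot("SIGNAL")
      .agg(first("VALUE"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-signals")
      .getOrCreate()
    import spark.implicits._

    // A hand-copied subset of the question's rows; VALUE stays a string.
    val df = Seq(
      ("XXXX01", "1571634079547", "SIG1", "78860"),
      ("XXXX01", "1571634090000", "SIG1", "25.73"),
      ("XXXX01", "1571634050000", "SIG1", "25.73"),
      ("XXXX01", "1571634050000", "SIG2", "25.73"),
      ("XXXX01", "1571634090000", "SIG3", "25.73")
    ).toDF("NUM_ID", "TIME", "SIGNAL", "VALUE")

    pivotSignals(df).orderBy("NUM_ID", "TIME").show(false)
    spark.stop()
  }
}
```

Cells with no matching row come out as null. If a (NUM_ID, TIME, SIGNAL) combination occurred more than once, `first` would keep just one of the values, so a different aggregate such as `collect_list` may be a better fit in that case.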

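I tried this, but got the following error:
org.apache.spark.sql.AnalysisException: "VALUE" is not a numeric column. Aggregation function can only be applied on a numeric column.; at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$3.apply(RelationalGroupedDataset.scala:103
The VALUE column is of string type. It holds both DOUBLE and BIGINT values, so casting it to one specific type is not possible either. @Shankar Koirala

Can you provide the schema of the dataframe?

scala> DF.printSchema
root
 |-- NUM_ID: string (nullable = true)
 |-- TIME: string (nullable = true)
 |-- SIGNAL: string (nullable = true)
 |-- VALUE: string (nullable = true)

I also tried it without agg, as
df.groupBy("NUM_ID", "TIME").pivot("SIGNAL")
but how can we view the data after the pivot? show does not work, because it is not a member of RelationalGroupedDataset. @Shankar Koirala

It should always follow a group by and be used with some aggregate function, such as .agg()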
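The two problems raised in the comments can be sketched together. `groupBy(...).pivot(...)` returns a `RelationalGroupedDataset`, which has no `show`; only an aggregation turns it back into a DataFrame. The "not a numeric column" message is what the numeric shortcut aggregates (for example `sum("VALUE")` or `mean("VALUE")`) typically report on a string column, whereas `first` accepts strings. The helper name `pivotKnownSignals` and the idea of listing the signal names up front are assumptions made for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset}
import org.apache.spark.sql.functions.first

object PivotNotes {
  // Hypothetical helper: the caller supplies the signal names, so Spark
  // does not need to scan the data first to discover the pivot values.
  def pivotKnownSignals(df: DataFrame, signals: Seq[String]): DataFrame = {
    // Intermediate result: a RelationalGroupedDataset, which cannot be shown.
    val grouped: RelationalGroupedDataset =
      df.groupBy("NUM_ID", "TIME").pivot("SIGNAL", signals)

    // The aggregation yields a DataFrame again; first() works on the
    // string-typed VALUE column, so no cast is needed.
    grouped.agg(first("VALUE"))
  }
}
```

A call such as `PivotNotes.pivotKnownSignals(df, Seq("SIG1", "SIG2", "SIG3")).show()` would then display the pivoted rows, with the signal columns in the order given.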