Scala: using functions that can't be called directly on GroupedData (like last()) and renaming the results to the original column names

Tags: scala, apache-spark, apache-spark-sql

Suppose we have the following DF:

scala> df.show
+---+----+----+----+-------------------+---+
| id|name| cnt| amt|                 dt|scn|
+---+----+----+----+-------------------+---+
|  1|null|   1|1.12|2000-01-02 00:11:11|112|
|  1| aaa|   1|1.11|2000-01-01 00:00:00|111|
|  2| bbb|null|2.22|2000-01-03 12:12:12|201|
|  2|null|   2|1.13|               null|200|
|  2|null|null|2.33|               null|202|
|  3| ccc|   3|3.34|               null|302|
|  3|null|null|3.33|               null|301|
|  3|null|null| 0.0|2000-12-31 23:59:59|300|
+---+----+----+----+-------------------+---+
I want to get the DF shown below: ordered by scn, grouped by id, taking the last non-null value of every column (except id and scn).
It can be done like this:

scala> :paste
// Entering paste mode (ctrl-D to finish)

df.orderBy("scn")
  .groupBy("id")
  .agg(last("name", true) as "name",
       last("cnt", true) as "cnt",
       last("amt", true) as "amt",
       last("dt", true) as "dt")
  .show

// Exiting paste mode, now interpreting.

+---+----+---+----+-------------------+
| id|name|cnt| amt|                 dt|
+---+----+---+----+-------------------+
|  1| aaa|  1|1.12|2000-01-02 00:11:11|
|  3| ccc|  3|3.34|2000-12-31 23:59:59|
|  2| bbb|  2|2.33|2000-01-03 12:12:12|
+---+----+---+----+-------------------+
In real life I want to handle various DFs with a large number of columns.

My question is: how can I programmatically specify all of the columns (everything except id and scn) in .agg(last(col_name, true))?
The code that generates the source DF:

case class C(id: Integer, name: String, cnt: Integer, amt: Double, dt: String, scn: Integer)

val cc = Seq(
  C(1, null, 1, 1.12, "2000-01-02 00:11:11", 112),
  C(1, "aaa", 1, 1.11, "2000-01-01 00:00:00", 111),
  C(2, "bbb", null, 2.22, "2000-01-03 12:12:12", 201),
  C(2, null, 2, 1.13, null, 200),
  C(2, null, null, 2.33, null, 202),
  C(3, "ccc", 3, 3.34, null, 302),
  C(3, null, null, 3.33, "20001-01-01 00:33:33", 301), // 5-digit year: shows up as null after the timestamp cast, cf. df.show above
  C(3, null, null, 0.00, "2000-12-31 23:59:59", 300)
)

val t = sc.parallelize(cc, 4).toDF()
// Cast the dt strings to real timestamps; unparsable values become null.
val df = t.withColumn("dt", $"dt".cast("timestamp"))
// Every column except the grouping key "id".
val cols = df.columns.filterNot(_.equals("id"))
The solution is similar to the linked one, plus renaming the columns of the resulting DF back to the original names. Since agg takes one Column followed by a varargs list, the expressions are passed as exprs.head, exprs.tail: _*, and toDF(df.columns: _*) then renames the result columns positionally:

val exprs = df.columns.filterNot(_.equals("id")).map(last(_, true))
val r = df.orderBy("scn").groupBy("id").agg(exprs.head, exprs.tail: _*).toDF(df.columns:_*)
The result:

scala> r.show
+---+----+---+----+-------------------+---+
| id|name|cnt| amt|                 dt|scn|
+---+----+---+----+-------------------+---+
|  1| aaa|  1|1.12|2000-01-02 00:11:11|112|
|  3| ccc|  3|3.34|2000-12-31 23:59:59|302|
|  2| bbb|  2|2.33|2000-01-03 12:12:12|202|
+---+----+---+----+-------------------+---+
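One caveat worth noting: Spark SQL does not document any guarantee that the orderBy before the groupBy survives the shuffle, so last() in the aggregation above is, strictly speaking, non-deterministic. A window-based variant makes the ordering explicit; a minimal sketch, assuming the same df as above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, max}

// One window per id, ordered by scn; the frame is widened to the whole
// partition so last(..., ignoreNulls = true) sees every row of the group.
val w = Window
  .partitionBy("id")
  .orderBy("scn")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Overwrite each non-key column with its last non-null value per id,
// then keep a single (now identical) row per id, carrying the max scn.
val valueCols = df.columns.filterNot(Set("id", "scn"))
val r2 = valueCols
  .foldLeft(df)((acc, c) => acc.withColumn(c, last(col(c), ignoreNulls = true).over(w)))
  .withColumn("scn", max(col("scn")).over(w))
  .dropDuplicates("id")

r2.show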
Back in the groupBy version, each aggregated column can also be aliased directly inside agg, so no separate toDF rename pass is needed:

val exprs = df.columns.filterNot(_.equals("id")).map(c => last(c, true).as(c))
val r = df.orderBy("scn").groupBy("id").agg(exprs.head, exprs.tail: _*)

A later comment from the asker: @user8371915, thanks for the link! I read it a few months ago and even upvoted it back then, but this time around I couldn't find it. I've posted an answer that renames the columns after the aggregation; hopefully it helps someone in the future. Could you help me fix what's wrong in this answer --
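For the "various DFs with a large number of columns" case mentioned above, the recipe also fits in a small reusable helper. This is only a sketch; lastNonNullPerKey is a name invented here, not part of the Spark API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.last

// Hypothetical helper: collapse df to one row per keyCol, ordering by
// orderCol and taking the last non-null value of every other column.
// Assumes df has at least one non-key column.
def lastNonNullPerKey(df: DataFrame, keyCol: String, orderCol: String): DataFrame = {
  val exprs = df.columns
    .filterNot(_ == keyCol)
    .map(c => last(c, ignoreNulls = true).as(c))
  df.orderBy(orderCol).groupBy(keyCol).agg(exprs.head, exprs.tail: _*)
}

// e.g. lastNonNullPerKey(df, keyCol = "id", orderCol = "scn").show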