Scala aggregate (sum) for a list of columns over a window
For a list of columns available in a DataFrame, I am having a hard time finding a generic way to compute a sum (or any other aggregate function) over a given window.
// assumes a SparkSession named spark; implicits are needed for toDF and the $-syntax below
import spark.implicits._
import org.apache.spark.sql.functions._

val inputDF = spark
  .sparkContext
  .parallelize(
    Seq(
      (1, 2, 1, 30, 100),
      (1, 2, 2, 30, 100),
      (1, 2, 3, 30, 100),
      (11, 21, 1, 30, 100),
      (11, 21, 2, 30, 100),
      (11, 21, 3, 30, 100)
    ),
    10)
  .toDF("c1", "c2", "offset", "v1", "v2")

inputDF.show
+---+---+------+---+---+
| c1| c2|offset| v1| v2|
+---+---+------+---+---+
| 1| 2| 1| 30|100|
| 1| 2| 2| 30|100|
| 1| 2| 3| 30|100|
| 11| 21| 1| 30|100|
| 11| 21| 2| 30|100|
| 11| 21| 3| 30|100|
+---+---+------+---+---+
Given a DataFrame like the one shown above, it is easy to compute the sum of a list of columns with something similar to the snippet below -
val groupKey = List("c1", "c2").map(x => col(x.trim))
val orderByKey = List("offset").map(x => col(x.trim))
val aggKey = List("v1", "v2").map(c => sum(c).alias(c.trim))
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy(groupKey: _*).orderBy(orderByKey: _*)
val outputDF = inputDF
.groupBy(groupKey: _*)
.agg(aggKey.head, aggKey.tail: _*)
outputDF.show
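For reference, the grouped aggregation above should print something like the following (worked out by hand from the sample rows, not shown in the original post; row order may differ):

+---+---+---+---+
| c1| c2| v1| v2|
+---+---+---+---+
|  1|  2| 90|300|
| 11| 21| 90|300|
+---+---+---+---+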
But I cannot seem to find a similar approach for aggregate functions in a window spec. So far I have only been able to solve this by specifying each column individually, as shown below -
val outputDF2 = inputDF
.withColumn("cumulative_v1", sum(when($"offset".between(-1, 1), inputDF("v1")).otherwise(0)).over(w))
.withColumn("cumulative_v3", sum(when($"offset".between(-2, 2), inputDF("v1")).otherwise(0)).over(w))
I would appreciate it if there is a way to aggregate over a dynamic list of columns. Thanks!

I think I have found an approach that works better than the one stated in the question above:
import org.apache.spark.sql.{Column, DataFrame}

/**
  * Utility method that takes a DataFrame and a list of columns and returns the rolling
  * aggregated values for that list of columns.
  *
  * @param colsToAggregate   Seq[String] of all columns in the input DataFrame to be aggregated
  * @param inputDF           input DataFrame
  * @param f                 aggregate function to apply, given a column name (String => Column)
  * @param partitionByColSeq Seq[String] of column names to partition the inputDF by before applying the aggregate
  * @param orderByColSeq     Seq[String] of column names to order the inputDF by before applying the aggregate
  * @param name_prefix       String to prefix the new columns with, to avoid collisions
  * @param name              function producing the new column names; defaults to identity, reusing the aggregated column names
  * @return output DataFrame
  */
def withRollingAggregateColumns(colsToAggregate: Seq[String],
                                inputDF: DataFrame,
                                f: String => Column,
                                partitionByColSeq: Seq[String],
                                orderByColSeq: Seq[String],
                                name_prefix: String,
                                name: String => String = identity) = {
  val groupByKey = partitionByColSeq.map(x => col(x.trim))
  val orderByKey = orderByColSeq.map(x => col(x.trim))

  import org.apache.spark.sql.expressions.Window
  val w = Window.partitionBy(groupByKey: _*).orderBy(orderByKey: _*)

  // Fold over the list of columns, appending one aggregated window column per element.
  colsToAggregate
    .foldLeft(inputDF)(
      (df, elementInCols) => df
        .withColumn(
          name_prefix + "_" + name(elementInCols),
          f(elementInCols).over(w)
        )
    )
}
Here the utility method takes a DataFrame as input and appends new columns based on the provided function f. It uses the 'withColumn' and 'foldLeft' idioms to iterate over the list of columns that need to be aggregated. To avoid any column-name collisions, it prepends the user-provided 'prefix' to the new aggregated columns.
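As a usage sketch (the call below and the "cumulative" prefix are illustrative assumptions, not part of the original answer), the method could be applied to the question's inputDF roughly like this:

// Hypothetical invocation: running sum of v1 and v2 per (c1, c2), ordered by offset.
// New columns would be named cumulative_v1 and cumulative_v2.
val rolled = withRollingAggregateColumns(
  colsToAggregate = Seq("v1", "v2"),
  inputDF = inputDF,
  f = c => sum(col(c)),
  partitionByColSeq = Seq("c1", "c2"),
  orderByColSeq = Seq("offset"),
  name_prefix = "cumulative"
)
rolled.show

Because name defaults to identity, the aggregated column names are simply reused after the prefix.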
Have you tried using inputDF.types.foreach?

Thanks. Could you elaborate on how I would use each of these in this case? My outputDF2 should contain all the columns from the input along with the running sum for the columns specified in the list.
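For what it's worth, a minimal sketch of such an outputDF2 (my own assumption building on the question's window w and column list, not code from the original thread) could build the running-sum columns from the list and keep every input column in a single select:

// Sketch: keep all input columns and add one running-sum column per entry in the list.
// e.g. for the group (c1=1, c2=2), cumulative_v1 would read 30, 60, 90 across offsets 1, 2, 3.
val runningSums = Seq("v1", "v2").map(c => sum(col(c)).over(w).alias("cumulative_" + c))
val outputDF2 = inputDF.select((col("*") +: runningSums): _*)
outputDF2.show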