Scala Spark: create a new column in a DataFrame from an aggregated count of values in another column
I have a Spark DataFrame that looks like this:
+-----+----------+----------+
| ID| date| count |
+-----+----------+----------+
|54500|2016-05-02| 0|
|54500|2016-05-09| 0|
|54500|2016-05-16| 0|
|54500|2016-05-23| 0|
|54500|2016-06-06| 0|
|54500|2016-06-13| 0|
|54441|2016-06-20| 0|
|54441|2016-06-27| 0|
|54441|2016-07-04| 0|
|54441|2016-07-11| 0|
+-----+----------+----------+
I want to add an extra column containing the record count for each ID in the DataFrame, while avoiding a for loop. The target DataFrame looks like this:
+-----+----------+----------+
| ID| date| count |
+-----+----------+----------+
|54500|2016-05-02| 6|
|54500|2016-05-09| 6|
|54500|2016-05-16| 6|
|54500|2016-05-23| 6|
|54500|2016-06-06| 6|
|54500|2016-06-13| 6|
|54441|2016-06-20| 4|
|54441|2016-06-27| 4|
|54441|2016-07-04| 4|
|54441|2016-07-11| 4|
+-----+----------+----------+
I tried this:
import org.apache.spark.sql.expressions.Window
var s = Window.partitionBy("ID")
var df2 = df.withColumn("count", count.over(s))
which raises this error:
error: ambiguous reference to overloaded definition,
both method count in object functions of type (columnName: String)org.apache.spark.sql.TypedColumn[Any,Long]
and method count in object functions of type (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
match expected type ?
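The error occurs because `count` in `org.apache.spark.sql.functions` is overloaded (one variant takes a column name `String`, the other a `Column`), and calling it bare gives the compiler nothing to disambiguate on. Passing an explicit column resolves it. Below is a minimal, self-contained sketch using `count(lit(1))` over a window partitioned by `ID`; it assumes a local `SparkSession`, and the object name `WindowCountExample` and the inlined sample data are illustrative, not from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

object WindowCountExample extends App {
  // Local session for demonstration purposes only
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("window-count")
    .getOrCreate()
  import spark.implicits._

  // Sample data mirroring the DataFrame in the question
  val df = List(
    (54500, "2016-05-02"), (54500, "2016-05-09"), (54500, "2016-05-16"),
    (54500, "2016-05-23"), (54500, "2016-06-06"), (54500, "2016-06-13"),
    (54441, "2016-06-20"), (54441, "2016-06-27"),
    (54441, "2016-07-04"), (54441, "2016-07-11")
  ).toDF("ID", "date")

  // Partition by ID; count(lit(1)) counts rows per partition and
  // avoids the ambiguous-overload error that a bare `count` triggers
  val w = Window.partitionBy("ID")
  val df2 = df.withColumn("count", count(lit(1)).over(w))
  df2.show()

  spark.stop()
}
```

Unlike a `groupBy`, the window aggregation keeps every original row, so no join back is needed.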
An answer suggested the following approach instead, computing the per-ID counts with a `groupBy` and joining them back:
import spark.implicits._
val df1 = List(54500, 54500, 54500, 54500, 54500, 54500, 54441, 54441, 54441, 54441).toDF("ID")
val df2 = df1.groupBy("ID").count()
df1.join(df2, Seq("ID"), "left").show(false)
+-----+-----+
|ID |count|
+-----+-----+
|54500|6 |
|54500|6 |
|54500|6 |
|54500|6 |
|54500|6 |
|54500|6 |
|54441|4 |
|54441|4 |
|54441|4 |
|54441|4 |
+-----+-----+
Comments:

- A Spark window is what you need here. — That is what I tried, and I got the error.
- @Leothorn can you provide the error details? — I have added the error, and also retried with count(df("count")).
- This answer is wrong as originally written, because you were reducing the number of rows in the final DataFrame. — Oh! Sorry, I didn't check. I've edited the post, please take a look.