Scala: how to find the third-highest amount per ID and fill it into a new column (Cut_of_3)

scala apache-spark

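How can I find the third-highest Amount for each ID and fill it into a new column (Cut_of_3), repeating that same value on every row of the ID? If an ID has no third-highest amount, 100 should be filled in for that ID instead. Please see the sample dataset and the expected result below. Thanks in advance.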

Sample dataset:

ID  Status       Date      Amount
1   New         01/05/20    20
1   Assigned    02/05/20    30
1   In-Progress 02/05/20    50
2   New         02/05/20    30
2   Removed     03/05/20    20
3   New         09/05/20    50
3   Assigned    09/05/20    20
3   In-Progress 10/05/20    30
3   Closed      10/05/20    10
4   New         10/05/20    20
4   Assigned    10/05/20    30

Expected result:

ID  Status       Date      Amount  Cut_of_3
1   New         01/05/20    20        20
1   Assigned    02/05/20    30        20
1   In-Progress 02/05/20    50        20
2   New         02/05/20    30        100
2   Removed     03/05/20    20        100
3   New         09/05/20    50        35
3   Assigned    09/05/20    35        35
3   In-Progress 10/05/20    40        35
3   Closed      10/05/20    10        35
4   New         10/05/20    20        100
4   Assigned    10/05/20    30        100

Here is how to do it with a Window function:

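import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, sort_array}
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named `spark`

// No orderBy needed: without it the frame spans the whole partition,
// so every row sees all amounts of its ID
val window = Window.partitionBy("ID")

// Collect the amounts per ID, sort them descending, and take the third value (index 2)
df.withColumn("Cut_of_3", sort_array(collect_list($"Amount").over(window), false)(2))
  // If there is no third value the lookup returns null (with ANSI mode off); replace null with 100
  .na.fill(100, Seq("Cut_of_3"))
  .sort("ID")
  .show(false)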

Output:

+---+-----------+--------+------+--------+
|ID |Status     |Date    |Amount|Cut_of_3|
+---+-----------+--------+------+--------+
|1  |New        |01/05/20|20    |20      |
|1  |Assigned   |02/05/20|30    |20      |
|1  |In-Progress|02/05/20|50    |20      |
|2  |New        |02/05/20|30    |100     |
|2  |Removed    |03/05/20|20    |100     |
|3  |New        |09/05/20|50    |20      |
|3  |Assigned   |09/05/20|20    |20      |
|3  |In-Progress|10/05/20|30    |20      |
|3  |Closed     |10/05/20|10    |20      |
|4  |New        |10/05/20|20    |100     |
|4  |Assigned   |10/05/20|30    |100     |
+---+-----------+--------+------+--------+
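
As a cross-check of the output above, the same rule can be sketched with plain Scala collections, without Spark. This is only an illustration under stated assumptions: the `Row` case class and the `cutOf3` helper are hypothetical names that do not appear in the thread, the rows are copied from the sample dataset, and duplicate amounts are counted as separate values, which mirrors what `collect_list` does in the answer.

```scala
// Hypothetical row type used only for this sketch
final case class Row(id: Int, status: String, date: String, amount: Int)

// Rows copied from the sample dataset in the question
val rows = Seq(
  Row(1, "New", "01/05/20", 20),
  Row(1, "Assigned", "02/05/20", 30),
  Row(1, "In-Progress", "02/05/20", 50),
  Row(2, "New", "02/05/20", 30),
  Row(2, "Removed", "03/05/20", 20),
  Row(3, "New", "09/05/20", 50),
  Row(3, "Assigned", "09/05/20", 20),
  Row(3, "In-Progress", "10/05/20", 30),
  Row(3, "Closed", "10/05/20", 10),
  Row(4, "New", "10/05/20", 20),
  Row(4, "Assigned", "10/05/20", 30)
)

// Third-highest amount per ID (duplicates counted separately);
// falls back to `default` when an ID has fewer than three amounts
def cutOf3(rows: Seq[Row], default: Int = 100): Map[Int, Int] =
  rows.groupBy(_.id).map { case (id, rs) =>
    val sortedDesc = rs.map(_.amount).sorted(Ordering[Int].reverse)
    id -> sortedDesc.lift(2).getOrElse(default)
  }

val cut = cutOf3(rows)

// Repeat the per-ID value on every row of that ID
val result = rows.map(r => (r, cut(r.id)))
```

For the sample rows this gives 20 for IDs 1 and 3 and the default 100 for IDs 2 and 4, in line with the Cut_of_3 column printed above.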

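How do you get 35 in the output? What is the logic behind it?
What have you tried so far?
Thanks Koiralo, this is how you find the third-highest value for the first ID "1", and similarly for ID "3".
It is a condition like this. One possible workaround would be to compute a dense rank partitioned by ID and join the result back to the main df under an alias.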
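
The workaround mentioned in the last comment (a dense rank partitioned by ID, joined back to the main DataFrame) might look roughly like the sketch below. It is not code from the thread: the local `SparkSession` and the names `byAmountDesc`, `third` and `result` are assumptions made for illustration. Note also that `dense_rank` gives equal amounts the same rank, so with duplicate amounts it picks the third distinct value, whereas the `collect_list` approach counts duplicates separately. On the sample data, which has no duplicates within an ID, both should agree.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, dense_rank, lit}

// Assumption: a local session, only for trying the sketch out
val spark = SparkSession.builder().master("local[*]").appName("cut-of-3-sketch").getOrCreate()
import spark.implicits._

// Rows copied from the sample dataset in the question
val df = Seq(
  (1, "New", "01/05/20", 20),
  (1, "Assigned", "02/05/20", 30),
  (1, "In-Progress", "02/05/20", 50),
  (2, "New", "02/05/20", 30),
  (2, "Removed", "03/05/20", 20),
  (3, "New", "09/05/20", 50),
  (3, "Assigned", "09/05/20", 20),
  (3, "In-Progress", "10/05/20", 30),
  (3, "Closed", "10/05/20", 10),
  (4, "New", "10/05/20", 20),
  (4, "Assigned", "10/05/20", 30)
).toDF("ID", "Status", "Date", "Amount")

// Rank amounts within each ID, highest first
val byAmountDesc = Window.partitionBy("ID").orderBy(col("Amount").desc)

// Keep one row per ID holding the amount with dense rank 3 (if any)
val third = df
  .withColumn("rnk", dense_rank().over(byAmountDesc))
  .filter(col("rnk") === 3)
  .select(col("ID"), col("Amount").as("Cut_of_3"))
  .dropDuplicates(Seq("ID"))

// Left join keeps IDs without a third value; those get the default 100
val result = df
  .join(third, Seq("ID"), "left")
  .withColumn("Cut_of_3", coalesce(col("Cut_of_3"), lit(100)))
  .orderBy("ID")

result.show(false)
```

The left join is what preserves IDs 2 and 4, which have no rank-3 row; `coalesce` then supplies the default in place of `na.fill`.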