
How to apply partitioning in a Spark Scala DataFrame with multiple columns?


I have the following DataFrame df in Spark Scala:

id   project  start_date    Change_date   designation
1    P1       08/10/2018    01/09/2017    2
1    P1       08/10/2018    02/11/2018    3
1    P1       08/10/2018    01/08/2016    1
I need to pick the designation whose Change_date is closest to the start_date, considering only change dates earlier than the start date.

Expected output:

id   project  start_date    designation
1      P1     08/10/2018    2
This is because the change date 01/09/2017 is the closest date before the start date.

Can anyone suggest how to achieve this?

This is not about selecting the first row, but about selecting the designation that corresponds to the change date closest to the start date.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???  // your existing SparkSession
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2), 
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")

// parse the dd/MM/yyyy strings into proper DateType columns
val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))
Then find the difference in days between the two dates:

val diff = parsed
  .withColumn("diff", datediff($"start_date", $"changed_date"))
  .where($"diff" > 0)  // keep only change dates strictly before the start date
Then apply a solution of your choice, for example a window function. If grouping by id:

import org.apache.spark.sql.expressions.Window

// rank the candidates within each id, smallest day gap first
val w = Window.partitionBy($"id").orderBy($"diff")

// keep only the top-ranked row per id
diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// |  1|        P1|2018-10-08|  2017-09-01|          2| 402|
// +---+----------+----------+------------+-----------+----+
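To match the expected output exactly, you can project the remaining columns and rename project_id back to project (a small follow-up sketch):

diff.withColumn("rn", row_number.over(w))
  .where($"rn" === 1)
  .select($"id", $"project_id".alias("project"), $"start_date", $"designation")
  .show
// +---+-------+----------+-----------+
// | id|project|start_date|designation|
// +---+-------+----------+-----------+
// |  1|     P1|2018-10-08|          2|
// +---+-------+----------+-----------+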
Thanks for checking! This is not a duplicate, because I don't want just the first row; I want the data ordered according to the diff logic.
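If the goal is to keep every candidate row ranked by that diff logic rather than only the closest one, you can keep the rank column instead of filtering on it (a sketch reusing the window w defined above):

diff.withColumn("rank", row_number.over(w)).orderBy($"id", $"rank").show
// +---+----------+----------+------------+-----------+----+----+
// | id|project_id|start_date|changed_date|designation|diff|rank|
// +---+----------+----------+------------+-----------+----+----+
// |  1|        P1|2018-10-08|  2017-09-01|          2| 402|   1|
// |  1|        P1|2018-10-08|  2016-08-01|          1| 798|   2|
// +---+----------+----------+------------+-----------+----+----+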