Scala: compare a string value with a Spark DataFrame column and update the string based on a condition

Tags: scala, apache-spark, apache-spark-sql


I have an input DataFrame with id, group, and date fields. I need to compare the dates in input_df against a delimited string and, whenever the majority of a group's records in input_df carry a newer date, update that group's date in the string.

val lastRunDtDetails = "A#2021-04-02,B#2021-04-01,C#2021-04-02"

val input_df = sc.parallelize(Seq(
  (1,"A","2021-04-01"), (2,"A","2021-04-02"), (3,"A","2021-04-02"),
  (4,"A","2021-04-02"), (5,"A","2021-04-03"), (6,"B","2021-04-01"),
  (7,"B","2021-04-02"), (8,"B","2021-04-02"), (9,"B","2021-04-02"),
  (10,"B","2021-04-02"), (11,"C","2021-04-01"), (12,"C","2021-04-01"),
  (13,"C","2021-04-01"), (14,"C","2021-04-02"), (15,"C","2021-04-03")
)).toDF("id","group","date")

input_df.show()
+---+-----+----------+
| id|group|      date|
+---+-----+----------+
|  1|    A|2021-04-01|
|  2|    A|2021-04-02|
|  3|    A|2021-04-02|
|  4|    A|2021-04-02|
|  5|    A|2021-04-03|
|  6|    B|2021-04-01|
|  7|    B|2021-04-02|
|  8|    B|2021-04-02|
|  9|    B|2021-04-02|
| 10|    B|2021-04-02|
| 11|    C|2021-04-01|
| 12|    C|2021-04-01|
| 13|    C|2021-04-01|
| 14|    C|2021-04-02|
| 15|    C|2021-04-03|
+---+-----+----------+

val input_df_count = input_df.groupBy("group").count.orderBy("group")
input_df_count: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [group: string, count: bigint]    

input_df_count.show()
+-----+-----+
|group|count|
+-----+-----+
|    A|    5|
|    B|    5|
|    C|    5|
+-----+-----+

val max_dt_count_df = input_df.groupBy("group", "date").count()
  .groupBy("group").agg(max(struct("count", "date")) as "max_dt")
  .select($"group", $"max_dt.date", $"max_dt.count" as "max_count")
  .orderBy("group")
max_dt_count_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [group: string, date: string ... 1 more field]    

max_dt_count_df.show()
+-----+----------+---------+
|group|      date|max_count|
+-----+----------+---------+
|    A|2021-04-02|        3|
|    B|2021-04-02|        4|
|    C|2021-04-01|        3|
+-----+----------+---------+
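
The max(struct(...)) step works because Spark compares structs field by field, so the maximum of struct(count, date) per group is the row with the highest count, with the later date breaking ties. For reference, here is a sketch of an equivalent formulation with a window function (assuming spark.implicits._ is in scope, as in the shell session above; max_dt_count_via_window is my name for it):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank the (group, date) counts within each group, highest count first,
// and keep only the top-ranked row per group.
val byCountDesc = Window.partitionBy("group").orderBy($"count".desc, $"date".desc)

val max_dt_count_via_window = input_df
  .groupBy("group", "date").count()
  .withColumn("rn", row_number().over(byCountDesc))
  .filter($"rn" === 1)
  .select($"group", $"date", $"count" as "max_count")
  .orderBy("group")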

val percent_df = input_df_count
  .join(max_dt_count_df, input_df_count("group") === max_dt_count_df("group"), "inner")
  .select(input_df_count("*"), max_dt_count_df("date"), max_dt_count_df("max_count"))
  .withColumn("percentile", ($"max_count" / $"count") * 100)
  .orderBy("group")
2021-04-07 16:40:33 WARN  Column:66 - Constructing trivially true equals predicate, 'group#23 = group#23'. Perhaps you need to use aliases.
percent_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [group: string, count: bigint ... 3 more fields]    

percent_df.show()
+-----+-----+----------+---------+----------+
|group|count|      date|max_count|percentile|
+-----+-----+----------+---------+----------+
|    A|    5|2021-04-02|        3|      60.0|
|    B|    5|2021-04-02|        4|      80.0|
|    C|    5|2021-04-01|        3|      50.0|
+-----+-----+----------+---------+----------+
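
As an aside, the "trivially true equals predicate" warning above appears because both sides of the join share lineage on an identically named column. Joining on the column name with Seq("group") (Spark's usingColumns join) avoids the warning and keeps only one group column; a sketch of the same step (percent_df2 is a hypothetical name):

// Same result as percent_df, but joined on the column name instead of an
// equality predicate, so "group" appears only once in the output.
val percent_df2 = input_df_count
  .join(max_dt_count_df, Seq("group"))
  .withColumn("percentile", ($"max_count" / $"count") * 100)
  .orderBy("group")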
Now I want to compare the dates in percent_df with the dates in the input string: if the date from the df is on or after the date from the string and the df's percentage is at least 75%, the string should be updated with the date from the df. So for the input above, the string I expect looks like this:

val newDatesDetails = "A#2021-04-02,B#2021-04-02,C#2021-04-02"
I can derive this by converting the input string into a DataFrame:

val last_run_details_df = sc.parallelize(
  lastRunDtDetails.split(",").map(_.split("#")).map { case Array(a, b) => (a, b) }
).toDF("group", "previous_date")
last_run_details_df: org.apache.spark.sql.DataFrame = [group: string, previous_date: string]    

last_run_details_df.show()
+-----+-------------+
|group|previous_date|
+-----+-------------+
|    A|   2021-04-02|
|    B|   2021-04-01|
|    C|   2021-04-02|
+-----+-------------+

val temp_df = percent_df
  .join(last_run_details_df, percent_df("group") === last_run_details_df("group"), "inner")
  .select(percent_df("*"), last_run_details_df("previous_date"))
  .orderBy("group")


temp_df.withColumn("new_dates",
  when($"date" >= $"previous_date" && $"percentile" >= 75, $"date")
    .otherwise($"previous_date")
).show()


val newDatesDetails = temp_df
  .withColumn("new_dates",
    when($"date" >= $"previous_date" && $"percentile" >= 75, $"date")
      .otherwise($"previous_date"))
  .select(concat(col("group"), lit("#"), col("new_dates")) as "new_dates")
  .collect.mkString(",").replaceAll("\\[|\\]", "")

newDatesDetails: String = A#2021-04-02,B#2021-04-02,C#2021-04-02
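
A small cleanup for that last step: collecting a Dataset[String] instead of Rows makes the replaceAll unnecessary, since no Row brackets ever appear (a sketch, assuming spark.implicits._ is in scope):

val newDatesDetails = temp_df
  .withColumn("new_dates",
    when($"date" >= $"previous_date" && $"percentile" >= 75, $"date")
      .otherwise($"previous_date"))
  .select(concat($"group", lit("#"), $"new_dates"))
  .as[String]        // single-column DataFrame, so a String encoder applies
  .collect()
  .mkString(",")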
I don't think this is an ideal way to derive newDatesDetails. Is there a better way to arrive at the final string described above?
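
One possible tightening (a sketch under assumptions, not a definitive answer): keep the input string as a plain Scala Map instead of a second DataFrame, compute the in-group percentage with a window in a single pass, and apply the update rule on the driver, since there is only one row per group to collect. The names lastRunMap, byGroup, dominant, and chosen are mine; everything else uses APIs already shown above.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Parse the string once into a Map; no extra DataFrame or join is needed for it.
val lastRunMap: Map[String, String] = lastRunDtDetails.split(",")
  .map(_.split("#"))
  .map { case Array(g, d) => g -> d }
  .toMap

val byGroup = Window.partitionBy("group")

// Count per (group, date), convert counts to in-group percentages, and keep the
// dominant date per group with the same max(struct(...)) trick used earlier.
val dominant = input_df
  .groupBy("group", "date").count()
  .withColumn("percentile", $"count" / sum("count").over(byGroup) * 100)
  .groupBy("group")
  .agg(max(struct($"percentile", $"date")) as "best")
  .select($"group", $"best.date" as "date", $"best.percentile" as "percentile")

// One row per group, so collecting is cheap; apply the update rule on the driver
// and assemble the final string directly.
val newDatesDetails = dominant.collect()
  .map { row =>
    val g    = row.getString(0)
    val d    = row.getString(1)
    val p    = row.getDouble(2)
    val prev = lastRunMap(g)
    val chosen = if (d >= prev && p >= 75) d else prev // ISO dates compare correctly as strings
    s"$g#$chosen"
  }
  .sorted
  .mkString(",")
// expected: A#2021-04-02,B#2021-04-02,C#2021-04-02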