
Spark 2.2 DataFrame [Scala]


The table below is the input dataset, and the one after it is the expected output. The catch is that, for each status, we should count the number of orders in which it occurs, not the total number of occurrences. Can we achieve this with Spark DataFrames in Scala? Thanks in advance for your help.

OrderNo    Status1      Status2        Status3
123        Completed    Pending        Pending
456        Rejected     Completed      Completed
789        Pending      In Progress    Completed

Expected output:

Pending        2
Rejected       1
Completed      3
In Progress    2

You can try the code below. It counts the number of distinct OrderNo values for each status. I hope it helps.
Here is the result. (Note: In Progress appears in only one order in the test data, so its count is 1 rather than the expected 2.)


+-----------+------------+
|     Status|DistOrderCnt|
+-----------+------------+
|  Completed|           3|
|In Progress|           1|
|    Pending|           2|
|   Rejected|           1|
+-----------+------------+

Comments: "This is an amazing answer, you saved my day, it works perfectly :)" / "I'm glad it helped. Please accept the answer if it was useful." / "Yes, accepted."
import org.apache.spark.sql.functions.{array, collect_set, explode, size}
import spark.implicits._

val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

// Gather the three status columns into an array, explode it into one row
// per (OrderNo, Status), then count distinct OrderNo values per status.
val newDF = rawDF
  .withColumn("All_Status", array($"Status1", $"Status2", $"Status3"))
  .withColumn("Status", explode($"All_Status"))
  .groupBy("Status")
  .agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))
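The same distinct-count logic can be sanity-checked without a Spark cluster using plain Scala collections. This is only an illustrative sketch (the `rows` data and value names mirror the answer's `rawDF`, and `distOrderCnt` is a hypothetical name); the pipeline below is the collection analogue of `explode` + `collect_set` + `size`:

```scala
// Input rows: (OrderNo, the statuses recorded for that order).
val rows = List(
  ("123", List("Completed", "Pending", "Pending")),
  ("456", List("Rejected", "Completed", "Completed")),
  ("789", List("Pending", "In Progress", "Completed"))
)

// Pair each status with its order number, dedupe within an order, then
// count the distinct orders per status.
val distOrderCnt: Map[String, Int] = rows
  .flatMap { case (orderNo, statuses) => statuses.distinct.map(s => (s, orderNo)) }
  .groupBy { case (status, _) => status }
  .map { case (status, pairs) => status -> pairs.map(_._2).distinct.size }

println(distOrderCnt)
// e.g. Map(Completed -> 3, Pending -> 2, Rejected -> 1, In Progress -> 1)
```

Because an order contributes at most once per status, `Pending` counts as 2 (orders 123 and 789) even though it appears three times in the raw data.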