Scala Spark: get the maximum number of consecutive decreasing values

scala dataframe apache-spark pyspark

My requirement is to get the number of decreasing values, up to a maximum.

Here is my input dataset:

+---+-------+
| id| amount|
+---+-------+
|  1|   10.0|
|  1|    9.0|
|  1|    7.0|
|  1|    6.0|
|  2|   50.0|
|  2|   60.0|
|  2|   70.0|
|  3|   90.0|
|  3|   80.0|
|  3|   90.0|
+---+-------+

The result I need is as follows:

+---+--------+
| id| outcome|
+---+--------+
|  1|       3|
|  2|       0|
|  3|       2|
+---+--------+

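My outcome (the new column) is computed per id (group by id) and counts how many times the value decreased consecutively, up to 3. For id 1, even though the count comes to 4, I only want a maximum of 3.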

Any suggestion or help in Spark SQL or with the Spark DataFrame API (Scala) would be appreciated.
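To pin down the rule before moving to Spark, here is a minimal plain-Python sketch of one reading of the requirement: a value is counted when it is smaller than the value before it or larger than the value after it, and the per-id total is capped at 3. This reading is an assumption chosen because it reproduces the expected table; the helper name capped_decrease_count and the cap parameter are illustrative only.

```python
from itertools import groupby


def capped_decrease_count(amounts, cap=3):
    """Count values that take part in a decrease, capped at `cap`."""
    count = 0
    for i, value in enumerate(amounts):
        smaller_than_prev = i > 0 and value < amounts[i - 1]
        larger_than_next = i + 1 < len(amounts) and value > amounts[i + 1]
        if smaller_than_prev or larger_than_next:
            count += 1
    return min(count, cap)


# The input dataset from the question, in the row order shown there.
rows = [(1, 10.0), (1, 9.0), (1, 7.0), (1, 6.0),
        (2, 50.0), (2, 60.0), (2, 70.0),
        (3, 90.0), (3, 80.0), (3, 90.0)]

# groupby needs rows that are already grouped by id, which holds here.
outcome = {}
for key, group in groupby(rows, key=lambda row: row[0]):
    outcome[key] = capped_decrease_count([amount for _, amount in group])

print(outcome)  # {1: 3, 2: 0, 3: 2}
```

A small reference like this can be handy for checking a Spark implementation on a few hand-made groups.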

Here is a suggestion using pyspark, which you can try to replicate in Scala or SQL:

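from pyspark.sql import functions as F
from pyspark.sql import Window

# The data has no ordering column, so order the rows of each id by a generated id
w = Window.partitionBy("id").orderBy(F.monotonically_increasing_id())

(df.withColumn("Diff", F.col("amount") - F.lag("amount").over(w))  # change from the previous row
   .withColumn("k", F.lead("Diff").over(w))                         # Diff of the next row
   .fillna(0, subset="k").groupby("id").agg(
  # count rows that dropped, plus the first row of an id when the row after it dropped
  F.sum(F.when((F.isnull("Diff") & (F.col("k") < 0)) | (F.col("Diff") < 0), 1).otherwise(0))
  .alias("outcome")
).withColumn("outcome", F.when(F.col("outcome") >= 3, 3).otherwise(F.col("outcome"))) ).show()  # cap at 3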
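The answer above gives no walkthrough of its intermediate columns, so here is a small plain-Python sketch that emulates Diff, k and the counting condition for a single id. It is only an illustration of how the expression appears to behave, not Spark code; the function name trace_columns is hypothetical, and it assumes the partition has at least one row and no null amounts.

```python
def trace_columns(amounts):
    """Emulate the Diff, k and counting-flag columns for one id partition."""
    # Diff: amount minus the previous amount; None (null) on the first row
    diff = [None] + [cur - prev for prev, cur in zip(amounts, amounts[1:])]
    # k: Diff of the next row; the null on the last row is filled with 0
    k = diff[1:] + [0]
    # A row is counted when it dropped, or when it is the first row
    # and the row after it dropped
    flags = [(d is None and nxt < 0) or (d is not None and d < 0)
             for d, nxt in zip(diff, k)]
    return diff, k, flags


diff, k, flags = trace_columns([90.0, 80.0, 90.0])  # id 3 from the question
print(diff)                # [None, -10.0, 10.0]
print(k)                   # [-10.0, 10.0, 0]
print(flags)               # [True, True, False]
print(min(sum(flags), 3))  # 2
```

The first row has no previous value, so it is only counted through k, which is why the lead of Diff is needed at all.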

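First, you need an ordering column to compute the decreases. Your example has no index, so we can use monotonically_increasing_id to build an index column. Then we can use a window with the lag and lead functions to get what you want: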

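import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the 'amount column syntax; spark is the active SparkSession

val win = Window.partitionBy("id").orderBy("index")

df
  .withColumn("index", monotonically_increasing_id())
  // a row is part of a decrease if its amount is less than the previous one
  // or greater than the next one
  .withColumn("decrease", (lag('amount, 1).over(win) > 'amount) ||
                          (lead('amount, 1).over(win) < 'amount)
  )
  .groupBy("id")
  // we need to cast the boolean to an int to sum them
  .agg(sum('decrease cast "int") as "outcome")
  // cap the outcome at 3
  .withColumn("outcome", when('outcome > 3, lit(3)).otherwise('outcome))
  .orderBy("id").show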

+---+-------+
| id|outcome|
+---+-------+
|  1|      3|
|  2|      0|
|  3|      2|
+---+-------+

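Spark DataFrames are unordered, and there is no ordering column in your DataFrame. Without an ordering, a "decrease" relative to the previous row is not defined.

It is a bit hard to grasp at first, but it is definitely brilliant.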
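The point about ordering can be illustrated without Spark. The plain-Python sketch below counts the values that sit on either end of a decreasing step and shows that the same values in a different row order give a different count. The helper name values_in_decrease is hypothetical, and the reordered list is made up for illustration.

```python
def values_in_decrease(amounts):
    """Count values that are on either end of a decreasing step."""
    flagged = set()
    for i, (prev, cur) in enumerate(zip(amounts, amounts[1:])):
        if cur < prev:
            flagged.update((i, i + 1))  # both rows of the decreasing step
    return len(flagged)


original = [90.0, 80.0, 90.0]   # id 3 in the order shown in the question
reordered = [80.0, 90.0, 90.0]  # same values, different row order

print(values_in_decrease(original))   # 2
print(values_in_decrease(reordered))  # 0
```

If the data has a real ordering column such as a timestamp or a sequence number, ordering the window by that column is likely a safer choice than relying on a generated id.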

+---+-------+
| id|outcome|
+---+-------+
|  1|      3|
|  2|      0|
|  3|      2|
+---+-------+