Apache Spark: get the latest value within a time range, or null


I have a huge dataset:

| Date       | ID | Value |
+------------+----+-------+
| 10-10-2020 | 1  | 1     |
| 10-11-2020 | 1  | 2     |
| 10-12-2020 | 1  | 3     |
| 10-13-2020 | 1  | 4     |
| 10-10-2020 | 2  | 5     |
| 10-11-2020 | 2  | 6     |
| 10-12-2020 | 2  | 7     |
| 10-09-2020 | 3  | 8     |
| 10-08-2020 | 4  | 9     |

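As you can see, this sample contains 4 IDs with different date ranges.

I have custom logic that uses the RangeBetween function to compute some derived values. Assume it is a simple sum over a defined time range.

What I need to produce is a result like this (explained below):

This example assumes that today is 10-13-2020.
For each ID, I need the sum of the values over two ranges: 2 days and 4 days.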
1. The table contains two calculations for the same ranges, one starting from today and one starting from the day before (columns "last X days" and "prev X days").
2. If all values exist in a range, the result is simply the sum of the range (example with ID = 1).
3. If some values are missing in a range, treat them as zero (example with ID = 2).
4. If no values exist in the defined range, but there is at least one value in the range as of the day before, assume there was a sum yesterday but none today, and set it to zero (example with ID = 3).
5. If there are no values in the range or in the range as of the day before, do not include the ID in the result set (example with ID = 4).
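The rules above can be modelled in plain Scala (no Spark) to make the intended null handling concrete. This is only a sketch on the sample data: `rangeSum` is a hypothetical helper, "today" is fixed at 2020-10-13 as in the example, and the 5-day lookback used for the exclusion rule is an assumption about how "the range as of the day before" should be read for the 4-day range.

```scala
import java.time.LocalDate

// Sample rows from the question: (date, id, value).
val rows = Seq(
  (LocalDate.of(2020, 10, 10), 1, 1),
  (LocalDate.of(2020, 10, 11), 1, 2),
  (LocalDate.of(2020, 10, 12), 1, 3),
  (LocalDate.of(2020, 10, 13), 1, 4),
  (LocalDate.of(2020, 10, 10), 2, 5),
  (LocalDate.of(2020, 10, 11), 2, 6),
  (LocalDate.of(2020, 10, 12), 2, 7),
  (LocalDate.of(2020, 10, 9), 3, 8),
  (LocalDate.of(2020, 10, 8), 4, 9)
)

val today = LocalDate.of(2020, 10, 13)

// Sum of one ID's values with date in [today - (start - 1), today - end].
// Returns None when no row falls in the range (like SQL sum returning null).
def rangeSum(id: Int, start: Int, end: Int = 0): Option[Int] = {
  val from = today.minusDays(start - 1)
  val to = today.minusDays(end)
  val hits = rows.collect {
    case (d, i, v) if i == id && !d.isBefore(from) && !d.isAfter(to) => v
  }
  if (hits.isEmpty) None else Some(hits.sum)
}

// Rule 5: keep only IDs with at least one value in the wider 5-day range.
// Rules 3 and 4: for the IDs that remain, a missing sum becomes 0.
val result: Seq[(Int, Int, Int)] =
  rows.map(_._2).distinct.sorted
    .filter(id => rangeSum(id, 5).isDefined)
    .map(id => (id, rangeSum(id, 2).getOrElse(0), rangeSum(id, 4).getOrElse(0)))

result.foreach(println) // (id, last2DaysSum, last4DaysSum)
```

ID 4 is dropped because its only row (10-08) lies outside the 5-day lookback, while ID 3 is kept with zeros.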

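Now I have this code:
open Microsoft.Spark.Sql
open Microsoft.Spark.Sql.Expressions

// Note: casting to timestamp and then to long yields epoch seconds,
// so the RangeBetween offsets below are measured in seconds, not days.
let last2Days =
    Window
        .PartitionBy("ID")
        .OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
        .RangeBetween(-1L, 0L)

let prev2Days =
    Window
        .PartitionBy("ID")
        .OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
        .RangeBetween(-2L, -1L)

df
    .WithColumn("last2daysSum", Functions.Sum("value").Over(last2Days))
    // The original referenced an undefined last4Days here; prev2Days is the window defined above.
    .WithColumn("prev2daysSum", Functions.Sum("value").Over(prev2Days))
    .WithColumn("result2Days", Functions.Col("last2daysSum"))
    .Where(Functions.Col("Date").EqualTo(Functions.Lit("10-13-2020")))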
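For comparison, here is a minimal Scala sketch of a window whose offsets are measured in whole days rather than seconds. It assumes the dates are ISO strings (`yyyy-MM-dd`) that `to_date` can parse and a local `SparkSession`; ordering by `datediff` from a fixed epoch date is one possible choice of day number, not something taken from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("range-between-days")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("2020-10-10", 1, 1), ("2020-10-11", 1, 2), ("2020-10-12", 1, 3),
  ("2020-10-13", 1, 4), ("2020-10-10", 2, 5), ("2020-10-11", 2, 6),
  ("2020-10-12", 2, 7), ("2020-10-09", 3, 8), ("2020-10-08", 4, 9)
).toDF("Date", "ID", "Value").withColumn("Date", to_date($"Date"))

// Order by a whole-day number so that rangeBetween offsets mean days.
val dayNumber = datediff($"Date", to_date(lit("1970-01-01")))

val last2Days = Window.partitionBy($"ID").orderBy(dayNumber).rangeBetween(-1, 0)
val prev2Days = Window.partitionBy($"ID").orderBy(dayNumber).rangeBetween(-2, -1)

val windowed = df
  .withColumn("last2DaysSum", sum($"Value").over(last2Days))
  .withColumn("prev2DaysSum", sum($"Value").over(prev2Days))
  .where($"Date" === to_date(lit("2020-10-13")))

windowed.show()
```

Because the final filter keeps only rows dated 2020-10-13, only ID 1 survives in this sample. IDs with no row on the target date drop out entirely, which is the limitation of the window approach that the rules for IDs 2 and 3 run into.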

For example, #1 (when the result is taken from last2daysSum).
Is it possible to solve this without a shuffle?
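For question #1: if you only need the calculation for one specific date, then groupBy and agg are simpler and should run faster. The trick is to use when inside an aggregate function such as sum.

For questions #2 and #3: you can coalesce to zero, and filter out the rows that are entirely null before doing so. If the range you need to filter on is wider than the range you want to display (so that rows which had values earlier but have none now are still included), you can add an extra calculation over the longer period and drop it after filtering. See the code example below.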
import org.apache.spark.sql.functions._
import spark.implicits._ // already in scope in spark-shell

val data = Seq(
  ("2020-10-10", 1, 1),
  ("2020-10-11", 1, 2),
  ("2020-10-12", 1, 3),
  ("2020-10-13", 1, 4),
  ("2020-10-10", 2, 5),
  ("2020-10-11", 2, 6),
  ("2020-10-12", 2, 7),
  ("2020-10-09", 3, 8),
  ("2020-10-08", 4, 9)
).toDF("Date", "ID", "Value").withColumn("Date", to_date($"Date"))

// Sums Value over the dates from (now - (start - 1) days) to (now - end days), inclusive.
// The result is null when no row of the group falls in that range.
def sumLastDays(now: java.sql.Timestamp, start: Int, end: Int = 0) =
  sum(when($"Date".between(date_sub(lit(now), start - 1), date_sub(lit(now), end)), $"Value"))

val now = java.sql.Timestamp.valueOf("2020-10-13 00:00:00")

data
  .groupBy($"ID")
  .agg(
    sumLastDays(now, 2).as("last2DaysSum"),
    sumLastDays(now, 4).as("last4DaysSum"),
    sumLastDays(now, 4, 2).as("prev2DaysSum"),
    sumLastDays(now, 5).as("last5DaysSum") // wider range, used only for filtering
  )
  .filter($"last5DaysSum".isNotNull)
  .drop($"last5DaysSum")
  .withColumn("last4DaysSum", coalesce($"last4DaysSum", lit(0)))
  .withColumn("last2DaysSum", coalesce($"last2DaysSum", lit(0)))
  .withColumn("prev2DaysSum", coalesce($"prev2DaysSum", lit(0)))
  .orderBy($"ID")
  .show()

Result:
+---+------------+------------+------------+
| ID|last2DaysSum|last4DaysSum|prev2DaysSum|
+---+------------+------------+------------+
|  1|           7|          10|           3|
|  2|           7|          18|          11|
|  3|           0|           0|           0|
+---+------------+------------+------------+

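Note: I am not sure whether by prev2Days you mean the 2-day interval preceding the current 2-day interval, or the last 2-day interval as of yesterday, because in the expected-result table ID 1 sums Oct 11-12 and ID 2 sums Oct 10-11 for the previous 2 days. You can adjust the range parameters if you need something else. I assumed that prev2Days does not overlap last2Days; if you want overlapping 2-day ranges, just change it to sumLastDays(now, 3, 1).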
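Thanks for your reply. Is it possible to use aggregate functions to compute the ratio of a specific ID's sum to that of a common group of IDs? For example, I have a category with 10 different IDs, and I need the ratio of a specific ID's sum within the defined range to the sum for the whole category.

If you only need the ratio of counts for one or a few specific IDs against the group total, you can use the same trick: count(when($"ID" === lit(123), lit(1))) / count(lit(1)). If you need it for all IDs, then perhaps you can use window functions to compute the different counts within one aggregation and divide them by each other, but I am not sure exactly how. If there is only one total count to divide by, you can compute it, put it in a val and pass it in: val totalCount = data.count; data.groupBy($"ID").count.withColumn("ratio", $"count" / lit(totalCount))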
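One way the "ratio for all IDs" idea might look is to aggregate per ID first and then divide by a windowed total per category. This is only a sketch: the `Category` column and the sample rows are hypothetical (the original dataset has no category column), and it assumes a local `SparkSession`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("id-to-category-ratio")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample: (Category, ID, Value).
val data = Seq(
  ("A", 1, 10),
  ("A", 2, 30),
  ("B", 3, 5)
).toDF("Category", "ID", "Value")

// Step 1: one row per (Category, ID) with that ID's sum.
val perId = data
  .groupBy($"Category", $"ID")
  .agg(sum($"Value").as("idSum"))

// Step 2: divide each ID's sum by the total of its category,
// computed with a window over the already aggregated rows.
val withRatio = perId
  .withColumn("ratio", $"idSum" / sum($"idSum").over(Window.partitionBy($"Category")))
  .orderBy($"Category", $"ID")

withRatio.show()
```

To restrict the sums to a date range, the `sum($"Value")` in step 1 could be replaced by a conditional sum using `when`, in the same way as in the answer above.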
1. is there a simple way to get a proper result for #2 (the latest record within defined time range)?
2. combine the previous question and condition `if last = null && prev != null then 0 else if last = null && prev = null then null else last` - example #3?
3. how to exclude records as per example #4?
