Scala: Get the sum of a column in DF1 based on the date ranges in DF2 in Spark


I have two dataframes. For each startDate/endDate range in dataframe 2, I want to get the sum of the values in dataframe 1 that fall within that range, and sort the results from largest to smallest, in Spark.


The resulting output should add a sum_value column to the df_dates dataframe. I really don't know where to start; I searched online but couldn't find a solution.
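
The question does not show the input data, but based on the column names used in the answer below, the two dataframes presumably look something like this (the rows are made up purely for illustration; only the names df, df_dates, date, value, startDate and endDate come from the answer's code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("sum-by-date-range").master("local[*]").getOrCreate()
import spark.implicits._

// Dataframe 1: one value per date (illustrative rows only)
val df = Seq(
  ("2019-01-01", 100L),
  ("2019-01-02", 120L),
  ("2019-01-05", 300L),
  ("2019-01-06", 250L)
).toDF("date", "value")
 .withColumn("date", to_date($"date"))

// Dataframe 2: the date ranges to aggregate over (illustrative rows only)
val df_dates = Seq(
  ("2019-01-01", "2019-01-04"),
  ("2019-01-05", "2019-01-08")
).toDF("startDate", "endDate")
 .withColumn("startDate", to_date($"startDate"))
 .withColumn("endDate", to_date($"endDate"))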

You first have to associate the dated values with the date ranges, then aggregate:

df_dates
  // attach each dated value to the range(s) it falls in
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  // sum the values per range
  .groupBy($"startDate", $"endDate")
  .agg(sum($"value").as("sum_value"))
  // order the ranges from largest to smallest sum
  .orderBy($"sum_value".desc)
  .show()

+----------+----------+---------+
| startDate|   endDate|sum_value|
+----------+----------+---------+
|2019-01-05|2019-01-08|      984|
|2019-01-09|2019-01-12|      681|
|2019-01-13|2019-01-16|      568|
|2019-01-01|2019-01-04|      408|
+----------+----------+---------+
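
A note on the left join in the snippet above: a range in df_dates that matches no dates in df still appears in the result, but with a null sum_value. If zeros are preferred for such ranges, one possible follow-up (a sketch, not part of the original answer) is to fill the nulls before sorting:

df_dates
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  .groupBy($"startDate", $"endDate")
  .agg(sum($"value").as("sum_value"))
  .na.fill(0, Seq("sum_value"))   // empty ranges get 0 instead of null
  .orderBy($"sum_value".desc)
  .show()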