Fix a query to resolve a character and/or string comparison issue in Scala Databricks 2.4.3


I have processed the parquet files and created the following dataframe in Scala Spark 2.4.3:

+-----------+------------+-----------+--------------+-----------+
|     itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
|    7512365|           2|       2014|         110.0| 2014-02-01|
|    7519278|           4|       2013|          96.0| 2013-04-01|
|5436134-070|          12|       2013|           8.0| 2013-12-01|
|    7547385|           1|       2014|          89.0| 2014-01-01|
|    0453978|           9|       2014|          18.0| 2014-09-01|
|    7558402|          10|       2014|         260.0| 2014-10-01|
|5437662-070|           7|       2013|          78.0| 2013-07-01|
|    3089858|          11|       2014|           5.0| 2014-11-01|
|    7181584|           2|       2017|           4.0| 2017-02-01|
|    7081417|           3|       2017|          15.0| 2017-03-01|
|    5814215|           4|       2017|          35.0| 2017-04-01|
|    7178940|          10|       2014|           5.0| 2014-10-01|
|    0450636|           1|       2015|           7.0| 2015-01-01|
|    5133406|           5|       2014|          46.0| 2014-05-01|
|    2204858|          12|       2015|          34.0| 2015-12-01|
|    1824299|           5|       2015|           1.0| 2015-05-01|
|5437474-620|           8|       2015|           4.0| 2015-08-01|
|    3086317|           9|       2014|           1.0| 2014-09-01|
|    2204331|           3|       2015|           2.0| 2015-03-01|
|    5334160|           1|       2018|           2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+
To derive a new feature, I am trying to apply some logic and rearrange the dataframe as follows:

itemno - as it is in the above-mentioned dataframe

startDate - the start of the season

endDate - the end of the season

totalRequested - number of parts requested in that season

percentageOfRequests - totalRequested in the current season / total over this plus the 3 previous seasons (4 seasons in total)

// season dates for reference
Spring: 1 March to 31 May.

Summer: 1 June to 31 August.

Autumn: 1 September to 30 November.

Winter: 1 December to 28 February.
What I did:

itemno  startDateOfSeason  endDateOfSeason  season  sum_totalRequestedBySeason  (totalRequested in current season / totalRequested in last 3 + current seasons)
123     12/01/2018         02/28/2019       winter  12                          12 / (12 + 36)  (36 from the previous three seasons)
123     03/01/2019         05/31/2019       spring  24                          24 / (24 + 45)  (45 from the previous three seasons)
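For the winter row above, 12 parts were requested in the current season and 36 over the three previous seasons, so the ratio is 12 / (12 + 36) = 0.25.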
I tried two approaches. First:

case 
            when to_char(StartDate,'MMDD') between '0301' and '0531' then 'spring'
           .....
           .....
           end as season
but it did not work. I had used this character-based logic in Oracle DB, where it worked, but after looking around I found that Spark SQL does not have this function. Also, I tried:

import org.apache.spark.sql.functions._

// month/day are not zero-padded here, e.g. 2014-02-01 -> "2-1"
val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))
// register as a temp view so the %sql cell below can query it
dateDF1.createOrReplaceTempView("temporal")

%sql
select distinct requestDate, MMDD, 
case 
           when MMDD between '3-1' and '5-31' then 'Spring' 
           when MMDD between '6-1' and '8-31' then 'Summer' 
           when MMDD between '9-1' and '11-30' then 'Autumn' 
           when MMDD between '12-1' and '2-28' then 'Winter'
 end as season
from temporal
and it did not work either. Could you let me know what I am missing here (my guess is that I cannot compare strings like this, but I am not sure, which is why I am asking), and how I can fix it?
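For reference, the second attempt fails because concat_ws produces unpadded strings that compare lexicographically ('12-1' sorts before '2-28' since '1' < '2', and '5-4' sorts after '5-31'), and the Winter range wraps the year boundary, so no single BETWEEN can match it. A minimal sketch of a zero-padded variant using date_format (Spark SQL's closest analogue to Oracle's TO_CHAR), assuming the temporal view above:

%sql
select distinct requestDate,
case
       when date_format(requestDate, 'MMdd') between '0301' and '0531' then 'Spring'
       when date_format(requestDate, 'MMdd') between '0601' and '0831' then 'Summer'
       when date_format(requestDate, 'MMdd') between '0901' and '1130' then 'Autumn'
       else 'Winter' -- 1 December through 28/29 February wraps the year, so it cannot be a single BETWEEN
end as season
from temporal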

After jxc's solution #1 with rangeBetween:

Since I was seeing some inconsistency, I am sharing the dataframe again. Here is the dataframe seasonDF12:

+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000|       2011|           0.0|Winter|    201075|
|0450000|       2011|           0.0|Winter|    201075|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Winter|    201175|
|0450000|       2012|           3.0|Winter|    201175|
|0450000|       2012|           1.0|Winter|    201175|
|0450000|       2012|           4.0|Spring|    201200|
|0450000|       2012|           0.0|Spring|    201200|
|0450000|       2012|           0.0|Spring|    201200|
|0450000|       2012|           2.0|Summer|    201225|
|0450000|       2012|           3.0|Summer|    201225|
|0450000|       2012|           2.0|Summer|    201225|
+-------+-----------+--------------+------+----------+
to which I am applying:

val seasonDF2 = seasonDF12.selectExpr("*", """
                                      sum(totalRequested) OVER (
                                          PARTITION BY itemno
                                          ORDER BY seasonCalc
                                          RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
                                      ) AS sum_totalRequested

                                   """)
and I see 40 in the first entries of the sum_totalRequested column, even though all of the totalRequested entries above them are 0. Not sure where the 40 comes from. I think I have already shared it, but I need to transform the above dataframe into:

itemno  startDateOfSeason  endDateOfSeason  sum_totalRequestedBySeason  (totalRequested in current season / totalRequested in last 3 + current seasons)

The final output is as follows:

itemno  startDateOfSeason  endDateOfSeason  season  sum_totalRequestedBySeason  (totalRequested in current season / totalRequested in last 3 + current seasons)
123     12/01/2018         02/28/2019       winter  12                          12 / (12 + 36)  (36 from the previous three seasons)
123     03/01/2019         05/31/2019       spring  24                          24 / (24 + 45)  (45 from the previous three seasons)

Edit-2: Adjusted to calculate the groupby sum by season first and then the window aggregate sum. With a RANGE frame, rows that share the same ordering value are peers and enter the frame together, so running the window sum directly over the raw request rows mixes all lines of a season into each row's total; aggregating to one row per itemno and season first avoids this.

Edit-1: Per the comments, specifying the season name is not needed. We can set Spring, Summer, Autumn and Winter to 0, 25, 50 and 75 respectively, so that the season becomes an integer added to year(requestDate)*100, and we can then use RANGE in the window aggregate functions (offset for the current plus three previous seasons = -100):
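As a worked example of the encoding: autumn 2014 gets the label 2014*100 + 50 = 201450, and its three previous seasons are summer 201425, spring 201400 and winter 201375 (January and February requests are assigned to the previous year's winter label, as the sample output below shows). Consecutive seasons always sit 25 apart, so RANGE BETWEEN 75 PRECEDING AND CURRENT ROW covers exactly the current plus three previous seasons; since the frame is inclusive at both ends, the 100 PRECEDING used in the windows below also reaches the same season from one year earlier (201350 in this example).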

Note: the following is PySpark code.
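A hedged Spark SQL approximation of that step (the answer's actual PySpark may differ; temporal is the temp view from the question, and the start/end-of-season expressions are inferred from the sample output below):

SELECT itemno,
       min(startDate)      AS startDateOfSeason,
       max(endDate)        AS endDateOfSeason,
       CASE label % 100 WHEN 0  THEN 'spring'
                        WHEN 25 THEN 'summer'
                        WHEN 50 THEN 'autumn'
                        ELSE         'winter' END AS season,
       label,
       sum(totalRequested) AS sum_totalRequestedBySeason
FROM (
  SELECT *,
         -- year*100 plus the season offset; Jan/Feb belong to the previous year's winter
         int(year(requestDate)) * 100 + CASE
           WHEN month(requestDate) IN (3, 4, 5)   THEN 0    -- spring
           WHEN month(requestDate) IN (6, 7, 8)   THEN 25   -- summer
           WHEN month(requestDate) IN (9, 10, 11) THEN 50   -- autumn
           WHEN month(requestDate) = 12           THEN 75   -- winter (December)
           ELSE -25                                         -- winter (Jan/Feb) = (year-1)*100 + 75
         END AS label,
         -- season start/end clipped to the calendar year, matching the sample output
         CASE
           WHEN month(requestDate) IN (3, 4, 5)   THEN to_date(concat(year(requestDate), '-03-01'))
           WHEN month(requestDate) IN (6, 7, 8)   THEN to_date(concat(year(requestDate), '-06-01'))
           WHEN month(requestDate) IN (9, 10, 11) THEN to_date(concat(year(requestDate), '-09-01'))
           WHEN month(requestDate) = 12           THEN to_date(concat(year(requestDate), '-12-01'))
           ELSE to_date(concat(year(requestDate), '-01-01'))
         END AS startDate,
         CASE
           WHEN month(requestDate) IN (3, 4, 5)   THEN to_date(concat(year(requestDate), '-05-31'))
           WHEN month(requestDate) IN (6, 7, 8)   THEN to_date(concat(year(requestDate), '-08-31'))
           WHEN month(requestDate) IN (9, 10, 11) THEN to_date(concat(year(requestDate), '-11-30'))
           WHEN month(requestDate) = 12           THEN to_date(concat(year(requestDate), '-12-31'))
           ELSE last_day(to_date(concat(year(requestDate), '-02-01')))
         END AS endDate
  FROM temporal
) t
GROUP BY itemno, label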

This gives the following result:

df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
|     itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070|       2013-12-01|     2013-12-31|winter|201375|                       8.0|
|    1824299|       2015-03-01|     2015-05-31|spring|201500|                       1.0|
|    0453978|       2014-09-01|     2014-11-30|autumn|201450|                      18.0|
|    7181584|       2017-01-01|     2017-02-28|winter|201675|                       4.0|
|    7178940|       2014-09-01|     2014-11-30|autumn|201450|                       5.0|
|    7547385|       2014-01-01|     2014-02-28|winter|201375|                      89.0|
|    5814215|       2017-03-01|     2017-05-31|spring|201700|                      35.0|
|    3086317|       2014-09-01|     2014-11-30|autumn|201450|                       1.0|
|    0450636|       2015-01-01|     2015-02-28|winter|201475|                       7.0|
|    2204331|       2015-03-01|     2015-05-31|spring|201500|                       2.0|
|5437474-620|       2015-06-01|     2015-08-31|summer|201525|                       4.0|
|    5133406|       2014-03-01|     2014-05-31|spring|201400|                      46.0|
|    7081417|       2017-03-01|     2017-05-31|spring|201700|                      15.0|
|    7519278|       2013-03-01|     2013-05-31|spring|201300|                      96.0|
|    7558402|       2014-09-01|     2014-11-30|autumn|201450|                     260.0|
|    2204858|       2015-12-01|     2015-12-31|winter|201575|                      34.0|
|5437662-070|       2013-06-01|     2013-08-31|summer|201325|                      78.0|
|    5334160|       2018-01-01|     2018-02-28|winter|201775|                       2.0|
|    3089858|       2014-09-01|     2014-11-30|autumn|201450|                       5.0|
|    7512365|       2014-01-01|     2014-02-28|winter|201375|                     110.0|
+-----------+-----------------+---------------+------+------+--------------------------+
After we have the season totals, we use a window aggregate function to calculate the sum over the current plus previous 3 seasons, and then calculate the ratio:

df1.selectExpr("*", """

    round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (         
        PARTITION BY itemno         
        ORDER BY label         
        RANGE BETWEEN 100 PRECEDING AND CURRENT ROW         
    ),2) AS ratio_of_current_over_current_plus_past_3_seasons

""").show()

Comments:

Remove the date and use only the month, i.e. month(startDate) in (3, 4, 5) then 'Spring', etc.

@jxc: that works in this case, but I was wondering what to do if I needed to check MMDD, e.g. MMDD between 02-15 and 03-15, then …. I was confused, so I asked. Also, for the percentageOfRequests denominator, i.e. the current season plus the previous three seasons (four seasons in total), I used the simple logic of looking at the year and season to get the last three seasons, adding them to the current season, and then dividing. Is there a better way?

I think you just need to handle the year of a season differently; a named_struct of (year, season) is a simple way to keep both pieces of information, so you can filter/group the dataframe rows more easily:
+-------+------------+-----------+--------------+-----------+----+--------------+
| itemno|requestMonth|requestYear|totalRequested|requestDate|MMDD|        season|
+-------+------------+-----------+--------------+-----------+----+--------------+
|7512365|           2|       2014|         110.0| 2014-02-01| ...|[2013, winter]|
+-------+------------+-----------+--------------+-----------+----+--------------+
For example: when MMDD between '0301' and '0531' then CONCAT(year(requestDate), '1').

Or we can set 00 for Spring, 25 for Summer, 50 for Autumn and 75 for Winter, cast it to an integer and then use range; the offset for 4 seasons would be 100?? For example, use int(year(requestDate))*100 for Spring, int(year(requestDate))*100 + 25 for Summer, etc.? rangeBetween should work in this case. @SachinSharma, I will merge everything into one SQL and update the post in the next few minutes.
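For illustration, a minimal Spark SQL sketch of the named_struct idea from these comments (the year_season output column name is hypothetical; January and February are shifted to the previous year's winter, matching the [2013, winter] sample above):

select *,
       named_struct(
         'year',   year(requestDate) - if(month(requestDate) < 3, 1, 0), -- Jan/Feb belong to the previous winter
         'season', case when month(requestDate) in (3, 4, 5)   then 'spring'
                        when month(requestDate) in (6, 7, 8)   then 'summer'
                        when month(requestDate) in (9, 10, 11) then 'autumn'
                        else 'winter'
                   end
       ) as year_season
from temporal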