Fixing a query to resolve a char and/or string comparison issue in Scala Databricks 2.4.3
I have processed Parquet files and created the following data frame in Scala Spark 2.4.3:
+-----------+------------+-----------+--------------+-----------+
| itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
| 7512365| 2| 2014| 110.0| 2014-02-01|
| 7519278| 4| 2013| 96.0| 2013-04-01|
|5436134-070| 12| 2013| 8.0| 2013-12-01|
| 7547385| 1| 2014| 89.0| 2014-01-01|
| 0453978| 9| 2014| 18.0| 2014-09-01|
| 7558402| 10| 2014| 260.0| 2014-10-01|
|5437662-070| 7| 2013| 78.0| 2013-07-01|
| 3089858| 11| 2014| 5.0| 2014-11-01|
| 7181584| 2| 2017| 4.0| 2017-02-01|
| 7081417| 3| 2017| 15.0| 2017-03-01|
| 5814215| 4| 2017| 35.0| 2017-04-01|
| 7178940| 10| 2014| 5.0| 2014-10-01|
| 0450636| 1| 2015| 7.0| 2015-01-01|
| 5133406| 5| 2014| 46.0| 2014-05-01|
| 2204858| 12| 2015| 34.0| 2015-12-01|
| 1824299| 5| 2015| 1.0| 2015-05-01|
|5437474-620| 8| 2015| 4.0| 2015-08-01|
| 3086317| 9| 2014| 1.0| 2014-09-01|
| 2204331| 3| 2015| 2.0| 2015-03-01|
| 5334160| 1| 2018| 2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+
To derive a new feature, I am trying to apply some logic and rearrange the data frame as follows:
itemno – as it is in above-mentioned data frame
startDate - the start of the season
endDate - the end of the season
totalRequested - number of parts requested in that season
percentageOfRequests - totalRequested in current season / total over this plus 3 previous seasons (4 total seasons)
// season dates, for reference
Spring: 1 March to 31 May.
Summer: 1 June to 31 August.
Autumn: 1 September to 30 November.
Winter: 1 December to 28 February.
What I did:
itemno | startDateOfSeason | endDateOfSeason | season | sum_totalRequestedBySeason | percentageOfRequests (totalRequested in current season / totalRequested in last 3 + current season)
123 | 12/01/2018 | 02/28/2019 | winter | 12 | 12/(12+36) (36 from the previous three seasons)
123 | 03/01/2019 | 05/31/2019 | spring | 24 | 24/(24+45) (45 from the previous three seasons)
I tried two approaches. First:
case
when to_char(StartDate,'MMDD') between '0301' and '0531' then 'spring'
.....
.....
end as season
but it did not work. I had used this character-based logic in Oracle DB, where it works, but after looking around I found that Spark SQL does not have the to_char function. Second, I tried
import org.apache.spark.sql.functions._
val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))
%sql
select distinct requestDate, MMDD,
case
when MMDD between '3-1' and '5-31' then 'Spring'
when MMDD between '6-1' and '8-31' then 'Summer'
when MMDD between '9-1' and '11-30' then 'Autumn'
when MMDD between '12-1' and '2-28' then 'Winter'
end as season
from temporal
and it did not work either. Could you let me know what I am missing here (my guess is that I cannot compare strings like this, but I am not sure, so I am asking), and how I can fix it?
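The core issue is that `BETWEEN` on these strings compares lexicographically, and unpadded month-day strings do not sort in calendar order. A minimal sketch in plain Python (not Spark; values are illustrative) shows the failure and the zero-padded fix:

```python
# SQL's `s BETWEEN lo AND hi` on strings is a lexicographic comparison,
# so unpadded month-day strings like "10-5" do not sort in calendar order.

def between(s, lo, hi):
    """Mimic SQL string BETWEEN (inclusive, lexicographic)."""
    return lo <= s <= hi

# October 5 should fall in Autumn ('9-1' .. '11-30'), but the unpadded
# string "10-5" sorts before "9-1" because '1' < '9'.
print(between("10-5", "9-1", "11-30"))     # False: misclassified

# Zero-padding restores calendar ordering within a year.
print(between("10-05", "09-01", "11-30"))  # True

# Winter ('12-01' .. '02-28') wraps the year boundary, so a single
# BETWEEN can never match it; it needs two ranges OR'ed together.
print(between("01-15", "12-01", "02-28"))  # False even when padded
print(between("01-15", "12-01", "12-31")
      or between("01-15", "01-01", "02-28"))  # True
```

In Spark SQL the zero-padded key can be built with `date_format(requestDate, 'MMdd')` (or `'MM-dd'`) instead of concatenating `month` and `dayofmonth`, and winter still needs an OR of two ranges because it wraps the year boundary.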
Edit, after jxc's solution #1 (the RANGE BETWEEN approach): because I am seeing some inconsistencies, I am sharing the data frame again. Here is the data frame seasonDF12:
+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Winter| 201175|
|0450000| 2012| 3.0|Winter| 201175|
|0450000| 2012| 1.0|Winter| 201175|
|0450000| 2012| 4.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 2.0|Summer| 201225|
|0450000| 2012| 3.0|Summer| 201225|
|0450000| 2012| 2.0|Summer| 201225|
+-------+-----------+--------------+------+----------+
to which I apply:
val seasonDF2 = seasonDF12.selectExpr("*", """
sum(totalRequested) OVER (
PARTITION BY itemno
ORDER BY seasonCalc
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
) AS sum_totalRequested
""")
and I see a first value of 40 in the sum_totalRequested column even though all the entries above it are 0; I am not sure why it is 40. I think I already shared this, but I need to convert the above data frame into the final output below:

itemno | startDateOfSeason | endDateOfSeason | season | sum_totalRequestedBySeason | percentageOfRequests (totalRequested in current season / totalRequested in last 3 + current season)
123 | 12/01/2018 | 02/28/2019 | winter | 12 | 12/(12+36) (36 from the previous three seasons)
123 | 03/01/2019 | 05/31/2019 | spring | 24 | 24/(24+45) (45 from the previous three seasons)
Edit-2: adjusted to first calculate the sum grouped by season, and then the windowed aggregate sum.

Edit-1: per the comments, naming the season is not needed. We can map Spring, Summer, Autumn and Winter to the offsets 0, 25, 50 and 75 respectively, so a season becomes an integer: year(requestDate)*100 plus the offset. That way we can use a RANGE frame in the window aggregate function (current plus the previous three seasons = an offset of -100).

Note: the following is pyspark code. It gives this result:
df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
| itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070| 2013-12-01| 2013-12-31|winter|201375| 8.0|
| 1824299| 2015-03-01| 2015-05-31|spring|201500| 1.0|
| 0453978| 2014-09-01| 2014-11-30|autumn|201450| 18.0|
| 7181584| 2017-01-01| 2017-02-28|winter|201675| 4.0|
| 7178940| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7547385| 2014-01-01| 2014-02-28|winter|201375| 89.0|
| 5814215| 2017-03-01| 2017-05-31|spring|201700| 35.0|
| 3086317| 2014-09-01| 2014-11-30|autumn|201450| 1.0|
| 0450636| 2015-01-01| 2015-02-28|winter|201475| 7.0|
| 2204331| 2015-03-01| 2015-05-31|spring|201500| 2.0|
|5437474-620| 2015-06-01| 2015-08-31|summer|201525| 4.0|
| 5133406| 2014-03-01| 2014-05-31|spring|201400| 46.0|
| 7081417| 2017-03-01| 2017-05-31|spring|201700| 15.0|
| 7519278| 2013-03-01| 2013-05-31|spring|201300| 96.0|
| 7558402| 2014-09-01| 2014-11-30|autumn|201450| 260.0|
| 2204858| 2015-12-01| 2015-12-31|winter|201575| 34.0|
|5437662-070| 2013-06-01| 2013-08-31|summer|201325| 78.0|
| 5334160| 2018-01-01| 2018-02-28|winter|201775| 2.0|
| 3089858| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7512365| 2014-01-01| 2014-02-28|winter|201375| 110.0|
+-----------+-----------------+---------------+------+------+--------------------------+
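The `label` column above can be reproduced with a small pure-Python sketch (not Spark) of the encoding the answer describes; the function name is mine. The key detail is that January and February belong to the winter that started the previous December:

```python
from datetime import date

# Season offsets from the answer: spring=0, summer=25, autumn=50, winter=75.
# January/February belong to the winter that began the previous December,
# so their label uses year - 1.

def season_label(d):
    if d.month in (3, 4, 5):
        return "spring", d.year * 100
    if d.month in (6, 7, 8):
        return "summer", d.year * 100 + 25
    if d.month in (9, 10, 11):
        return "autumn", d.year * 100 + 50
    if d.month == 12:
        return "winter", d.year * 100 + 75
    # month 1 or 2: the winter that started last December
    return "winter", (d.year - 1) * 100 + 75

# Spot-check against rows of df1 above:
print(season_label(date(2014, 2, 1)))   # ('winter', 201375)
print(season_label(date(2013, 12, 1)))  # ('winter', 201375)
print(season_label(date(2017, 3, 1)))   # ('spring', 201700)
print(season_label(date(2014, 9, 1)))   # ('autumn', 201450)
```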
After we get the season totals, we use a window aggregate function to calculate the sum over the current plus the previous 3 seasons, and then compute the ratio:
df1.selectExpr("*", """
round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (
PARTITION BY itemno
ORDER BY label
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
),2) AS ratio_of_current_over_current_plus_past_3_seasons
""").show()
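As a sanity check of the ratio step, here is a plain-Python emulation (hypothetical per-season totals; only the winter row is tuned to match the 12/(12+36) worked example from the question):

```python
# Emulate the final ratio: each season's total divided by the windowed
# sum of totals whose label lies in [current - 100, current], matching
# the answer's RANGE BETWEEN 100 PRECEDING AND CURRENT ROW frame.

def season_ratios(totals, frame=100):
    """totals maps label -> sum_totalRequestedBySeason for one itemno."""
    return {label: round(cur / sum(v for l, v in totals.items()
                                   if label - frame <= l <= label), 2)
            for label, cur in totals.items()}

# Hypothetical item: winter 2018-19 (label 201875) has total 12, and the
# three seasons before it sum to 36, as in the question's example.
totals = {201800: 15.0, 201825: 10.0, 201850: 11.0, 201875: 12.0}

print(season_ratios(totals)[201875])  # 0.25, i.e. 12 / (12 + 36)
```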
Comments:

jxc: Remove the dates and just use the month, i.e. month(startDate) in (3,4,5) then 'Spring', etc.

OP: @jxc: that works in this case, but I was wondering what to do if one needs to check MMDD, e.g. MMDD between 02-15 and 03-15, then ... I was confused, so I asked. Also, for the percentageOfRequests denominator, i.e. this season plus the previous three (4 seasons total), I used a simple approach: look at the year and season, get the last three seasons, add them to the current season, and then divide. Is there a better way?

jxc: I think you just need to handle the year of a season differently; naming the year and season as a named_struct is a simple way to keep both pieces of information, so you can filter/group the data frame rows more easily, e.g. 7512365 | 2 | 2014 | 110.0 | 2014-02-01 | [2013, winter], for example: case when MMDD between '0301' and '0531' then concat(year(requestDate), '1')

jxc: Or we can use 00 for Spring, 25 for Summer, 50 for Autumn and 75 for Winter, cast it to an integer, and then use RANGE; the offset for 4 seasons would be 100, e.g. int(year(requestDate))*100 for Spring, int(year(requestDate))*100+25 for Summer, etc.? In that case rangeBetween should work. @SachinSharma, I will merge everything into one SQL and update the post in the next few minutes.