PySpark: get the previous month's value
I am trying to get the previous month's value. I used the lag function for this, but I am not getting the desired result. My data looks like this:
ut  cntr  src  Item  section  Year  Period  css  fct          ytd_1      ytd_1*fct    approach1  approach2
49  52    179  f     84       2019  1       63   0.616580311  5578.092   3439.341699  0          0
e4  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
49  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
a5  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
49  52    179  f     84       2019  2       63   0.080405405  18506.982  1488.061391  3439.341   5578.092
49  52    179  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
e4  52    187  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
e4  52    179  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
Code:

Can I get some help fetching the previous month's value, as shown in the approach2 column (the expected result)?

Check whether the following works for you. First, create the dataframe (Period 3 rows were added to validate the result; the extra columns are not considered):
from pyspark.sql import Window
from pyspark.sql import functions as func

# l1 holds the input rows shown in the table above
dfl1 = spark.createDataFrame(l1).toDF('ut','cntr','src','Item','section','Year','Period','css','fct','ytd_1','ytd_1*fct')
dfl1.show()
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
| ut|cntr|src|Item|section|Year|Period|css| fct| ytd_1| ytd_1*fct|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
| 49| 52|179| f| 84|2019| 1| 63|0.616580311| 5578.092|3439.341699|
| e4| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492|
| 49| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492|
| a5| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492|
| 49| 52|179| f| 84|2019| 2| 63|0.080405405|18506.982|1488.061391|
| 49| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478|
| e4| 52|187| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478|
| e4| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
Define the window. Here is the trick: we give a range of -1 to 0, so for each row the window covers the previous Period and the current one, and func.first always picks up the first value, i.e. the one from the previous Period.

From the description of RANGE in the Spark documentation: range-based boundaries are based on the actual value of the ORDER BY expression.

wl1 = Window.partitionBy(['Item','section','Year','css']).orderBy('Period').rangeBetween(-1, 0)

For Period 1 there is no previous period, so the first value in the window equals the row's own ytd_1; add a when() to mark those rows as 0:
dfl2 = dfl1.withColumn('Result', func.when(func.first(dfl1['ytd_1']).over(wl1) == dfl1['ytd_1'], func.lit(0)).otherwise(func.first(dfl1['ytd_1']).over(wl1)))
dfl2.orderBy('Period').show()
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
| ut|cntr|src|Item|section|Year|Period|css| fct| ytd_1| ytd_1*fct| Result|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
| e4| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492| 0.0|
| a5| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492| 0.0|
| 49| 52|179| f| 84|2019| 1| 63|0.616580311| 5578.092|3439.341699| 0.0|
| 49| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492| 0.0|
| 49| 52|179| f| 84|2019| 2| 63|0.080405405|18506.982|1488.061391| 5578.092|
| e4| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| 49| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| e4| 52|187| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|18506.982|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|18506.982|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
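Since running the Spark code above needs a live session, here is a minimal plain-Python sketch of the same logic, useful for checking the expected Result values by hand. The partition key is reduced to css (Item, section, and Year are constant in this sample), and the helper name prev_month_value is hypothetical, not part of the original code. Because ytd_1 is constant within a Period, "first value of the rangeBetween(-1, 0) window" is simply the previous Period's ytd_1, or 0 when there is no previous Period:

```python
rows = [
    # (ut, css, Period, ytd_1) — the sample data from the answer
    ("49", 63, 1, 5578.092),
    ("e4", 31, 1, 5578.092),
    ("49", 31, 1, 5578.092),
    ("a5", 31, 1, 5578.092),
    ("49", 63, 2, 18506.982),
    ("49", 31, 2, 18506.982),
    ("e4", 31, 2, 18506.982),
    ("e4", 31, 2, 18506.982),
    ("e4", 31, 3, 10006.982),
    ("e4", 31, 3, 10006.982),
]

# ytd_1 is constant per (css, Period), so a lookup table captures it
ytd_per_period = {(css, period): ytd for _, css, period, ytd in rows}

def prev_month_value(css, period):
    # 0.0 when the previous Period is missing from the partition,
    # matching the when(first == ytd_1, 0) branch of the Spark code
    return ytd_per_period.get((css, period - 1), 0.0)

results = [prev_month_value(css, period) for _, css, period, _ in rows]
# per row: 0.0 for Period 1, 5578.092 for Period 2, 18506.982 for Period 3
```

This reproduces the Result column above row by row (modulo ordering).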
Could you edit the question to align the source dataframe, and also give the expected output? You mentioned lag in the problem statement, but your code is using lead.

Sorry, that was my mistake.

The question is not clear about the actual dataframe and the expected result. Could you revise it to make it clearer? Also, why are there two sections in the partitionBy clause?
Complete code:

dfl1 = spark.createDataFrame(l1).toDF('ut','cntr','src','Item','section','Year','Period','css','fct','ytd_1','ytd_1*fct')
wl1 = Window.partitionBy(['Item','section','Year','css']).orderBy('Period').rangeBetween(-1, 0)
dfl2 = dfl1.withColumn('Result', func.when(func.first(dfl1['ytd_1']).over(wl1) == dfl1['ytd_1'], func.lit(0)).otherwise(func.first(dfl1['ytd_1']).over(wl1)))
dfl2.orderBy('Period').show()
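One caveat with rangeBetween(-1, 0): if the Period numbers have gaps (say a month with no data), the window finds no previous row and the Result falls back to 0. A lag-style alternative pairs each Period with its actual predecessor instead. Below is a plain-Python sketch with a hypothetical helper name; in Spark this would correspond to taking the distinct (partition, Period, ytd_1) rows, applying func.lag('ytd_1').over(Window.partitionBy(...).orderBy('Period')), and joining back:

```python
from collections import defaultdict

def lag_per_partition(ytd_by_period):
    """For each (css, Period) key, return the ytd_1 of the previous
    existing Period in that partition, or 0.0 for the first one."""
    by_css = defaultdict(list)
    for (css, period), ytd in ytd_by_period.items():
        by_css[css].append((period, ytd))
    prev = {}
    for css, pairs in by_css.items():
        pairs.sort()
        prev[(css, pairs[0][0])] = 0.0  # first period has no predecessor
        # walk consecutive (period, ytd) pairs, like lag() over ORDER BY Period
        for (p0, y0), (p1, _) in zip(pairs, pairs[1:]):
            prev[(css, p1)] = y0
    return prev

# note the gap between Period 2 and Period 4
sample = {(31, 1): 5578.092, (31, 2): 18506.982, (31, 4): 10006.982}
lagged = lag_per_partition(sample)
# lagged[(31, 4)] is 18506.982 even though Period 3 is missing
```

With contiguous periods both approaches agree; the difference only shows up across gaps.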