Getting the previous month's value in PySpark


I am trying to get the previous month's value. I used the lag function for this, but did not get the desired result.

ut  cntr  src  Item  section  Year  Period  css  fct          ytd_1      ytd_1*fct    approach1  approach2
49  52    179  f     84       2019  1       63   0.616580311  5578.092   3439.341699  0          0
e4  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
49  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
a5  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
49  52    179  f     84       2019  2       63   0.080405405  18506.982  1488.061391  3439.341   5578.092
49  52    179  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
e4  52    187  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
e4  52    179  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
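The rule behind approach2: for a row in Period p, take the ytd_1 value that belonged to Period p-1, and 0 when there is no previous period. A minimal plain-Python sketch of that expectation (not PySpark; just to pin down the rule, using the values from the table above):

```python
# ytd_1 is constant within a Period, so the rule reduces to a per-period lookup.
ytd_by_period = {1: 5578.092, 2: 18506.982}  # taken from the table above

def approach2(period):
    """Previous period's ytd_1, or 0 for the first period."""
    return ytd_by_period.get(period - 1, 0)

print(approach2(1))  # 0           (no previous period)
print(approach2(2))  # 5578.092    (Period 1's ytd_1)
```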
Code:


Could I get some help fetching the previous month's value, as shown in the approach2 column (the expected result)?

Please check whether the following works for you.

First, create the DataFrame (rows for Period 3 were added to validate the result; the other columns from the question are ignored):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# l1 is the list of row tuples matching the dfl1.show() output below
dfl1 = spark.createDataFrame(l1).toDF('ut','cntr','src','Item','section','Year','Period','css','fct','ytd_1','ytd_1*fct')

dfl1.show()
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
| ut|cntr|src|Item|section|Year|Period|css|        fct|    ytd_1|  ytd_1*fct|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
| 49|  52|179|   f|     84|2019|     1| 63|0.616580311| 5578.092|3439.341699|
| e4|  52|179|   f|     84|2019|     1| 31|0.248704663| 5578.092|1387.297492|
| 49|  52|179|   f|     84|2019|     1| 31|0.248704663| 5578.092|1387.297492|
| a5|  52|179|   f|     84|2019|     1| 31|0.248704663| 5578.092|1387.297492|
| 49|  52|179|   f|     84|2019|     2| 63|0.080405405|18506.982|1488.061391|
| 49|  52|179|   f|     84|2019|     2| 31|0.072297297|18506.982| 1338.00478|
| e4|  52|187|   f|     84|2019|     2| 31|0.072297297|18506.982| 1338.00478|
| e4|  52|179|   f|     84|2019|     2| 31|0.072297297|18506.982| 1338.00478|
| e4|  52|179|   f|     84|2019|     3| 31|0.072297297|10006.982| 1338.00478|
| e4|  52|179|   f|     84|2019|     3| 31|0.072297297|10006.982| 1338.00478|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+

Define the window. The trick is the frame rangeBetween(-1, 0): for every row it covers all rows whose Period lies in [current Period - 1, current Period], so func.first() over it returns the first ytd_1 of the previous Period:

wl1 = Window.partitionBy(['Item','section','Year','css']).orderBy('Period').rangeBetween(-1, 0)

From the Spark documentation on RANGE frames:

Range-based boundaries are based on the actual value of the ORDER BY expression(s).
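This distinction matters here: with rowsBetween the frame would be "the previous physical row", but with rangeBetween(-1, 0) the frame is "all rows whose Period value lies within [current - 1, current]", so func.first() lands on the previous Period no matter how many rows each period contains. A rough plain-Python emulation of that frame semantics (hypothetical helper, for illustration only):

```python
def range_frame(rows, order_key, lower, upper, current):
    """Emulate Spark's rangeBetween: keep rows whose ORDER BY value
    lies within [current + lower, current + upper]."""
    cur = order_key(current)
    return [r for r in rows if cur + lower <= order_key(r) <= cur + upper]

# Rows as (Period, ytd_1) pairs, mirroring the example data.
rows = [(1, 5578.092), (1, 5578.092), (2, 18506.982), (2, 18506.982)]

# For a Period-2 row, the frame spans Periods 1..2, and first() is Period 1's value.
current = (2, 18506.982)
frame = range_frame(rows, lambda r: r[0], -1, 0, current)
first_in_frame = frame[0][1]  # what func.first().over(wl1) would return
print(first_in_frame)  # 5578.092 -> the previous Period's ytd_1
```

For a Period-1 row the frame spans Periods 0..1, so first() equals the current row's own ytd_1, which is exactly the condition the answer uses to flag the first period as 0.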

For Period 1 there is no previous Period, so the first value in the frame equals the current row's ytd_1; wrap the expression in func.when() and flag those rows as 0:

dfl2 = dfl1.withColumn(
    'Result',
    func.when(func.first(dfl1['ytd_1']).over(wl1) == dfl1['ytd_1'], func.lit(0))
        .otherwise(func.first(dfl1['ytd_1']).over(wl1))
)

dfl2.orderBy('Period').show()

+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
| ut|cntr|src|Item|section|Year|Period|css|        fct|    ytd_1|  ytd_1*fct|   Result|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
| e4|  52|179|   f|     84|2019|     1| 31|0.248704663| 5578.092|1387.297492|      0.0|
| a5|  52|179|   f|     84|2019|     1| 31|0.248704663| 5578.092|1387.297492|      0.0|
| 49|  52|179|   f|     84|2019|     1| 63|0.616580311| 5578.092|3439.341699|      0.0|
| 49|  52|179|   f|     84|2019|     1| 31|0.248704663| 5578.092|1387.297492|      0.0|
| 49|  52|179|   f|     84|2019|     2| 63|0.080405405|18506.982|1488.061391| 5578.092|
| e4|  52|179|   f|     84|2019|     2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| 49|  52|179|   f|     84|2019|     2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| e4|  52|187|   f|     84|2019|     2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| e4|  52|179|   f|     84|2019|     3| 31|0.072297297|10006.982| 1338.00478|18506.982|
| e4|  52|179|   f|     84|2019|     3| 31|0.072297297|10006.982| 1338.00478|18506.982|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
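Because ytd_1 is constant within a Period, the whole Result column can be double-checked with a plain-Python reimplementation of the when/first-over-window rule (an illustration of the logic, not the PySpark execution):

```python
# (Period, ytd_1) pairs from the example data, Periods 1-3.
rows = [
    (1, 5578.092), (1, 5578.092), (1, 5578.092), (1, 5578.092),
    (2, 18506.982), (2, 18506.982), (2, 18506.982), (2, 18506.982),
    (3, 10006.982), (3, 10006.982),
]

# First ytd_1 seen for each Period, in Period order.
first_by_period = {}
for period, ytd in sorted(rows):
    first_by_period.setdefault(period, ytd)

results = []
for period, ytd in rows:
    prev = first_by_period.get(period - 1, ytd)  # the frame reaches back one Period
    # when(first == current, 0).otherwise(first): 0 in the first Period
    results.append(0.0 if prev == ytd else prev)

print(sorted(set(results)))  # [0.0, 5578.092, 18506.982]
```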

Comments:

Could you edit the question and align the source DataFrame? Also, please give the expected output. You mention lag in the problem statement, but your code is using lead.

Sorry, that was my mistake... My question was unclear about the actual DataFrame and the expected result.

Could you revise the question to make that clearer? Why are there two sections in the partitionBy clause?
For reference, the full code:

wl1 = Window.partitionBy(['Item','section','Year','css']).orderBy('Period').rangeBetween(-1, 0)
dfl2 = dfl1.withColumn('Result', func.when(func.first(dfl1['ytd_1']).over(wl1) == dfl1['ytd_1'], func.lit(0)).otherwise(func.first(dfl1['ytd_1']).over(wl1)))

dfl2.orderBy('Period').show()
