Apache Spark: dynamic window calculation
Tags: apache-spark, apache-spark-sql
Below is the sales data to be used for calculating the max price. Max price logic:
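Max(prices of the last 3 weeks)
For the first 3 weeks, where data for the preceding weeks is not available,
the max price is
max of (week 1, week 2, week 3).
In the example below, that is max of (rank 5, 6, 7).
How can the same be implemented with window functions in Spark?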
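Here is a solution using a PySpark window with lead and a udf. Note that I changed the prices for ranks 5, 6, 7 to 1, 2, 3 so they stand out from the other values in the explanation. The logic picks the values as you described.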
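from pyspark.sql import Window
from pyspark.sql.functions import array, coalesce, col, lead, udf
from pyspark.sql.types import IntegerType

# Assumes an existing SparkSession named `spark` (as in the pyspark shell)
max_price_udf = udf(lambda prices_list: max(prices_list), IntegerType())
df = spark.createDataFrame([(1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)], ["product_id", "week", "year","rank","price"])
# Most recent week first
window = Window.orderBy(col("year").desc(),col("week").desc())
# For x = 1..3 take the price x rows ahead; if that row is missing, fall back to the row at offset x - 3
df = df.withColumn("prices_list", array([coalesce(lead(col("price"),x, None).over(window),lead(col("price"),x-3, None).over(window)) for x in range(1, 4)]))
df = df.withColumn("max_price",max_price_udf(col("prices_list")))
df.show()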
Result:
+----------+----+----+----+-----+------------+---------+
|product_id|week|year|rank|price| prices_list|max_price|
+----------+----+----+----+-----+------------+---------+
|         1|   5|2019|   1|   20|[18, 21, 20]|       21|
|         2|   4|2019|   2|   18| [21, 20, 1]|       21|
|         3|   3|2019|   3|   21|  [20, 1, 2]|       20|
|         4|   2|2019|   4|   20|   [1, 2, 3]|        3|
|         5|   1|2019|   5|    1|   [2, 3, 1]|        3|
|         6|  52|2018|   6|    2|   [3, 1, 2]|        3|
|         7|  51|2018|   7|    3|   [1, 2, 3]|        3|
+----------+----+----+----+-----+------------+---------+
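To make the fallback behind prices_list easier to follow, here is a minimal plain-Python sketch of the same selection rule, with no Spark involved. The function name max_price_last_three is hypothetical and only for illustration. It assumes the prices are already ordered most recent week first, as the window above orders them.

```python
from typing import List


def max_price_last_three(prices: List[int]) -> List[int]:
    """Hypothetical helper; `prices` must be ordered most recent week first."""
    n = len(prices)
    result = []
    for i in range(n):
        picked = []
        for x in range(1, 4):
            # Mirrors coalesce(lead(price, x), lead(price, x - 3)):
            # prefer the row x positions ahead; if it does not exist,
            # fall back to the row at offset x - 3 (the current row or a row before it).
            for j in (i + x, i + x - 3):
                if 0 <= j < n:
                    picked.append(prices[j])
                    break
        # x = 3 always resolves to the current row at worst, so picked is never empty
        result.append(max(picked))
    return result


prices = [20, 18, 21, 20, 1, 2, 3]  # ranks 1..7 from the example
print(max_price_last_three(prices))  # [21, 21, 20, 3, 3, 3, 3]
```

The printed list matches the max_price column above, including the last three rows, where the window wraps back onto ranks 5, 6, 7.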
Here is the solution in Scala:
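import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Assumes a spark-shell session, where spark.implicits._ is already in scope

var df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
(3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
(5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
(7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")
val window = Window.orderBy($"year".desc, $"week".desc)
// greatest() replaces the udf: it takes the three candidate columns directly
df = df.withColumn("max_price", greatest((for (x <- 1 to 3) yield coalesce(lead(col("price"), x, null).over(window), lead(col("price"), x - 3, null).over(window))):_*))
df.show()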
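You can use SQL window functions combined with greatest(). When the window contains fewer than 3 following rows, you also have to consider the current row and, if needed, the rows before it. So compute lag1_price and lag2_price in the inner subquery. In the outer query, use the count_row value with greatest(): for count_row = 2 pass the current price, for count_row = 1 add lag1_price, and for count_row = 0 pass the current price, lag1_price and lag2_price, to get the maximum.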
Check this out:
val df = Seq((1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)).toDF("product_id", "week", "year","rank","price")
df.createOrReplaceTempView("sales")
val df2 = spark.sql("""
select product_id, week, year, price,
count(*) over(order by year desc, week desc rows between 1 following and 3 following ) as count_row,
lag(price) over(order by year desc, week desc ) as lag1_price,
sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding ) as lag2_price,
max(price) over(order by year desc, week desc rows between 1 following and 3 following ) as max_price1 from sales
""")
df2.show(false)
df2.createOrReplaceTempView("sales_inner")
spark.sql("""
select product_id, week, year, price,
case
when count_row=2 then greatest(price,max_price1)
when count_row=1 then greatest(price,lag1_price,max_price1)
when count_row=0 then greatest(price,lag1_price,lag2_price)
else max_price1
end as max_price
from sales_inner
""").show(false)
Results:
+----------+----+----+-----+---------+----------+----------+----------+
|product_id|week|year|price|count_row|lag1_price|lag2_price|max_price1|
+----------+----+----+-----+---------+----------+----------+----------+
|1         |5   |2019|20   |3        |null      |null      |21        |
|2         |4   |2019|18   |3        |20        |null      |21        |
|3         |3   |2019|21   |3        |18        |20        |20        |
|4         |2   |2019|20   |3        |21        |18        |3         |
|5         |1   |2019|1    |2        |20        |21        |3         |
|6         |52  |2018|2    |1        |1         |20        |3         |
|7         |51  |2018|3    |0        |2         |1         |null      |
+----------+----+----+-----+---------+----------+----------+----------+
+----------+----+----+-----+---------+
|product_id|week|year|price|max_price|
+----------+----+----+-----+---------+
|1         |5   |2019|20   |21       |
|2         |4   |2019|18   |21       |
|3         |3   |2019|21   |20       |
|4         |2   |2019|20   |3        |
|5         |1   |2019|1    |3        |
|6         |52  |2018|2    |3        |
|7         |51  |2018|3    |3        |
+----------+----+----+-----+---------+
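The count_row case logic of the SQL answer can also be traced outside Spark. The following is a rough plain-Python sketch, not the answer's own code. The names greatest and max_price_by_count are hypothetical helpers, and the prices are assumed to be ordered most recent week first. The greatest helper imitates Spark SQL's greatest() in that it skips nulls.

```python
from typing import List, Optional


def greatest(*values: Optional[int]) -> Optional[int]:
    """Hypothetical stand-in for Spark SQL greatest(): nulls (None) are skipped."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def max_price_by_count(prices: List[int]) -> List[Optional[int]]:
    """Hypothetical helper; `prices` must be ordered most recent week first."""
    out = []
    for i, price in enumerate(prices):
        following = prices[i + 1:i + 4]  # rows between 1 following and 3 following
        count_row = len(following)
        max_price1 = max(following) if following else None
        lag1_price = prices[i - 1] if i >= 1 else None
        lag2_price = prices[i - 2] if i >= 2 else None
        # Same branches as the CASE expression in the outer query
        if count_row == 2:
            out.append(greatest(price, max_price1))
        elif count_row == 1:
            out.append(greatest(price, lag1_price, max_price1))
        elif count_row == 0:
            out.append(greatest(price, lag1_price, lag2_price))
        else:
            out.append(max_price1)
    return out


print(max_price_by_count([20, 18, 21, 20, 1, 2, 3]))  # [21, 21, 20, 3, 3, 3, 3]
```

For the sample data this reproduces the max_price column of the second table, which is also what the lead/coalesce approach returns.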
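Comments:
Please update your sample input and expected output.
All the given columns are input; max_price is the output column that needs to be added to the dataset.
I have updated the question according to your dataset. Your answer seems correct, but I am not able to convert this line to Scala -> df = df.withColumn("prices_list", array([coalesce(lead(col("price"),x, None).over(window),lead(col("price"),x-3, None).over(window)) for x in range(1, 4)]))
Finally managed it in Scala using withColumn("maxPrice", array((1 to 3).map(e => coalesce(lead("price", e).over(windowSpec), lead("price", e - 3).over(windowSpec))): _*))
There is no need for a udf, though; you can use the greatest function instead of array to find the maximum.
I tried using max but could not get it to work. I did not know greatest was available.
I was about to update the answer with Scala. Glad you were able to do it!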