Apache Spark: dynamic window calculation

Below is the sales data used to compute the maximum price. Max-price logic:

max(price of the last 3 weeks)

For the starting 3 weeks, where the previous weeks' data is not available, the max price is

max of (week 1, week 2, week 3)

i.e. in the example below, max of (rank 5, 6, 7).

How can I achieve the same in Spark using window functions?


Here is a solution using a PySpark window with lead and a UDF.

Note that I changed the prices for ranks 5, 6 and 7 to 1, 2, 3 so they stand out from the other values in the explanation. The logic simply picks the values the way you described.

from pyspark.sql.functions import array, coalesce, col, lead, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

# UDF that picks the maximum out of the collected list of prices
max_price_udf = udf(lambda prices_list: max(prices_list), IntegerType())

df = spark.createDataFrame([(1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
                            (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
                            (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
                            (7, 51, 2018, 7, 3)], ["product_id", "week", "year", "rank", "price"])

# Order by time, newest week first
window = Window.orderBy(col("year").desc(), col("week").desc())

# Collect the prices of the 3 previous (older) weeks; where those are missing,
# the x-3 offset (a negative lead, i.e. a lag) falls back to the earliest weeks
df = df.withColumn("prices_list", array([coalesce(lead(col("price"), x, None).over(window),
                                                  lead(col("price"), x - 3, None).over(window))
                                          for x in range(1, 4)]))
df = df.withColumn("max_price", max_price_udf(col("prices_list")))

df.show()
which gives the following result:

+----------+----+----+----+-----+------------+---------+
|product_id|week|year|rank|price| prices_list|max_price|
+----------+----+----+----+-----+------------+---------+
|         1|   5|2019|   1|   20|[18, 21, 20]|       21|
|         2|   4|2019|   2|   18| [21, 20, 1]|       21|
|         3|   3|2019|   3|   21|  [20, 1, 2]|       20|
|         4|   2|2019|   4|   20|   [1, 2, 3]|        3|
|         5|   1|2019|   5|    1|   [2, 3, 1]|        3|
|         6|  52|2018|   6|    2|   [3, 1, 2]|        3|
|         7|  51|2018|   7|    3|   [1, 2, 3]|        3|
+----------+----+----+----+-----+------------+---------+
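
As noted in the comments at the end, the UDF is not strictly required. On Spark 2.4+ the built-in array_max can take the maximum of the prices_list array directly, and greatest over the three coalesced columns works as well (that is what the Scala answer below does). A minimal sketch, assuming the df, window and prices_list column built above:

from pyspark.sql.functions import array_max, col

# UDF-free variant (Spark 2.4+): take the maximum of the array column directly
df = df.withColumn("max_price", array_max(col("prices_list")))
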
Here is the solution in Scala:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

var df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
             (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
             (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
             (7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")

val window = Window.orderBy($"year".desc, $"week".desc)

// greatest() replaces the UDF: take the maximum of the 3 coalesced lead columns
df = df.withColumn("max_price", greatest((for (x <- 1 to 3) yield
  coalesce(lead(col("price"), x, null).over(window), lead(col("price"), x - 3, null).over(window))): _*))

df.show()

You can use SQL window functions combined with greatest(). When the window function sees fewer than 3 following rows, you have to consider the current row and even the preceding rows, so lag1_price and lag2_price are computed in an inner subquery. In the outer query you check the row-count value and call greatest(), adding the current price, lag1_price and lag2_price to the argument list as the count drops to 2, 1 and 0, to get the maximum.

Check this out:

import spark.implicits._

val df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
             (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
             (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
             (7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")

df.createOrReplaceTempView("sales")

val df2 = spark.sql("""
          select product_id, week, year, price,
          count(*) over(order by year desc, week desc rows between 1 following and 3 following  ) as count_row,
          lag(price) over(order by year desc, week desc ) as lag1_price,
          sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding ) as lag2_price,
          max(price) over(order by year desc, week desc rows between 1 following and 3 following  ) as max_price1 from sales
  """)
df2.show(false)
df2.createOrReplaceTempView("sales_inner")
spark.sql("""
          select product_id, week, year, price,
          case
             when count_row=2 then greatest(price,max_price1)
             when count_row=1 then greatest(price,lag1_price,max_price1)
             when count_row=0 then greatest(price,lag1_price,lag2_price)
             else  max_price1
          end as max_price
         from sales_inner
  """).show(false)
Result:

+----------+----+----+-----+---------+----------+----------+----------+
|product_id|week|year|price|count_row|lag1_price|lag2_price|max_price1|
+----------+----+----+-----+---------+----------+----------+----------+
|1         |5   |2019|20   |3        |null      |null      |21        |
|2         |4   |2019|18   |3        |20        |null      |21        |
|3         |3   |2019|21   |3        |18        |20        |20        |
|4         |2   |2019|20   |3        |21        |18        |3         |
|5         |1   |2019|1    |2        |20        |21        |3         |
|6         |52  |2018|2    |1        |1         |20        |3         |
|7         |51  |2018|3    |0        |2         |1         |null      |
+----------+----+----+-----+---------+----------+----------+----------+

+----------+----+----+-----+---------+
|product_id|week|year|price|max_price|
+----------+----+----+-----+---------+
|1         |5   |2019|20   |21       |
|2         |4   |2019|18   |21       |
|3         |3   |2019|21   |20       |
|4         |2   |2019|20   |3        |
|5         |1   |2019|1    |3        |
|6         |52  |2018|2    |3        |
|7         |51  |2018|3    |3        |
+----------+----+----+-----+---------+
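
The two temp-view steps above can also be folded into a single statement by nesting the first query as a subquery; a minimal sketch in PySpark syntax (assuming the sales view registered above), with the logic unchanged:

spark.sql("""
          select product_id, week, year, price,
          case
             when count_row=2 then greatest(price,max_price1)
             when count_row=1 then greatest(price,lag1_price,max_price1)
             when count_row=0 then greatest(price,lag1_price,lag2_price)
             else max_price1
          end as max_price
          from (
            select product_id, week, year, price,
            count(*) over(order by year desc, week desc rows between 1 following and 3 following) as count_row,
            lag(price) over(order by year desc, week desc) as lag1_price,
            sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding) as lag2_price,
            max(price) over(order by year desc, week desc rows between 1 following and 3 following) as max_price1
            from sales
          ) sales_inner
  """).show(truncate=False)
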

Comments:

Please update your example input and expected output.

All the given columns are the input; max_price is the output column that needs to be added to the dataset. I have updated the question with your dataset.

Your answer looks correct, but I am not able to convert this line to Scala: df = df.withColumn("prices_list", array([coalesce(lead(col("price"), x, None).over(window), lead(col("price"), x - 3, None).over(window)) for x in range(1, 4)]))

Finally managed it in Scala with withColumn("maxPrice", array((1 to 3).map(e => coalesce(lead("price", e).over(windowSpec), lead("price", e - 3).over(windowSpec))): _*)).

There is no need for the UDF, though; you can use the max function on the array to find the maximum.

I tried that but could not find it and did not know it was available; I was just about to update with the Scala version.

Glad you were able to get it working!