How to do sliding window ranking in Spark using Scala?
I have a dataset:
+-----+-------------------+---------------------+------------------+
|query|similar_queries |model_score |count |
+-----+-------------------+---------------------+------------------+
|shirt|funny shirt |0.0034038130658784866|189.0 |
|shirt|shirt womens |0.0019435265241921438|136.0 |
|shirt|watch |0.001097496453284101 |212.0 |
|shirt|necklace |6.694577024597908E-4 |151.0 |
|shirt|white shirt |0.0037413097560623485|217.0 |
|shirt|shoes |0.0022062579255572733|575.0 |
|shirt|crop top |9.065831060804897E-4 |173.0 |
|shirt|polo shirts for men|0.007706416273211698 |349.0 |
|shirt|shorts |0.002669621942466027 |200.0 |
|shirt|black shirt |0.03264296242546658 |114.0 |
+-----+-------------------+---------------------+------------------+
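I first ranked the dataset by count.
I am now trying to rank the rows using a rolling window of 4 rows, ranking by model score inside each window. For example:
First window: rows 1 to 4, ranked into a new column (new_col).
Second window: rows 5 to 8, ranked into the same new column (new_col).
Third window: row 9 to the remaining rows, ranked into the same new column (new_col).
Can anyone tell me how to achieve this with Spark and Scala? Are there any predefined functions I can use?
I tried:
lazy val MODEL_RANK = Window.partitionBy(col(QUERY))
.orderBy(col(MODEL_SCORE).desc).rowsBetween(0, 3)
But this gives me: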
sql.AnalysisException: Window Frame ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;
I also tried .rowsBetween(-3, 0), but that results in an error as well:
org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN 3 PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;
Since you have already computed count_rank, the next step is to find a way to group the rows into sets of four. This can be done as follows:
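import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// The $"..." syntax needs import spark.implicits._ (spark being your SparkSession).
// (count_rank - 1) / 4 is a double in Spark SQL; casting to IntegerType truncates it,
// so ranks 1 to 4 land in bucket 0, ranks 5 to 8 in bucket 1, and so on.
val ranked_data_grouped = ranked_data
.withColumn("bucket", (($"count_rank" - 1) / 4).cast(IntegerType))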
ranked_data_grouped will look like this:
+-----+-------------------+---------------------+------------------+----------+-------+
|query|similar_queries |model_score |count |count_rank|bucket |
+-----+-------------------+---------------------+------------------+----------+-------+
|shirt|shoes |0.0022062579255572733|575.0 |1 |0 |
|shirt|polo shirts for men|0.007706416273211698 |349.0 |2 |0 |
|shirt|white shirt |0.0037413097560623485|217.0 |3 |0 |
|shirt|watch |0.001097496453284101 |212.0 |4 |0 |
|shirt|shorts |0.002669621942466027 |200.0 |5 |1 |
|shirt|funny shirt |0.0034038130658784866|189.0 |6 |1 |
|shirt|crop top |9.065831060804897E-4 |173.0 |7 |1 |
|shirt|necklace |6.694577024597908E-4 |151.0 |8 |1 |
|shirt|shirt womens |0.0019435265241921438|136.0 |9 |2 |
|shirt|black shirt |0.03264296242546658 |114.0 |10 |2 |
+-----+-------------------+---------------------+------------------+----------+-------+
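The bucket arithmetic can be checked without a cluster. The sketch below is a plain-Scala illustration and not part of the Spark code in this answer: bucketOf is a hypothetical helper, and the window size of 4 is taken from the question. It uses integer division, which gives the same result as the cast(IntegerType) truncation for the positive ranks involved here.

```scala
object BucketSketch {
  // Hypothetical helper: maps a 1-based rank to a 0-based bucket of `size` rows.
  def bucketOf(rank: Int, size: Int = 4): Int = (rank - 1) / size

  def main(args: Array[String]): Unit = {
    // Ranks 1 to 10, matching the count_rank column in the table above.
    val buckets = (1 to 10).map(r => bucketOf(r))
    println(buckets.mkString(",")) // prints 0,0,0,0,1,1,1,1,2,2
  }
}
```

Subtracting 1 before dividing is what keeps rank 4 in bucket 0; without it, ranks 4 to 7 would share a bucket.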
Now all you have to do is partition by bucket and order by model_score:
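import org.apache.spark.sql.expressions.Window
// Rank rows inside each bucket by descending model score (with several queries, partition by query and bucket).
val output = ranked_data_grouped
.withColumn("finalRank", row_number().over(Window.partitionBy($"bucket").orderBy($"model_score".desc)))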
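What is the expected output DataFrame?
@ollik1 The expected output is: 1. polo shirts for men 2. white shirt 3. shoes 4. watch 5. funny shirt 6. shorts 7. shirt womens 8. crop top 9. black shirt 10.
But this does not give a final rank from 1 to n. It gives me 1..4, 1..4, and so on. Is there a way to get a final rank 1..n, i.e. 1..4 (bucket 0) followed by 5..8 (ranks 1 to 4 of bucket 1)?
I got it:
val output = ranked_data_grouped
.withColumn("finalRank", row_number().over(Window.partitionBy($"bucket").orderBy(col("model_score").desc)))
.withColumn("finalRank", row_number().over(Window.partitionBy($"query").orderBy(col("bucket"), col("finalRank"))))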
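For a running rank from 1 to n, a single window ordered first by bucket and then by descending model_score should be enough, which avoids the second withColumn pass used in the comment above. The following is only a sketch under assumptions: it needs Spark SQL on the classpath, starts a local SparkSession, assumes count_rank was produced with row_number (the question does not say which ranking function was used), and finalRank is a hypothetical helper name.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}

object FinalRankSketch {
  // Hypothetical helper: ranks rows by count, groups the ranks into buckets of
  // `size` rows, then numbers rows 1..n by (bucket, model_score desc) per query.
  def finalRank(df: DataFrame, size: Int = 4): DataFrame = {
    // Assumption: count_rank is a row_number over descending count.
    val byCount = Window.partitionBy(col("query")).orderBy(col("count").desc)
    // No rowsBetween here: ranking functions use their own fixed frame.
    val byBucketThenScore = Window
      .partitionBy(col("query"))
      .orderBy(col("bucket"), col("model_score").desc)

    df.withColumn("count_rank", row_number().over(byCount))
      .withColumn("bucket", floor((col("count_rank") - 1) / size))
      .withColumn("finalRank", row_number().over(byBucketThenScore))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("final-rank-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's dataset.
    val data = Seq(
      ("shirt", "funny shirt", 0.0034038130658784866, 189.0),
      ("shirt", "shirt womens", 0.0019435265241921438, 136.0),
      ("shirt", "watch", 0.001097496453284101, 212.0),
      ("shirt", "necklace", 6.694577024597908E-4, 151.0),
      ("shirt", "white shirt", 0.0037413097560623485, 217.0),
      ("shirt", "shoes", 0.0022062579255572733, 575.0),
      ("shirt", "crop top", 9.065831060804897E-4, 173.0),
      ("shirt", "polo shirts for men", 0.007706416273211698, 349.0),
      ("shirt", "shorts", 0.002669621942466027, 200.0),
      ("shirt", "black shirt", 0.03264296242546658, 114.0)
    ).toDF("query", "similar_queries", "model_score", "count")

    finalRank(data).orderBy("query", "finalRank").show(false)
    spark.stop()
  }
}
```

Because the bucket is the leading sort key, every row of bucket 0 is numbered before any row of bucket 1, so the ranks continue from 5 instead of restarting at 1. Rows with equal count or equal model_score are ordered arbitrarily by row_number; add a tie-breaking column to the orderBy if a stable order matters.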