Python Pyspark - Group by and select the N highest values
I have data like this:
+----------+----------+--------+
| Location | Product | Amount |
+----------+----------+--------+
| London | Fish | 307 |
| London | Chips | 291 |
| London | Beer | 147 |
| Paris | Baguettes| 217 |
| Paris | Cheese | 103 |
| Paris | Champagne| 74 |
+----------+----------+--------+
Of course there are many locations, and each location has many products. I want to end up with a dataframe like this:
+----------+---------------------+-------------------------+-------+-------------------------+
| Location | Most Common Product | 2nd Most Common Product |..... | Nth Most Common Product |
+----------+---------------------+-------------------------+-------+-------------------------+
| London | Fish | Chips | .... | something |
| Paris | Baguettes | Cheese | .... | something else |
+----------+---------------------+-------------------------+-------+-------------------------+
I have already worked out how to find the single most common product for each location.
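For example, a minimal sketch of one way to get it, assuming the dataframe is called df_data and has the columns shown above (not necessarily the asker's original code):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Keep only the highest-Amount product per location
w = Window.partitionBy("Location").orderBy(f.col("Amount").desc())
most_common = (df_data
               .withColumn("rn", f.row_number().over(w))
               .filter(f.col("rn") == 1)
               .select("Location", "Product"))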
To extend this to the N most common, I could create another dataframe with those rows removed, run the process again to get the second most common product, and join the results together by Location. With appropriate column naming this could be wrapped in a loop that runs N times, adding one column per iteration, as sketched below.
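Roughly, that loop would look something like this (a hypothetical sketch building on the snippet above; n, the product_i column names and the anti-join used to drop picked rows are illustrative, not the asker's actual code):

# Iteratively pick the top product per location, then remove it and repeat
result = df_data.select("Location").distinct()
remaining = df_data
for i in range(1, n + 1):                      # n = number of ranks wanted
    top = (remaining
           .withColumn("rn", f.row_number().over(w))
           .filter(f.col("rn") == 1)
           .select("Location", f.col("Product").alias("product_" + str(i))))
    result = result.join(top, on="Location", how="left")
    # Drop the rows just picked before the next iteration
    remaining = remaining.join(
        top.withColumnRenamed("product_" + str(i), "Product"),
        on=["Location", "Product"], how="left_anti")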
However, this would be very slow, because it would repartition and join the data on every iteration. How can I get, say, the 50 most common products per location in a better way?

You can use pivot -
First you need to create a row number, and then apply the pivot based on that row number -
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Rank the products within each location by Amount, highest first
w = Window.partitionBy("Location").orderBy(f.col("Amount").desc())
df_ranked = df_data.withColumn("row_number", f.row_number().over(w))

# Pivot the rank into columns, keeping the product name for each rank
(df_ranked
 .groupBy("Location")
 .pivot("row_number")
 .agg(f.first("Product"))
 .show())
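The pivot above creates one column per distinct row_number value. To keep only the top N products per location (e.g. 50), one option, sketched here assuming the df_ranked dataframe built above, is to filter on the row number before pivoting:

# Keep only the 50 highest-Amount products per location before pivoting
top_n = 50
(df_ranked
 .filter(f.col("row_number") <= top_n)
 .groupBy("Location")
 .pivot("row_number")
 .agg(f.first("Product"))
 .show())

Passing the expected pivot values explicitly, e.g. .pivot("row_number", list(range(1, top_n + 1))), also lets Spark skip the extra pass it otherwise needs to discover the distinct values.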