Python PySpark - Group by and select the N highest values

I have data like this:

+----------+----------+--------+
| Location | Product  | Amount |
+----------+----------+--------+
| London   | Fish     |    307 |
| London   | Chips    |    291 |
| London   | Beer     |    147 |
| Paris    | Baguettes|    217 |
| Paris    | Cheese   |    103 |
| Paris    | Champagne|     74 |
+----------+----------+--------+
There are, of course, many locations, each with many products. I want to end up with a DataFrame like this:

+----------+---------------------+-------------------------+-------+-------------------------+
| Location | Most Common Product | 2nd Most Common Product |.....  | Nth Most Common Product |
+----------+---------------------+-------------------------+-------+-------------------------+
| London   | Fish                | Chips                   | ....  |     something           |
| Paris    | Baguettes           | Cheese                  | ....  |     something else      |
+----------+---------------------+-------------------------+-------+-------------------------+
I have already found the most common product for each location.

I could extend this to the N most common by creating another DataFrame, dropping those rows, running the process again to get the second most common, and joining the results back together by location. With suitable column naming, that could be wrapped in a loop that runs N times, adding one column per iteration (roughly as in the sketch below).
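Just an illustration of what I mean, assuming the df_data and column names from the table above; Product_1, Product_2, ... are placeholder names:

from pyspark.sql import functions as f

N = 3  # number of "most common" columns to build
remaining = df_data
result = df_data.select("Location").distinct()

for i in range(1, N + 1):
    # Pick the highest-Amount product still remaining in each location
    # (ties broken arbitrarily by dropDuplicates)
    top = (remaining
           .groupBy("Location")
           .agg(f.max("Amount").alias("Amount"))
           .join(remaining, ["Location", "Amount"])
           .dropDuplicates(["Location"])
           .select("Location", "Product"))

    # Attach it as the i-th "most common product" column
    result = result.join(
        top.withColumnRenamed("Product", f"Product_{i}"),
        "Location", "left")

    # Drop the rows just used before the next pass
    remaining = remaining.join(top, ["Location", "Product"], "left_anti")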

However, that would be very slow, because it would repartition and join on every iteration. How can I get, say, the 50 most common products per location in a better way?

You can use a pivot.

First create a row number per location, then pivot on that row number:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Rank the products within each location by Amount, highest first
df_ranked = df_data.withColumn(
    "row_number",
    f.row_number().over(
        Window.partitionBy("Location").orderBy(f.col("Amount").desc())
    ),
)

# Pivot the rank into columns, keeping the product name for each rank
(df_ranked
    .groupby("Location")
    .pivot("row_number")
    .agg(f.first("Product"))
    .show())
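
If you only need the top N per location (50 in your case), a variant like the sketch below first filters on the rank and also passes the expected rank values to pivot, which saves Spark an extra pass to discover them. It reuses the df_ranked name from above; rename the resulting 1, 2, ... columns as you like afterwards.

N = 3  # e.g. 50 in your case

(df_ranked
    .filter(f.col("row_number") <= N)          # keep only the top N per location
    .groupby("Location")
    .pivot("row_number", list(range(1, N + 1)))  # explicit pivot values
    .agg(f.first("Product"))
    .show())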