Is there an efficient approach in PySpark similar to SQL window functions?

Tags: performance, pyspark, bigdata, pyspark-sql

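I am working with a huge DataFrame that has 3 columns and 5 billion rows. The data is 360 GB in size. For the analysis I use the following setup:

- Jupyter notebooks running on an AWS r4.16xlarge
- PySpark kernel

The table, named customer_sales, looks like the following example: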

    +--------------------+----------+-------+
    |  business_unit_id  | customer | sales |
    +--------------------+----------+-------+
    |                  1 |        a |  5000 |
    |                  1 |        b |  2000 |
    |                  1 |        c |  3000 |
    |                  1 |        d |  5000 |
    |                  2 |        f |   600 |
    |                  2 |        c |  7000 |
    |                  3 |        j |   200 |
    |                  3 |        k |   800 |
    |                  3 |        c |  4500 |
    +--------------------+----------+-------+

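Now, for each business_unit_id, I want to get the customer with the highest sales. If several customers are tied on sales, I want to get all of them. The result should be stored in a table named best_customers_for_each_unit. For the example above, the table best_customers_for_each_unit looks like this:
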
    +--------------------+----------+-------+
    |  business_unit_id  | customer | sales |
    +--------------------+----------+-------+
    |                  1 |        a |  5000 |
    |                  1 |        d |  5000 |
    |                  2 |        c |  7000 |
    |                  3 |        c |  4500 |
    +--------------------+----------+-------+

In a second step, I want to count how often each customer has the highest sales within a business unit. The output of this query would be:

    +----------+-------+
    | customer | count |
    +----------+-------+
    |        a |     1 |
    |        d |     1 |
    |        c |     2 |
    +----------+-------+
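
To pin down the intended semantics independently of Spark, the two steps can be sketched in plain Python on the sample rows. This is only a reference for the logic (ties are kept, then top spots are counted per customer) and is not meant to scale to 5 billion rows; the helper names best_customers and top_counts are made up for this illustration.

```python
from collections import Counter

# Sample rows from the customer_sales example: (business_unit_id, customer, sales)
rows = [
    (1, "a", 5000), (1, "b", 2000), (1, "c", 3000), (1, "d", 5000),
    (2, "f", 600), (2, "c", 7000),
    (3, "j", 200), (3, "k", 800), (3, "c", 4500),
]


def best_customers(rows):
    """Return every row whose sales equal the maximum of its business unit (ties kept)."""
    max_sales = {}
    for unit, _customer, sales in rows:
        max_sales[unit] = max(sales, max_sales.get(unit, sales))
    return [row for row in rows if row[2] == max_sales[row[0]]]


def top_counts(rows):
    """Count in how many business units each customer has the highest sales."""
    return Counter(customer for _unit, customer, _sales in best_customers(rows))


best = best_customers(rows)
counts = top_counts(rows)
print(best)
print(dict(counts))
```

On the sample data the customers kept in the first step are a and d for unit 1 and c for units 2 and 3, so the second step counts a once, d once and c twice.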

For the first query I use spark.sql with a window function. The query looks like this:

     best_customers_for_each_unit = spark.sql("""
         SELECT
             business_unit_id,
             customer,
             sales
         FROM (
             SELECT
                 business_unit_id,
                 customer,
                 sales,
                 dense_rank() OVER (PARTITION BY business_unit_id ORDER BY sales DESC) AS rank
             FROM customer_sales
         ) tmp
         WHERE rank = 1
     """)

For the second query I used the following PySpark snippet:

     best_customers_for_each_unit.groupBy("customer").count()

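My queries do work, but they take a very long time even on a small part of the data. Do you know an efficient way to run this kind of query with PySpark?

Regards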
Can you post the execution plan? If business_unit_id contains many unique values, this should be quite fast. How many partitions are you using? Repartitioning into more partitions might help.

@FokkoDriesprong What do you mean by the execution plan? Do you mean the output of the function explain()?