PostgreSQL / PySpark SQL: avg returns a nonzero value when averaging 0 and nulls
I'm running the code below in PySpark. It builds a final product_cycle_days field by averaging the subquery's product_cycle_days field, grouped by several columns. The problem: when I look at the unaggregated subquery, the only values of product_cycle_days are 0 or null, yet after the average the final query returns 0.5. I'm translating this code from PostgreSQL, where the final query returns 0, so I'm trying to work out why PySpark SQL returns 0.5. The final query's output is shown below, followed by the subquery and its results.

Output of the final query:
+----------+-------+---------+-----------+--------------+----+-----------------------------------+--------------------+------------------+
|dateclosed|storeid|brandname|producttype|productsubtype|size|product_id                         |product_repeat_gross|product_cycle_days|
+----------+-------+---------+-----------+--------------+----+-----------------------------------+--------------------+------------------+
|2020-01-02|105    |LABS     |SHIRT      |PREPACK       |XL  |c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|3.464462313624555E-5|0.5               |
+----------+-------+---------+-----------+--------------+----+-----------------------------------+--------------------+------------------+
Subquery:
sub_query = """select
    dateclosed, t.product_id, t.storeid, t.producttype, t.productsubtype, t.size, t.brandname,
    case
        when ticketid = first_value(ticketid) over (
                partition by t.product_id, customer_uuid
                order by dateclosed asc
                rows between unbounded preceding and unbounded following)
            then 0
        else grossreceipts
    end as product_repeat_gross,
    datediff(dateclosed,
        lag(dateclosed, 1) over (
            partition by t.brandname, customer_uuid, t.product_id
            order by dateclosed asc)
    ) as product_cycle_days
from t
where dateclosed = '2020-01-02'
  and brandname = 'LABS'
  and producttype = 'SHIRT'
  and productsubtype = 'PREPACK'
  and size = 'XL'
  and storeid = 105
  and product_id = 'c43a1a06-6a63-46ba-a476-xxxxxxxxxxx'"""
sub = spark.sql(sub_query)
sub.show(truncate=False)
+----------+-----------------------------------+-------+-----------+--------------+----+---------+--------------------+------------------+
|dateclosed|product_id                         |storeid|producttype|productsubtype|size|brandname|product_repeat_gross|product_cycle_days|
+----------+-----------------------------------+-------+-----------+--------------+----+---------+--------------------+------------------+
|2020-01-02|c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|105    |SHIRT      |PREPACK       |XL  |LABS     |0.0                 |null              |
|2020-01-02|c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|105    |SHIRT      |PREPACK       |XL  |LABS     |0.0                 |0                 |
|2020-01-02|c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|105    |SHIRT      |PREPACK       |XL  |LABS     |0.0                 |null              |
+----------+-----------------------------------+-------+-----------+--------------+----+---------+--------------------+------------------+
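For context on why those nulls appear: product_cycle_days comes from datediff(dateclosed, lag(dateclosed) over (...)), and lag returns NULL for the first row of each partition, so a customer's first purchase of a product always yields a null cycle. A minimal pure-Python sketch of that window semantics (the toy rows and the with_cycle_days helper are illustrative, not taken from the original data):

```python
from datetime import date

# Toy rows within one brand/product partition: two same-day tickets
# from one customer, plus one ticket from another customer.
rows = [
    ("cust-a", date(2020, 1, 2)),
    ("cust-a", date(2020, 1, 2)),
    ("cust-b", date(2020, 1, 2)),
]

def with_cycle_days(rows):
    """Mimic datediff(dateclosed, lag(dateclosed) over (partition by
    customer order by dateclosed)): the first row in each partition has
    no previous row, so its cycle is None (SQL NULL)."""
    out = []
    prev = {}  # customer -> previous dateclosed within the partition
    for cust, d in sorted(rows, key=lambda r: (r[0], r[1])):
        cycle = (d - prev[cust]).days if cust in prev else None
        out.append((cust, d, cycle))
        prev[cust] = d
    return out

for row in with_cycle_days(rows):
    print(row)
# Cycle values come out as None, 0, None -- each partition's first
# row is None, and a same-day repeat ticket gives a 0-day cycle.
```

This reproduces the null/0/null pattern in the subquery output above: the two nulls are first-purchase rows and the 0 is a same-day repeat.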
Comments:
- Hi, just wondering why there is a +(rand()/10000)? It's very likely unrelated, but I couldn't generalize from my own experience. Sorry, just curious.
- Please provide an example that illustrates your problem. Your query contains too many unnecessary details; keep only the small amount of code needed to show the issue.
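On the question itself: AVG in both PostgreSQL and Spark SQL skips NULL inputs, so the average of {null, 0, null} should be 0 in either engine. A pure-Python sketch of that semantics (sql_avg is an illustrative helper, not a real API):

```python
def sql_avg(values):
    """Mimic SQL AVG: NULL (None) inputs are skipped entirely, and
    AVG over nothing but NULLs is itself NULL (None)."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None

# The three product_cycle_days values from the subquery: null, 0, null.
print(sql_avg([None, 0, None]))  # 0.0 -- what PostgreSQL returns
print(sql_avg([None, None]))     # None -- AVG of only NULLs is NULL
```

Since Spark's avg follows the same NULL-skipping rule, a 0.5 result suggests the rows actually feeding the final aggregate differ from this subquery, e.g. extra or modified rows, or a nondeterministic expression such as the +(rand()/10000) mentioned in the comments; it is worth confirming that the final query's input matches this subquery exactly.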