PostgreSQL / PySpark SQL: avg returns a nonzero value when averaging 0 and nulls
I'm running the code below in PySpark. It builds a final product_cycle_days field by averaging the subquery's product_cycle_days field, grouped by several columns. The problem: when I look at the unaggregated subquery, the only values of product_cycle_days are 0 or null, yet after the average the final query returns 0.5. I'm translating this code from PostgreSQL, where the final query returns 0, so I'm trying to work out why PySpark SQL returns 0.5. The final query's output is shown below, followed by the subquery and its results.

Output of the final query:
+----------+-------+---------+-----------+--------------+----+-----------------------------------+--------------------+------------------+
|dateclosed|storeid|brandname|producttype|productsubtype|size|product_id                         |product_repeat_gross|product_cycle_days|
+----------+-------+---------+-----------+--------------+----+-----------------------------------+--------------------+------------------+
|2020-01-02|105    |LABS     |SHIRT      |PREPACK       |XL  |c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|3.464462313624555E-5|0.5               |
+----------+-------+---------+-----------+--------------+----+-----------------------------------+--------------------+------------------+
Subquery:
sub_query = """select
    dateclosed, t.product_id, t.storeid, t.producttype, t.productsubtype, t.size, t.brandname,
    case
        when ticketid = first_value(ticketid) over (
                partition by t.product_id, customer_uuid
                order by dateclosed asc
                rows between unbounded preceding and unbounded following)
            then 0
        else grossreceipts
    end as product_repeat_gross,
    datediff(dateclosed,
        lag(dateclosed, 1) over (
            partition by t.brandname, customer_uuid, t.product_id
            order by dateclosed asc)
    ) as product_cycle_days
from t
where dateclosed = '2020-01-02'
  and brandname = 'LABS'
  and producttype = 'SHIRT'
  and productsubtype = 'PREPACK'
  and size = 'XL'
  and storeid = 105
  and product_id = 'c43a1a06-6a63-46ba-a476-xxxxxxxxxxx'"""
sub = spark.sql(sub_query)
sub.show(truncate=False)
+----------+-----------------------------------+-------+-----------+--------------+----+---------+--------------------+------------------+
|dateclosed|product_id                         |storeid|producttype|productsubtype|size|brandname|product_repeat_gross|product_cycle_days|
+----------+-----------------------------------+-------+-----------+--------------+----+---------+--------------------+------------------+
|2020-01-02|c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|105    |SHIRT      |PREPACK       |XL  |LABS     |0.0                 |null              |
|2020-01-02|c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|105    |SHIRT      |PREPACK       |XL  |LABS     |0.0                 |0                 |
|2020-01-02|c43a1a06-6a63-46ba-a476-xxxxxxxxxxx|105    |SHIRT      |PREPACK       |XL  |LABS     |0.0                 |null              |
+----------+-----------------------------------+-------+-----------+--------------+----+---------+--------------------+------------------+
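For context on why those nulls appear: product_cycle_days comes from datediff(dateclosed, lag(dateclosed) over (...)), and lag returns NULL for the first row of each partition, so a customer's first purchase of a product always yields a null cycle. A minimal pure-Python sketch of that window semantics (the toy rows and the with_cycle_days helper are illustrative, not taken from the original data):

```python
from datetime import date

# Toy rows within one brand/product partition: two same-day tickets
# from one customer, plus one ticket from another customer.
rows = [
    ("cust-a", date(2020, 1, 2)),
    ("cust-a", date(2020, 1, 2)),
    ("cust-b", date(2020, 1, 2)),
]

def with_cycle_days(rows):
    """Mimic datediff(dateclosed, lag(dateclosed) over (partition by
    customer order by dateclosed)): the first row in each partition has
    no previous row, so its cycle is None (SQL NULL)."""
    out = []
    prev = {}  # customer -> previous dateclosed within the partition
    for cust, d in sorted(rows, key=lambda r: (r[0], r[1])):
        cycle = (d - prev[cust]).days if cust in prev else None
        out.append((cust, d, cycle))
        prev[cust] = d
    return out

for row in with_cycle_days(rows):
    print(row)
# Cycle values come out as None, 0, None -- each partition's first
# row is None, and a same-day repeat ticket gives a 0-day cycle.
```

This reproduces the null/0/null pattern in the subquery output above: the two nulls are first-purchase rows and the 0 is a same-day repeat.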
Comments:
- Hi, just wondering why there is a +(rand()/10000)? It's very likely unrelated, but I couldn't generalize from my own experience. Sorry, just curious.
- Please provide an example that illustrates your problem. Your query contains too many unnecessary details; keep only the small amount of code needed to show the issue.
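On the question itself: AVG in both PostgreSQL and Spark SQL skips NULL inputs, so the average of {null, 0, null} should be 0 in either engine. A pure-Python sketch of that semantics (sql_avg is an illustrative helper, not a real API):

```python
def sql_avg(values):
    """Mimic SQL AVG: NULL (None) inputs are skipped entirely, and
    AVG over nothing but NULLs is itself NULL (None)."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None

# The three product_cycle_days values from the subquery: null, 0, null.
print(sql_avg([None, 0, None]))  # 0.0 -- what PostgreSQL returns
print(sql_avg([None, None]))     # None -- AVG of only NULLs is NULL
```

Since Spark's avg follows the same NULL-skipping rule, a 0.5 result suggests the rows actually feeding the final aggregate differ from this subquery, e.g. extra or modified rows, or a nondeterministic expression such as the +(rand()/10000) mentioned in the comments; it is worth confirming that the final query's input matches this subquery exactly.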