SQL: How to assign a rank when the previous rank is zero (part 2)

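This is an extension of my previous question. That solution works very well in a Postgres environment, but now I need to replicate it in a Databricks environment (Spark SQL).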

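The problem is the same as before, but I am now trying to work out how to convert this Postgres query to Spark SQL. Basically, it sums up allocation amounts where there are gaps in the data, i.e. where there is no micro geo when grouping by location and geo3. The imputed allocation should add up to 1 for every location and zip3 group.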

Here is the Postgres query, which works well:

    select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from 
        (
        select ia.*,
               (case when has_micro_geo > 0
                     then sum(allocation) over (partition by location_code, geo3, grp)
                     else 0
                end) as imputed_allocation
        from (select s.*,
                     count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
              from staging_groups s
             ) ia
        )z

But it does not translate well, and it produces the following error in Databricks:

    Error in SQL statement: ParseException: 
    mismatched input 'from' expecting <EOF>(line 1, pos 78)

    == SQL ==
    select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from 
    ------------------------------------------------------------------------------^^^
        (
        select ia.*,
               (case when has_micro_geo > 0
                     then sum(allocation) over (partition by location_code, geo3, grp)
                     else 0
                end) as imputed_allocation
        from (select s.*,
                     count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
              from staging_groups s
             ) ia
        )z

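Or at least, how do I convert just the part of the inner query that creates grp? The rest should then work. I have been trying to replace this filter with something else, but my attempts have not produced the expected result: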

    select s.*,
    count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
    from staging_groups s

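Here is a db-fiddle with the data. It is currently set to Postgres, but again, I need to run this in a Spark SQL environment. I have tried breaking the query up and creating separate tables, but my groups do not come out as expected.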

Here is an image that shows the output more clearly: [image]

You need to rewrite this subquery:

    select s.*,
           count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
    from staging_groups s

I think the rest of the query should run fine in Spark SQL.

Since has_micro_geo is already a 0/1 flag, you can rewrite the count(*) filter (...) as a running sum over it:

    sum(has_micro_geo)
    over (partition by location_code, geo3
          order by distance_group desc
          rows unbounded preceding) as grp
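If has_micro_geo could ever hold values other than 0 and 1, a plain sum would no longer count rows. A conditional sum is a more general stand-in for count(*) filter (where ...). The sketch below is only illustrative: the inline rows are hypothetical sample data, not taken from the question, and they deliberately use flag values of 2 and 5 to show that only "non-zero" matters.

```sql
-- Hypothetical sample rows standing in for staging_groups.
with staging_groups as (
  select * from values
    ('A', '010', 3, 2),
    ('A', '010', 2, 0),
    ('A', '010', 1, 5)
  as t(location_code, geo3, distance_group, has_micro_geo)
)
select s.*,
       -- counts the non-zero rows seen so far, like count(*) filter (where has_micro_geo <> 0)
       sum(case when has_micro_geo <> 0 then 1 else 0 end)
         over (partition by location_code, geo3
               order by distance_group desc
               rows unbounded preceding) as grp
from staging_groups s
order by distance_group desc;
-- grp for distance_group 3, 2, 1 is 1, 1, 2
```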

I added rows unbounded preceding to avoid the default frame, range unbounded preceding, which may perform worse.
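Putting the pieces together, the full query might look like the sketch below. It assumes that has_micro_geo is strictly 0/1 and that distance_group is unique within each (location_code, geo3) partition; with ties, a rows frame and the default range frame can assign different grp values. The inline staging_groups rows are hypothetical sample data, with allocation expressed in whole percent so the sums are exact.

```sql
-- Hypothetical sample rows standing in for the real staging_groups table.
with staging_groups as (
  select * from values
    ('A', '010', 1, 1, 40),
    ('A', '010', 2, 0, 10),
    ('A', '010', 3, 0, 20),
    ('A', '010', 4, 1, 30)
  as t(location_code, geo3, distance_group, has_micro_geo, allocation)
)
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation
from (
  select ia.*,
         -- rows with a micro geo collect the allocation of their whole group
         case when has_micro_geo > 0
              then sum(allocation) over (partition by location_code, geo3, grp)
              else 0
         end as imputed_allocation
  from (
    select s.*,
           -- running count of flagged rows, walking from the farthest distance group
           sum(has_micro_geo) over (partition by location_code, geo3
                                    order by distance_group desc
                                    rows unbounded preceding) as grp
    from staging_groups s
  ) ia
) z
order by location_code, geo3, distance_group;
-- imputed_allocation for distance_group 1, 2, 3, 4 is 40, 0, 0, 60
```

In this sample, the two rows without a micro geo (distance groups 2 and 3) fall into the same grp as distance group 4, so their allocation is rolled up into that row and the partition still totals 100.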

By the way, I already wrote this in my comment on Gordon's solution to your previous question :-)