Spark SQL window function range boundaries with a condition


My data looks like this:

+----------+--------+----+
| Sequence | type   | sg |
+----------+--------+----+
|        1 | Pump   |  3 |
|        2 | Pump   |  2 |
|        3 | Inject |  4 |
|        4 | Pump   |  5 |
|        5 | Pump   |  3 |
|        6 | pump   |  6 |
|        7 | Inject |  7 |
|        8 | Inject |  8 |
|        9 | Pump   |  9 |
+----------+--------+----+

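I want to add a new column that depends on the type of the previous row.

If the previous row's type is Pump, the new column is set to that row's sg value.

If it is Inject, the new column is the sum of sg over all preceding rows, going back until a row with type Pump is found (that row's sg is included in the sum).

Example: for Sequence=2, the previous row's type is Pump, so the new column should be that row's sg value: 3.

For Sequence=9, the previous row's type is Inject, so the new column is the sum of sg over the previous three rows, because the Sequence=6 row is the nearest preceding row with type=Pump. The new column's value is 8+7+6=21.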

The final output should look like this:

+----------+--------+----+--------+
| Sequence | type   | sg | New sg |
+----------+--------+----+--------+
|        1 | Pump   |  3 |      - |
|        2 | Pump   |  2 |      3 |
|        3 | Inject |  4 |      2 |
|        4 | Pump   |  5 |      6 |
|        5 | Pump   |  3 |      5 |
|        6 | pump   |  6 |      3 |
|        7 | Inject |  7 |      6 |
|        8 | Inject |  8 |      7 |
|        9 | Pump   |  9 |     21 |
+----------+--------+----+--------+

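Based on your rules, this is just a set of window functions. The trick is to aggregate the sg values by group, where a group is a Pump row together with the Inject rows that follow it. A cumulative sum over the Pump rows identifies the groups.

The query is then: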

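select t.*,
       -- previous row is a Pump: take its sg directly
       (case when lower(prev_type) = 'pump' then prev_sg
             -- otherwise: sum from the previous row back to its Pump
             else lag(pump_sg) over (order by Sequence)
        end) as your_value
from (select t.*,
             -- ordered window: running sum within each Pump group
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   lag(sg) over (order by Sequence) as prev_sg,
                   lag(type) over (order by Sequence) as prev_type,
                   -- running count of Pump rows = group id
                   -- (lower() because the sample data mixes 'Pump' and 'pump')
                   sum(case when lower(type) = 'pump' then 1 else 0 end) over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;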
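
The grouping step can be sketched in plain Python to see what the cumulative sum produces. This is only an illustration, not Spark code: `rows` and `is_pump` are hypothetical names, `pump_grp` mirrors the column of the same name in the query, and the comparison is made case-insensitive on the assumption that 'pump' in the sample data means the same as 'Pump'.

```python
from itertools import accumulate

# Sample rows from the question: (Sequence, type, sg)
rows = [
    (1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4),
    (4, "Pump", 5), (5, "Pump", 3), (6, "pump", 6),
    (7, "Inject", 7), (8, "Inject", 8), (9, "Pump", 9),
]

# 1 for every Pump row, 0 otherwise (case-insensitive: assumption, see above)
is_pump = [1 if typ.lower() == "pump" else 0 for _, typ, _ in rows]

# Running count of Pump rows: every Pump starts a new group and the
# Inject rows that follow it inherit the same group id.
pump_grp = list(accumulate(is_pump))

for (seq, typ, sg), grp in zip(rows, pump_grp):
    print(seq, typ, sg, grp)
```

Rows 6, 7 and 8 end up in the same group, which is why summing sg within a group gives 6+7+8=21 for the row that follows them.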
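I think your rules are too complicated: you don't need the special case for when the previous row is a Pump, because the running sum at a Pump row is just that row's own sg. So:

select t.*,
       lag(pump_sg) over (order by Sequence) as your_value
from (select t.*,
             -- ordered window: running sum within each Pump group
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   -- running count of Pump rows = group id
                   -- (lower() because the sample data mixes 'Pump' and 'pump')
                   sum(case when lower(type) = 'pump' then 1 else 0 end) over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;

Thanks Gordon :)!!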
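
A minimal sketch for trying this kind of query on the sample data without a Spark cluster, using Python's built-in sqlite3 module as a stand-in for Spark SQL. Several things here are assumptions rather than part of the answer: the table name `t`, ordering by `Sequence`, the aliases `g`, `s` and `new_sg`, the case-insensitive match on 'pump', and writing the per-group sum as a running sum (an ordered window) so each row only sees rows back to its Pump. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows from the question: (Sequence, type, sg)
rows = [
    (1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4),
    (4, "Pump", 5), (5, "Pump", 3), (6, "pump", 6),
    (7, "Inject", 7), (8, "Inject", 8), (9, "Pump", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Sequence INTEGER, type TEXT, sg INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

query = """
SELECT Sequence, type, sg,
       -- value of the running group sum on the previous row
       LAG(pump_sg) OVER (ORDER BY Sequence) AS new_sg
FROM (SELECT g.*,
             -- running sum of sg inside each Pump group
             SUM(sg) OVER (PARTITION BY pump_grp ORDER BY Sequence) AS pump_sg
      FROM (SELECT t.*,
                   -- running count of Pump rows identifies the group
                   SUM(CASE WHEN lower(type) = 'pump' THEN 1 ELSE 0 END)
                       OVER (ORDER BY Sequence) AS pump_grp
            FROM t
           ) AS g
     ) AS s
ORDER BY Sequence
"""

result = conn.execute(query).fetchall()
new_sg = [row[3] for row in result]
print(new_sg)  # [None, 3, 2, 6, 5, 3, 6, 13, 21]
```

The first row has no predecessor, so it comes back as NULL (None). For Sequence=8 the sketch returns 13 (7+6), which is what the stated rule gives when applied literally; the expected-output table in the question lists 7 for that row.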