Spark SQL window function range boundaries with a condition [sql, pyspark-sql, window-functions]


My data looks like this:

+----------+--------+----+
| Sequence | type   | sg |
+----------+--------+----+
|        1 | Pump   |  3 |
|        2 | Pump   |  2 |
|        3 | Inject |  4 |
|        4 | Pump   |  5 |
|        5 | Pump   |  3 |
|        6 | pump   |  6 |
|        7 | Inject |  7 |
|        8 | Inject |  8 |
|        9 | Pump   |  9 |
+----------+--------+----+

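I want to add a new column whose value depends on the type of the previous row.

If the previous row's type is Pump, the new column should be set to that row's sg value.

If the previous row's type is Inject, the new column should be the sum of the sg values of all preceding rows, going back until a row with type Pump is found (that row's sg value is included in the sum).

Example:

For Sequence=2, the previous row's type is Pump, so the new column should hold that row's sg value: 3.

For Sequence=9, the previous row's type is Inject, so the new column is the sum of sg over the previous three rows, because the Sequence=6 row is the nearest preceding row with type=Pump. The new column's value is 8+7+6=21.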
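
To pin the rule down, here is a minimal plain-Python sketch of it, read literally from the description above. The names ROWS and new_sg are illustrative only and do not come from the question, and type is compared case-insensitively on the assumption that the lowercase pump in the sample data is meant to be a Pump row.

```python
# Sample data from the question: (Sequence, type, sg), ordered by Sequence.
ROWS = [
    (1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4),
    (4, "Pump", 5), (5, "Pump", 3), (6, "pump", 6),
    (7, "Inject", 7), (8, "Inject", 8), (9, "Pump", 9),
]


def new_sg(rows):
    """Return the new column as a list, one entry per row (hypothetical helper)."""
    result = []
    for i in range(len(rows)):
        if i == 0:
            result.append(None)  # the first row has no previous row
            continue
        total = 0
        j = i - 1
        # Walk backwards from the previous row, adding sg values,
        # and stop right after the nearest Pump row has been added.
        while j >= 0:
            _, row_type, sg = rows[j]
            total += sg
            if row_type.lower() == "pump":  # assumption: 'pump' counts as Pump
                break
            j -= 1
        result.append(total)
    return result


print(new_sg(ROWS))
```

When the previous row is itself a Pump, the walk stops after one step, so the single loop covers both cases of the rule. On the sample data this gives None, 3, 2, 6, 5, 3, 6, 13, 21. Read this way, the rule yields 7+6=13 for Sequence=8, whereas the expected output listed in the question shows 7 for that row, so that row is worth double-checking.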

The final output should look like this:

+----------+--------+----+--------+
| Sequence | type   | sg | New sg |
+----------+--------+----+--------+
|        1 | Pump   |  3 |      - |
|        2 | Pump   |  2 |      3 |
|        3 | Inject |  4 |      2 |
|        4 | Pump   |  5 |      6 |
|        5 | Pump   |  3 |      5 |
|        6 | pump   |  6 |      3 |
|        7 | Inject |  7 |      6 |
|        8 | Inject |  8 |      7 |
|        9 | Pump   |  9 |     21 |
+----------+--------+----+--------+

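Based on your rules, this is just a set of window functions. The trick is to group each Pump row together with the Inject rows that follow it and aggregate sg within that group. A cumulative count of Pump rows identifies the groups, and a running sum of sg inside each group, read from the previous row, gives the value.
The query is then:
select t.*,
       -- Previous row is a Pump: take its sg. Otherwise take the running sum as of the previous row.
       (case when lower(prev_type) = 'pump' then prev_sg
             else lag(pump_sg) over (order by Sequence)
        end) as your_value
from (select t.*,
             -- Running sum of sg within each group, up to and including the current row.
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   lag(sg) over (order by Sequence) as prev_sg,
                   lag(type) over (order by Sequence) as prev_type,
                   -- Each Pump row starts a new group; lower() covers the 'pump' spelling in the sample data.
                   sum(case when lower(type) = 'pump' then 1 else 0 end) over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;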
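
The grouping step can be easier to follow outside SQL. The sketch below, in plain Python, computes the same three intermediate columns one at a time: a cumulative count of Pump rows (the group id), a running sum of sg inside each group, and that running sum shifted down by one row. The function names and the SAMPLE list are hypothetical, introduced only for illustration, and the case-insensitive match on type is an assumption based on the sample data.

```python
from itertools import accumulate

# Sample data from the question: (Sequence, type, sg), ordered by Sequence.
SAMPLE = [
    (1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4),
    (4, "Pump", 5), (5, "Pump", 3), (6, "pump", 6),
    (7, "Inject", 7), (8, "Inject", 8), (9, "Pump", 9),
]


def pump_groups(rows):
    """Cumulative count of Pump rows: every Pump row opens a new group."""
    flags = [1 if t.lower() == "pump" else 0 for _, t, _ in rows]
    return list(accumulate(flags))


def running_sum_per_group(rows, groups):
    """Running sum of sg that restarts whenever the group id changes."""
    sums = []
    for i, (_, _, sg) in enumerate(rows):
        if i > 0 and groups[i] == groups[i - 1]:
            sums.append(sums[-1] + sg)  # same group: keep accumulating
        else:
            sums.append(sg)             # new group: restart at this row's sg
    return sums


def lag_one(values):
    """Shift a column down by one row, like lag(col) over (order by ...)."""
    return [None] + values[:-1]


groups = pump_groups(SAMPLE)
running = running_sum_per_group(SAMPLE, groups)
print(groups)            # group id per row
print(running)           # running sum of sg inside each group
print(lag_one(running))  # value seen from the following row
```

Because a Pump row always restarts the running sum, the lagged value after a Pump row is just that row's sg, which is why the two cases of the rule collapse into one computation.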

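I think your rules are more complicated than they need to be, though. You don't need a special case for the previous row being a Pump: a Pump row starts a new group, so its running sum is just its own sg. So:
select t.*,
       lag(pump_sg) over (order by Sequence) as your_value
from (select t.*,
             -- Running sum of sg within each group, up to and including the current row.
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   -- Each Pump row starts a new group; lower() covers the 'pump' spelling in the sample data.
                   sum(case when lower(type) = 'pump' then 1 else 0 end) over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;

Thanks, Gordon :)!!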
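
One way to sanity-check a query of this shape without a Spark cluster is to run it in SQLite through Python's sqlite3 module. This is only a sketch under stated assumptions: SQLite (3.25 or later, for window functions) stands in for Spark SQL, the table t and its column names follow the sample data, type is matched case-insensitively, and the inner sum is written as a running sum (order by inside the partition) so that each row only sees earlier rows of its group. The names DATA, QUERY and run_query are hypothetical.

```python
import sqlite3

# Sample data from the question: (Sequence, type, sg).
DATA = [
    (1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4),
    (4, "Pump", 5), (5, "Pump", 3), (6, "pump", 6),
    (7, "Inject", 7), (8, "Inject", 8), (9, "Pump", 9),
]

QUERY = """
select Sequence,
       lag(pump_sg) over (order by Sequence) as new_sg
from (select t.*,
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   sum(case when lower(type) = 'pump' then 1 else 0 end)
                       over (order by Sequence) as pump_grp
            from t
           ) t
     ) t
order by Sequence
"""


def run_query(rows):
    """Load rows into an in-memory SQLite table t and run QUERY against it."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("create table t (Sequence integer, type text, sg integer)")
        con.executemany("insert into t values (?, ?, ?)", rows)
        return con.execute(QUERY).fetchall()
    finally:
        con.close()


for sequence, value in run_query(DATA):
    print(sequence, value)
```

The same statement should carry over to spark.sql(...) once the data is registered as a temporary view named t, although that has not been run here.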