Spark SQL window function range bounds with a condition
My data looks like this:
+--------+--------+----+
|Sequence| type   | sg |
+--------+--------+----+
|       1| Pump   |  3 |
|       2| Pump   |  2 |
|       3| Inject |  4 |
|       4| Pump   |  5 |
|       5| Pump   |  3 |
|       6| pump   |  6 |
|       7| Inject |  7 |
|       8| Inject |  8 |
|       9| Pump   |  9 |
+--------+--------+----+
I want to add a new column based on the previous row's type value:

- If the previous type is Pump, set the new column to that previous row's sg value.
- If it is Inject, set it to the sum of the sg values of all preceding rows, going back until a Pump row is found (that Pump row's sg value is included in the sum).
For example:

For Sequence=2, the previous row's type is Pump, so the new column's value should be that row's sg value: 3.

For Sequence=9, the previous row's type is Inject, so the new column's value is the sum of the sg values of the previous three rows, because the row with Sequence=6 is the nearest preceding row with type=Pump. The new value is 8+7+6=21.
The final output should look like this:
+--------+--------+----+--------+
|Sequence| type   | sg | New sg |
+--------+--------+----+--------+
|       1| Pump   |  3 |      - |
|       2| Pump   |  2 |      3 |
|       3| Inject |  4 |      2 |
|       4| Pump   |  5 |      6 |
|       5| Pump   |  3 |      5 |
|       6| pump   |  6 |      3 |
|       7| Inject |  7 |      6 |
|       8| Inject |  8 |      7 |
|       9| Pump   |  9 |     21 |
+--------+--------+----+--------+
Based on your rules, this is just a set of window functions. The trick is to group each Pump row together with the Inject rows that follow it; a cumulative sum over the Pump rows identifies those groups.
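That grouping trick can be sketched in plain Python (the sample rows are hard-coded here for illustration): a running count of Pump rows assigns each Pump row, and the Inject rows after it, to one group.

```python
# The question's sample data as (Sequence, type, sg) tuples.
rows = [(1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4), (4, "Pump", 5),
        (5, "Pump", 3), (6, "pump", 6), (7, "Inject", 7), (8, "Inject", 8),
        (9, "Pump", 9)]

pump_grp = 0
groups = {}
for seq, typ, sg in rows:
    if typ.lower() == "pump":   # case-insensitive: row 6 has type 'pump'
        pump_grp += 1           # each Pump row starts a new group
    groups[seq] = pump_grp

# Rows 6, 7, 8 end up in the same group, so their sg values (6, 7, 8)
# are the ones summed for the row after them (Sequence=9): 6+7+8 = 21.
```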
The query is then:
select t.*,
       (case when lower(prev_type) = 'pump' then prev_sg
             else lag(pump_sg) over (order by Sequence)
        end) as your_value
from (select t.*,
             -- running sum of sg within each Pump-led group
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   lag(sg) over (order by Sequence) as prev_sg,
                   lag(type) over (order by Sequence) as prev_type,
                   -- cumulative count of Pump rows: each Pump starts a new group
                   -- (lower() because one row has type 'pump')
                   sum(case when lower(type) = 'pump' then 1 else 0 end)
                       over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;
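As a quick sanity check (my addition, not part of the original answer), the query can be run against the sample data using SQLite's window functions (SQLite 3.25+, as bundled with recent Python). The SQL is lightly adapted: Sequence as the ordering column, prev_sg in the Pump branch, and lower(type) so the mixed-case 'pump' row is counted.

```python
import sqlite3

# Build the sample table from the question in an in-memory database.
rows = [(1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4), (4, "Pump", 5),
        (5, "Pump", 3), (6, "pump", 6), (7, "Inject", 7), (8, "Inject", 8),
        (9, "Pump", 9)]
con = sqlite3.connect(":memory:")
con.execute("create table t (Sequence int, type text, sg int)")
con.executemany("insert into t values (?, ?, ?)", rows)

sql = """
select t.*,
       (case when lower(prev_type) = 'pump' then prev_sg
             else lag(pump_sg) over (order by Sequence)
        end) as your_value
from (select t.*,
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   lag(sg) over (order by Sequence) as prev_sg,
                   lag(type) over (order by Sequence) as prev_type,
                   sum(case when lower(type) = 'pump' then 1 else 0 end)
                       over (order by Sequence) as pump_grp
            from t
           ) t
     ) t
"""
# Map Sequence -> your_value; the worked examples give 3 for Sequence=2
# and 21 for Sequence=9.
result = {row[0]: row[-1] for row in con.execute(sql)}
```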
I think your rules make this more complicated than it needs to be; the special case for a previous row of 'Pump' is unnecessary. So:
select t.*,
       lag(pump_sg) over (order by Sequence) as your_value
from (select t.*,
             -- running sum of sg within each Pump-led group
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   -- cumulative count of Pump rows (lower() for the 'pump' row)
                   sum(case when lower(type) = 'pump' then 1 else 0 end)
                       over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;
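The simplified version can be checked the same way (again my own sketch, with the same SQLite and naming assumptions): because every Pump row starts a new group, the running group sum at the previous row already equals either that Pump's own sg or the Pump-plus-Injects total the rule asks for.

```python
import sqlite3

# Sample data from the question, loaded into an in-memory table.
rows = [(1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4), (4, "Pump", 5),
        (5, "Pump", 3), (6, "pump", 6), (7, "Inject", 7), (8, "Inject", 8),
        (9, "Pump", 9)]
con = sqlite3.connect(":memory:")
con.execute("create table t (Sequence int, type text, sg int)")
con.executemany("insert into t values (?, ?, ?)", rows)

sql = """
select t.*,
       lag(pump_sg) over (order by Sequence) as your_value
from (select t.*,
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   sum(case when lower(type) = 'pump' then 1 else 0 end)
                       over (order by Sequence) as pump_grp
            from t
           ) t
     ) t
"""
# Map Sequence -> your_value; the worked examples (Sequence=2 -> 3,
# Sequence=9 -> 21) come out without the prev_type special case.
result = {row[0]: row[-1] for row in con.execute(sql)}
```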