How to transform data in sql or pyspark


Input dataset:

Act status  from                  to
123 1       2011-03-29 00:00:00   2011-03-29 23:59:59
123 1       2011-03-30 00:00:00   2011-03-30 23:59:59
123 1       2011-03-31 00:00:00   2011-03-31 23:59:59
123 0       2011-04-01 00:00:00   2011-04-03 23:59:59
123 0       2011-04-04 00:00:00   2011-04-04 23:59:59
123 0       2011-04-05 00:00:00   2011-04-05 23:59:59
123 1       2011-04-06 00:00:00   2011-04-06 23:59:59
123 1       2011-04-07 00:00:00   2011-04-07 23:59:59
123 1       2011-04-08 00:00:00   2011-04-10 23:59:59
I want the output to be:

act status  from                  to
123 1       2011-03-29 00:00:00   2011-03-31 23:59:59
123 0       2011-04-01 00:00:00   2011-04-05 23:59:59
123 1       2011-04-06 00:00:00   2011-04-10 23:59:59

You can use the lag function to track changes in status. After applying the lag function, you use the result to build a rank, and then use that rank as a groupBy parameter. For example:

status  lag   change  rank
1       null  1       1
1       1     0       1
0       1     1       2
1       0     1       3
0       1     1       4
1       0     1       5
1       1     0       5
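
A minimal PySpark sketch of this approach (an illustration, not the original answer's code; it assumes a DataFrame named df with the question's columns act, status, from, to, and uses a running sum of the change flag as the rank):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# The question's input data, as a hypothetical DataFrame.
df = spark.createDataFrame(
    [(123, 1, "2011-03-29 00:00:00", "2011-03-29 23:59:59"),
     (123, 1, "2011-03-30 00:00:00", "2011-03-30 23:59:59"),
     (123, 1, "2011-03-31 00:00:00", "2011-03-31 23:59:59"),
     (123, 0, "2011-04-01 00:00:00", "2011-04-03 23:59:59"),
     (123, 0, "2011-04-04 00:00:00", "2011-04-04 23:59:59"),
     (123, 0, "2011-04-05 00:00:00", "2011-04-05 23:59:59"),
     (123, 1, "2011-04-06 00:00:00", "2011-04-06 23:59:59"),
     (123, 1, "2011-04-07 00:00:00", "2011-04-07 23:59:59"),
     (123, 1, "2011-04-08 00:00:00", "2011-04-10 23:59:59")],
    ["act", "status", "from", "to"])

w = Window.partitionBy("act").orderBy("from")

result = (df
    # change = 1 whenever status differs from the previous row's status
    # (lag is null on the first row, so the when() falls through to 1).
    .withColumn("change",
                F.when(F.lag("status").over(w) == F.col("status"), 0)
                 .otherwise(1))
    # A running sum of change numbers the consecutive runs of equal status.
    .withColumn("rank", F.sum("change").over(w))
    .groupBy("act", "status", "rank")
    .agg(F.min("from").alias("from"), F.max("to").alias("to"))
    .drop("rank")
    .orderBy("from"))

result.show(truncate=False)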
If there are no gaps in the dates, I would suggest using a difference of row numbers:

-- `from` and `to` are reserved words, so they are quoted here
-- (backticks work in Spark SQL and MySQL; use double quotes elsewhere).
select act, status, min(`from`) as `from`, max(`to`) as `to`
from (select t.*,
             row_number() over (partition by act order by `from`) as seqnum,
             row_number() over (partition by act, status order by `from`) as seqnum_2
      from t
     ) t
group by act, status, (seqnum - seqnum_2);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will see that the difference between seqnum and seqnum_2 is constant on adjacent rows that have the same status.
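
To make that concrete, here is what the subquery yields for the sample input (a worked illustration computed from the data above, not part of the original answer):

status  seqnum  seqnum_2  seqnum - seqnum_2
1       1       1         0
1       2       2         0
1       3       3         0
0       4       1         3
0       5       2         3
0       6       3         3
1       7       4         3
1       8       5         3
1       9       6         3

Grouping by (status, seqnum - seqnum_2) therefore separates the first run of status 1 (difference 0) from the later run of status 1 (difference 3).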


Note: I would suggest that you fix the data model so you are not missing the last second of each day. The to datetime of one row should be the same as the from datetime of the next row. Queries can then use >= and <.
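
For example, with half-open intervals a point-in-time lookup becomes (a sketch, reusing the hypothetical df from above; comparing the strings works because the timestamps are in a lexicographically sortable format):

point = "2011-04-02 12:00:00"
# Start inclusive, end exclusive: point >= from and point < to,
# so no second between two rows is ever lost or double-counted.
df.filter((F.col("from") <= F.lit(point)) & (F.col("to") > F.lit(point)))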
You should simply explain how the output is computed from the input; it is not obvious.
Worked like a charm.