How to transform data in SQL or PySpark
Input dataset:
Act status from to
123 1 2011-03-29 00:00:00 2011-03-29 23:59:59
123 1 2011-03-30 00:00:00 2011-03-30 23:59:59
123 1 2011-03-31 00:00:00 2011-03-31 23:59:59
123 0 2011-04-01 00:00:00 2011-04-03 23:59:59
123 0 2011-04-04 00:00:00 2011-04-04 23:59:59
123 0 2011-04-05 00:00:00 2011-04-05 23:59:59
123 1 2011-04-06 00:00:00 2011-04-06 23:59:59
123 1 2011-04-07 00:00:00 2011-04-07 23:59:59
123 1 2011-04-08 00:00:00 2011-04-10 23:59:59
I would like the output to be:
act status from to
123 1 2011-03-29 00:00:00 2011-03-31 23:59:59
123 0 2011-04-01 00:00:00 2011-04-05 23:59:59
123 1 2011-04-06 00:00:00 2011-04-10 23:59:59
You can use the lag function to track changes in status. After applying lag, you derive a rank from the result and use that rank as the groupBy parameter. For example:
status lag  change rank
1      null 1      1
1      1    0      1
0      1    1      2
1      0    1      3
0      1    1      4
1      0    1      5
1      1    0      5
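The lag → change → rank idea above can be sketched in plain Python (this is a simulation of the window logic, not actual PySpark code; `rank_by_change` is an illustrative helper name, not from any library):

```python
# change = 1 where status differs from the previous row (the lag value);
# rank is the running sum of change, so rows sharing a rank form one group.
def rank_by_change(statuses):
    ranks = []
    rank = 0
    prev = None  # lag(status) is null for the first row
    for s in statuses:
        if prev is None or s != prev:  # status changed -> start a new group
            rank += 1
        ranks.append(rank)
        prev = s
    return ranks

# The nine input rows have statuses 1,1,1,0,0,0,1,1,1 -> three groups
print(rank_by_change([1, 1, 1, 0, 0, 0, 1, 1, 1]))  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Grouping by (act, status, rank) and taking min(from) / max(to) then yields the three merged rows of the desired output.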
If there are no gaps in the dates, I would suggest using a difference of row numbers:
-- from and to are reserved words, so they are quoted with backticks (Spark SQL style)
select act, status, min(`from`) as `from`, max(`to`) as `to`
from (select t.*,
             row_number() over (partition by act order by `from`) as seqnum,
             row_number() over (partition by act, status order by `from`) as seqnum_2
      from t
     ) t
group by act, status, (seqnum - seqnum_2);
Why this works is a bit tricky to explain. But if you look at the results of the subquery, you will see that the difference between seqnum
and seqnum_2
is constant on adjacent rows that have the same status.
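The row-number-difference trick can be traced in plain Python on the question's data (an illustrative simulation of the SQL, not library code; variable names mirror the query):

```python
from itertools import groupby

# seqnum numbers all rows per act; seqnum_2 numbers rows per (act, status);
# their difference is constant within each run of identical statuses.
rows = [
    ("123", 1, "2011-03-29 00:00:00", "2011-03-29 23:59:59"),
    ("123", 1, "2011-03-30 00:00:00", "2011-03-30 23:59:59"),
    ("123", 1, "2011-03-31 00:00:00", "2011-03-31 23:59:59"),
    ("123", 0, "2011-04-01 00:00:00", "2011-04-03 23:59:59"),
    ("123", 0, "2011-04-04 00:00:00", "2011-04-04 23:59:59"),
    ("123", 0, "2011-04-05 00:00:00", "2011-04-05 23:59:59"),
    ("123", 1, "2011-04-06 00:00:00", "2011-04-06 23:59:59"),
    ("123", 1, "2011-04-07 00:00:00", "2011-04-07 23:59:59"),
    ("123", 1, "2011-04-08 00:00:00", "2011-04-10 23:59:59"),
]

# assign seqnum and seqnum_2 (rows are already ordered by `from`,
# and there is a single act, so enumeration stands in for the partition)
counters = {}
keyed = []
for i, (act, status, frm, to) in enumerate(rows):
    seqnum = i + 1                       # row_number() over (partition by act)
    counters[(act, status)] = counters.get((act, status), 0) + 1
    seqnum_2 = counters[(act, status)]   # row_number() over (partition by act, status)
    keyed.append((act, status, seqnum - seqnum_2, frm, to))

# group by (act, status, seqnum - seqnum_2); min(from) / max(to) are the
# first / last timestamps of each run because the rows are ordered
merged = [
    (grp[0][0], grp[0][1], grp[0][3], grp[-1][4])
    for _, g in groupby(keyed, key=lambda r: r[:3])
    for grp in [list(g)]
]
for m in merged:
    print(m)
```

The printed rows match the desired output exactly: one merged interval per run of equal status.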
Note: I would suggest fixing the data model so that you do not lose the last second of each day. The
to
datetime of one row should be the same as the
from
datetime of the next row; queries would then use >=
and <.
Comment: You should simply explain how the output is computed from the input; it is not obvious.
Comment: Worked like a charm.
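The half-open convention recommended in the note can be illustrated in plain Python (`status_at` and the adjusted interval endpoints are hypothetical, shown only to demonstrate the >= / < lookup):

```python
# With half-open intervals [from, to), one row's `to` equals the next row's
# `from`, no second is lost, and every instant matches exactly one interval.
intervals = [
    ("2011-03-29 00:00:00", "2011-04-01 00:00:00", 1),
    ("2011-04-01 00:00:00", "2011-04-06 00:00:00", 0),
    ("2011-04-06 00:00:00", "2011-04-11 00:00:00", 1),
]

def status_at(ts):
    # plain string comparison works because the timestamps are ISO-formatted
    for frm, to, status in intervals:
        if frm <= ts < to:  # the >= and < the note recommends
            return status
    return None

print(status_at("2011-03-31 23:59:59.5"))  # 1 -- a sub-second instant, not lost
print(status_at("2011-04-01 00:00:00"))    # 0 -- boundary belongs to the next row
```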