SQL Server: reducing the data in a SQL table that was created due to a bug
Unfortunately, because a software defect was not obvious enough in the development environment to be caught, we created a large number of SQL records that are not actually needed. These records do not harm data integrity or anything else; they are simply unnecessary. We are working with a database schema like the following:
entity_static (just some static data that won't change):
id | val1 | val2 | val3
-----------------------
1 | 50 | 183 | 93
2 | 60 | 823 | 123
entity_dynamic (some dynamic data we need a historical record of):
id | entity_static_id | val1 | val2 | valid_from | valid_to
-------------------------------------------------------------------------------
1 | 1 | 50 | 75 | 2018-01-01 00:00:00 | 2018-01-01 00:59:59
2 | 1 | 50 | 75 | 2018-01-01 01:00:00 | 2018-01-01 01:59:59
3 | 1 | 50 | 75 | 2018-01-01 02:00:00 | 2018-01-01 02:59:59
4 | 1 | 50 | 75 | 2018-01-01 03:00:00 | 2018-01-01 03:59:59
5 | 2 | 60 | 75 | 2018-01-01 00:00:00 | 2018-01-01 00:59:59
6 | 2 | 60 | 75 | 2018-01-01 01:00:00 | 2018-01-01 01:59:59
7 | 2 | 60 | 75 | 2018-01-01 02:00:00 | 2018-01-01 02:59:59
8 | 2 | 60 | 75 | 2018-01-01 03:00:00 | 2018-01-01 03:59:59
Besides val1 and val2 there are more columns; this is just an example. The entity_dynamic table describes which parameters were valid during a given period. It is not a point-in-time record (like sensor data). Therefore, all equal records could easily be aggregated into a single record, like this:
id | entity_static_id | val1 | val2 | valid_from | valid_to
-------------------------------------------------------------------------------
1 | 1 | 50 | 75 | 2018-01-01 00:00:00 | 2018-01-01 03:59:59
5 | 2 | 60 | 75 | 2018-01-01 00:00:00 | 2018-01-01 03:59:59
The valid_to column may contain NULL. My question now is: with which query can I aggregate similar records with consecutive validity ranges into one record? The grouping should be done via the foreign key entity_static_id.
with entity_dynamic as
(
select
*
from
(values
('1','1','50','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('2','1','50','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('3','1','50','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('4','1','50','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('5','2','60','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('6','2','60','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('7','2','60','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('8','2','60','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('9','1','60','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('10','1','60','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('11','2','70','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('12','2','70','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('13','2','60','75',' 2018-01-01 06:00:00 ',' 2018-01-01 06:59:59')
)
a(id , entity_static_id , val1 , val2 , valid_from , valid_to)
)
First, add a row number per unique combination of entity_static_id, val1 and val2 (unique_group), and a second row number per entity_static_id ordered by valid_from descending (rn):
,step1 as
(
select
id , entity_static_id , val1 , val2 , valid_from , valid_to
,row_number() over (partition by entity_static_id,val1,val2 order by valid_from) unique_group
,ROW_NUMBER() over (partition by entity_static_id order by valid_from desc) rn
from entity_dynamic
)
This gives:
+----------------------------------------------------------------------------------------+
|id|entity_static_id|val1|val2|valid_from |valid_to |unique_group|rn|
+----------------------------------------------------------------------------------------+
|10|1 |60 |75 | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2 |1 |
|9 |1 |60 |75 | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1 |2 |
|4 |1 |50 |75 | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4 |3 |
|3 |1 |50 |75 | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3 |4 |
|2 |1 |50 |75 | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2 |5 |
|1 |1 |50 |75 | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1 |6 |
|13|2 |60 |75 | 2018-01-01 06:00:00 | 2018-01-01 06:59:59|5 |1 |
|12|2 |70 |75 | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2 |2 |
|11|2 |70 |75 | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1 |3 |
|8 |2 |60 |75 | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4 |4 |
|7 |2 |60 |75 | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3 |5 |
|6 |2 |60 |75 | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2 |6 |
|5 |2 |60 |75 | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1 |7 |
+----------------------------------------------------------------------------------------+
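The effect of the two ROW_NUMBER() calls can be sketched outside SQL as well. The following Python snippet is a minimal re-implementation for illustration (not the query itself), numbering the rows of entity 1 the same way:

```python
from collections import defaultdict

# Rows of entity 1: (id, entity_static_id, val1, from_hour); val2 is
# constant in the sample and omitted here for brevity.
rows = [
    (1, 1, 50, 0), (2, 1, 50, 1), (3, 1, 50, 2), (4, 1, 50, 3),
    (9, 1, 60, 4), (10, 1, 60, 5),
]

def step1(rows):
    """Return {id: (unique_group, rn)} mimicking the two window functions."""
    ug_counter = defaultdict(int)   # per (entity, val1), valid_from ascending
    rn_counter = defaultdict(int)   # per entity, valid_from descending
    numbered = {}
    for r in sorted(rows, key=lambda r: r[3]):
        ug_counter[(r[1], r[2])] += 1
        numbered[r[0]] = [ug_counter[(r[1], r[2])]]
    for r in sorted(rows, key=lambda r: -r[3]):
        rn_counter[r[1]] += 1
        numbered[r[0]].append(rn_counter[r[1]])
    return {rid: tuple(v) for rid, v in numbered.items()}

print(step1(rows))  # id 10 -> (2, 1), id 9 -> (1, 2), id 1 -> (1, 6), ...
```

The output matches the unique_group and rn columns of the table above for entity 1.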
The second step adds unique_group and rn together. Because rn is numbered in descending order, each run of rows with equal values gets the same sum, called tar in this example:
,step2 as
(
select
*
,unique_group+rn tar
from step1
)
Step 2 gives:
+--------------------------------------------------------------------------------------------+
|id|entity_static_id|val1|val2|valid_from |valid_to |unique_group|rn|tar|
+--------------------------------------------------------------------------------------------+
|10|1 |60 |75 | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2 |1 |3 |
|9 |1 |60 |75 | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1 |2 |3 |
|4 |1 |50 |75 | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4 |3 |7 |
|3 |1 |50 |75 | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3 |4 |7 |
|2 |1 |50 |75 | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2 |5 |7 |
|1 |1 |50 |75 | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1 |6 |7 |
|13|2 |60 |75 | 2018-01-01 06:00:00 | 2018-01-01 06:59:59|5 |1 |6 |
|12|2 |70 |75 | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2 |2 |4 |
|11|2 |70 |75 | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1 |3 |4 |
|8 |2 |60 |75 | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4 |4 |8 |
|7 |2 |60 |75 | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3 |5 |8 |
|6 |2 |60 |75 | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2 |6 |8 |
|5 |2 |60 |75 | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1 |7 |8 |
+--------------------------------------------------------------------------------------------+
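Why tar stays constant inside a run can be checked with a tiny sketch: within a run of equal values, unique_group goes up by 1 per row while the descending rn goes down by 1, so their sum does not change until the value changes. Taking the (unique_group, rn) pairs of entity 1 from the table above:

```python
# (unique_group, rn) pairs of entity_static_id = 1, in the order shown above
pairs = [(2, 1), (1, 2), (4, 3), (3, 4), (2, 5), (1, 6)]
tar = [ug + rn for ug, rn in pairs]
print(tar)  # [3, 3, 7, 7, 7, 7] -- one constant value per run, as in the tar column
```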
Finally, by using MIN and MAX with a GROUP BY on the right columns, you get the effective start and end dates:
select
min(id) id
,entity_static_id
,val1
,val2
,min(valid_from) valid_from
,max(valid_to) valid_to
from step2
group by entity_static_id,val1
,val2
,tar
order by entity_static_id,valid_from
Altogether, the code is:
with entity_dynamic as
(
select
*
from
(values
('1','1','50','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('2','1','50','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('3','1','50','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('4','1','50','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('5','2','60','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('6','2','60','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('7','2','60','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('8','2','60','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('9','1','60','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('10','1','60','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('11','2','70','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('12','2','70','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('13','2','60','75',' 2018-01-01 06:00:00 ',' 2018-01-01 06:59:59')
)
a(id , entity_static_id , val1 , val2 , valid_from , valid_to)
)
,step1 as
(
select
id , entity_static_id , val1 , val2 , valid_from , valid_to
,row_number() over (partition by entity_static_id,val1,val2 order by valid_from) unique_group
,ROW_NUMBER() over (partition by entity_static_id order by valid_from desc) rn
from entity_dynamic
)
,step2 as
(
select
*
,unique_group+rn tar
from step1
)
select
min(id) id
,entity_static_id
,val1
,val2
,min(valid_from) valid_from
,max(valid_to) valid_to
from step2
group by entity_static_id,val1
,val2
,tar
order by entity_static_id,valid_from
The result is:
+------------------------------------------------------------------------+
|id|entity_static_id|val1|val2|valid_from |valid_to |
+------------------------------------------------------------------------+
|1 |1 |50 |75 | 2018-01-01 00:00:00 | 2018-01-01 03:59:59|
|10|1 |60 |75 | 2018-01-01 04:00:00 | 2018-01-01 05:59:59|
|5 |2 |60 |75 | 2018-01-01 00:00:00 | 2018-01-01 03:59:59|
|11|2 |70 |75 | 2018-01-01 04:00:00 | 2018-01-01 05:59:59|
|13|2 |60 |75 | 2018-01-01 06:00:00 | 2018-01-01 06:59:59|
+------------------------------------------------------------------------+
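The same collapsing of consecutive equal-valued rows can be cross-checked with a small Python re-implementation of the grouping logic (an illustration using simplified timestamps, not the posted T-SQL):

```python
# Sample rows: (id, entity_static_id, val1, val2, from_hour, to_hour).
# Hours stand in for "2018-01-01 HH:00:00" / "2018-01-01 HH:59:59".
rows = [
    (1, 1, 50, 75, 0, 0), (2, 1, 50, 75, 1, 1), (3, 1, 50, 75, 2, 2),
    (4, 1, 50, 75, 3, 3), (5, 2, 60, 75, 0, 0), (6, 2, 60, 75, 1, 1),
    (7, 2, 60, 75, 2, 2), (8, 2, 60, 75, 3, 3), (9, 1, 60, 75, 4, 4),
    (10, 1, 60, 75, 5, 5), (11, 2, 70, 75, 4, 4), (12, 2, 70, 75, 5, 5),
    (13, 2, 60, 75, 6, 6),
]

def merge_runs(rows):
    """Collapse consecutive rows with equal (val1, val2) per entity.

    Keeps the id of the chronologically first row of each run (the T-SQL
    above uses MIN(id) on varchar ids, hence '10' for the 9/10 run there).
    """
    out = []
    for r in sorted(rows, key=lambda r: (r[1], r[4])):
        if out and tuple(out[-1][1:4]) == r[1:4]:
            out[-1][5] = r[5]          # same run: extend valid_to
        else:
            out.append(list(r))        # value changed: start a new run
    return [tuple(r) for r in out]

for row in merge_runs(rows):
    print(row)
```

The five merged intervals match the result table above.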
If the groups are defined by entity_static_id alone (as in the sample data below, where the values of an entity never change), this is all you need:
with entity_dynamic as
( select *
from (values ('1' ,'1','50','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('2' ,'1','50','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('3' ,'1','50','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('4' ,'1','50','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('5' ,'2','60','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('6' ,'2','60','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('7' ,'2','60','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('8' ,'2','60','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
) a(id , entity_static_id , val1 , val2 , valid_from , valid_to)
)
, entity_dynamicPlus as
( select *
, ROW_NUMBER() over (partition by entity_static_id order by valid_to asc ) as rnA
, ROW_NUMBER() over (partition by entity_static_id order by valid_to desc) as rnD
from entity_dynamic
)
select eStart.id, eStart.entity_static_id, eStart.val1, eStart.val2, eStart.valid_from
, eEnd.valid_to
from entity_dynamicPlus as eStart
join entity_dynamicPlus as eEnd
on eStart.entity_static_id = eEnd.entity_static_id
and eStart.rnA = 1
and eEnd.rnD = 1
order by eStart.entity_static_id
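The self-join above simply pairs, per entity_static_id, the row with the earliest valid_to (rnA = 1) and the row with the latest (rnD = 1). A minimal Python sketch of that pairing (for illustration only; like the answer, it assumes the values of an entity never change):

```python
# Rows 1-8 of the sample: (id, entity_static_id, val1, val2, from_hour, to_hour)
rows = [
    (1, 1, 50, 75, 0, 0), (2, 1, 50, 75, 1, 1),
    (3, 1, 50, 75, 2, 2), (4, 1, 50, 75, 3, 3),
    (5, 2, 60, 75, 0, 0), (6, 2, 60, 75, 1, 1),
    (7, 2, 60, 75, 2, 2), (8, 2, 60, 75, 3, 3),
]

def collapse_per_entity(rows):
    """Per entity: id/values/valid_from of the earliest row, valid_to of the latest."""
    by_entity = {}
    for r in sorted(rows, key=lambda r: r[5]):      # ascending valid_to
        first, _ = by_entity.get(r[1], (r, r))
        by_entity[r[1]] = (first, r)                # r is the latest so far
    return {e: (first[0], first[2], first[3], first[4], last[5])
            for e, (first, last) in by_entity.items()}

print(collapse_per_entity(rows))
# {1: (1, 50, 75, 0, 3), 2: (5, 60, 75, 0, 3)}
```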
For example, look at ROW_NUMBER() to find duplicate values that can be handled.
Finding the duplicate records is not the problem; that could easily be done with a subquery counting similar records. My problem was determining which records can be aggregated (given their chronological order) and how to aggregate them. I will look into this in depth shortly. For now this looks like exactly what I was looking for, thanks!
Thank you, that worked out well. Since I could not really figure out how to correctly handle NULL values in the valid_to column, I only aggregated the records without NULL values and then applied a UNION. Overall this shrank the table by 99.73% (record count), which is entirely plausible given that each record had close to 1k duplicates on average.
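The NULL handling described in the last comment can be sketched as follows (an assumed reconstruction of that approach, not the commenter's actual code): rows with valid_to = NULL are set aside, the remaining rows are collapsed, and the open-ended rows are unioned back unchanged.

```python
# (id, entity_static_id, val1, val2, from_hour, to_hour); None = open-ended
rows = [
    (1, 1, 50, 75, 0, 0),
    (2, 1, 50, 75, 1, 1),
    (3, 1, 50, 75, 2, None),   # valid_to IS NULL: currently valid
]

closed = [r for r in rows if r[5] is not None]
still_open = [r for r in rows if r[5] is None]

def merge_runs(rows):
    # Collapse consecutive rows with equal (entity, val1, val2), as before.
    out = []
    for r in sorted(rows, key=lambda r: (r[1], r[4])):
        if out and tuple(out[-1][1:4]) == r[1:4]:
            out[-1][5] = r[5]
        else:
            out.append(list(r))
    return [tuple(r) for r in out]

# Aggregate only the closed rows, then UNION the open-ended ones back:
result = merge_runs(closed) + still_open
print(result)  # [(1, 1, 50, 75, 0, 1), (3, 1, 50, 75, 2, None)]
```

Note that the open-ended row is not merged into the adjacent run, matching the comment's simplification.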