SQL Server: reducing data in a SQL table that was created due to a bug


Unfortunately, due to a software bug that was not noticeable enough in the development environment to be caught, we created a huge number of SQL records that are not actually needed. These records don't harm data integrity or anything else; they are simply unnecessary.

We are working with a database schema that looks like this:

entity_static (just some static data that won't change):

id | val1 | val2 | val3
-----------------------
1  | 50   | 183  | 93
2  | 60   | 823  | 123


entity_dynamic (some dynamic data we need a historical record of):

id | entity_static_id | val1 | val2 | valid_from          | valid_to
-------------------------------------------------------------------------------
1  | 1                | 50   | 75   | 2018-01-01 00:00:00 | 2018-01-01 00:59:59
2  | 1                | 50   | 75   | 2018-01-01 01:00:00 | 2018-01-01 01:59:59
3  | 1                | 50   | 75   | 2018-01-01 02:00:00 | 2018-01-01 02:59:59
4  | 1                | 50   | 75   | 2018-01-01 03:00:00 | 2018-01-01 03:59:59
5  | 2                | 60   | 75   | 2018-01-01 00:00:00 | 2018-01-01 00:59:59
6  | 2                | 60   | 75   | 2018-01-01 01:00:00 | 2018-01-01 01:59:59
7  | 2                | 60   | 75   | 2018-01-01 02:00:00 | 2018-01-01 02:59:59
8  | 2                | 60   | 75   | 2018-01-01 03:00:00 | 2018-01-01 03:59:59
There are more columns than just val1 and val2; this is only an example.

The entity_dynamic table describes which parameters were valid during a given time period. It is not a point-in-time record (such as sensor data).

So all equal records could easily be aggregated into a single record, like this:

id | entity_static_id | val1 | val2 | valid_from          | valid_to
-------------------------------------------------------------------------------
1  | 1                | 50   | 75   | 2018-01-01 00:00:00 | 2018-01-01 03:59:59
5  | 2                | 60   | 75   | 2018-01-01 00:00:00 | 2018-01-01 03:59:59
The data in the valid_to column may be NULL.

My question now is: with what query can I aggregate similar records with contiguous validity ranges into a single record? Grouping should be done via the foreign key on entity_static_id.

with entity_dynamic  as
(
select
*
from 
(values
('1','1','50','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('2','1','50','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('3','1','50','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('4','1','50','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('5','2','60','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('6','2','60','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('7','2','60','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('8','2','60','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('9','1','60','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('10','1','60','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('11','2','70','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('12','2','70','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('13','2','60','75',' 2018-01-01 06:00:00 ',' 2018-01-01 06:59:59')
)
 a(id , entity_static_id , val1 , val2 , valid_from , valid_to)
 )
 ,
First, add a row number for each unique combination of val1 and val2 per entity_static_id (the unique group), plus a row number per entity_static_id:

 step1 as
 (
 select 
 id , entity_static_id , val1 , val2 , valid_from , valid_to
 ,row_number() over (partition by entity_static_id,val1,val2 order by valid_from) unique_group
 ,ROW_NUMBER() over (partition by entity_static_id order by valid_from desc) rn

 from entity_dynamic 
)
This gives:

+----------------------------------------------------------------------------------------+
|id|entity_static_id|val1|val2|valid_from           |valid_to            |unique_group|rn|
+----------------------------------------------------------------------------------------+
|10|1               |60  |75  | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2           |1 |
|9 |1               |60  |75  | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1           |2 |
|4 |1               |50  |75  | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4           |3 |
|3 |1               |50  |75  | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3           |4 |
|2 |1               |50  |75  | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2           |5 |
|1 |1               |50  |75  | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1           |6 |
|13|2               |60  |75  | 2018-01-01 06:00:00 | 2018-01-01 06:59:59|5           |1 |
|12|2               |70  |75  | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2           |2 |
|11|2               |70  |75  | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1           |3 |
|8 |2               |60  |75  | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4           |4 |
|7 |2               |60  |75  | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3           |5 |
|6 |2               |60  |75  | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2           |6 |
|5 |2               |60  |75  | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1           |7 |
+----------------------------------------------------------------------------------------+
The second step adds the unique-group row number and the per-entity row number together. Because the latter is descending, consecutive rows with equal values end up with the same sum, here called tar:

,step2 as
(
select
*
,unique_group+rn tar
from step1
)
Step 2 gives:

+--------------------------------------------------------------------------------------------+
|id|entity_static_id|val1|val2|valid_from           |valid_to            |unique_group|rn|tar|
+--------------------------------------------------------------------------------------------+
|10|1               |60  |75  | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2           |1 |3  |
|9 |1               |60  |75  | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1           |2 |3  |
|4 |1               |50  |75  | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4           |3 |7  |
|3 |1               |50  |75  | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3           |4 |7  |
|2 |1               |50  |75  | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2           |5 |7  |
|1 |1               |50  |75  | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1           |6 |7  |
|13|2               |60  |75  | 2018-01-01 06:00:00 | 2018-01-01 06:59:59|5           |1 |6  |
|12|2               |70  |75  | 2018-01-01 05:00:00 | 2018-01-01 05:59:59|2           |2 |4  |
|11|2               |70  |75  | 2018-01-01 04:00:00 | 2018-01-01 04:59:59|1           |3 |4  |
|8 |2               |60  |75  | 2018-01-01 03:00:00 | 2018-01-01 03:59:59|4           |4 |8  |
|7 |2               |60  |75  | 2018-01-01 02:00:00 | 2018-01-01 02:59:59|3           |5 |8  |
|6 |2               |60  |75  | 2018-01-01 01:00:00 | 2018-01-01 01:59:59|2           |6 |8  |
|5 |2               |60  |75  | 2018-01-01 00:00:00 | 2018-01-01 00:59:59|1           |7 |8  |
+--------------------------------------------------------------------------------------------+
Finally, by using min and max together with a GROUP BY on the right columns, you can find the valid start and end dates:

select
min(id) id
,entity_static_id
,val1
,val2
,min(valid_from) valid_from
,max(valid_to) valid_to
from step2
group by entity_static_id,val1
    ,val2   
    ,tar
order by entity_static_id,valid_from
Putting it all together, the code is:

with entity_dynamic  as
(
select
*
from 
(values
('1','1','50','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('2','1','50','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('3','1','50','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('4','1','50','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('5','2','60','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
,('6','2','60','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
,('7','2','60','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
,('8','2','60','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
,('9','1','60','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('10','1','60','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('11','2','70','75',' 2018-01-01 04:00:00 ',' 2018-01-01 04:59:59')
,('12','2','70','75',' 2018-01-01 05:00:00 ',' 2018-01-01 05:59:59')
,('13','2','60','75',' 2018-01-01 06:00:00 ',' 2018-01-01 06:59:59')
)
 a(id , entity_static_id , val1 , val2 , valid_from , valid_to)
 )
 ,step1 as
 (
 select 
 id , entity_static_id , val1 , val2 , valid_from , valid_to
 ,row_number() over (partition by entity_static_id,val1,val2 order by valid_from) unique_group
 ,ROW_NUMBER() over (partition by entity_static_id order by valid_from desc) rn

 from entity_dynamic 
)
,step2 as
(
select
*
,unique_group+rn tar
from step1
)
select
min(id) id
,entity_static_id
,val1
,val2
,min(valid_from) valid_from
,max(valid_to) valid_to
from step2
group by entity_static_id,val1
    ,val2   
    ,tar
order by entity_static_id,valid_from
The result is:

+------------------------------------------------------------------------+
|id|entity_static_id|val1|val2|valid_from           |valid_to            |
+------------------------------------------------------------------------+
|1 |1               |50  |75  | 2018-01-01 00:00:00 | 2018-01-01 03:59:59|
|10|1               |60  |75  | 2018-01-01 04:00:00 | 2018-01-01 05:59:59|
|5 |2               |60  |75  | 2018-01-01 00:00:00 | 2018-01-01 03:59:59|
|11|2               |70  |75  | 2018-01-01 04:00:00 | 2018-01-01 05:59:59|
|13|2               |60  |75  | 2018-01-01 06:00:00 | 2018-01-01 06:59:59|
+------------------------------------------------------------------------+

If the groups are defined by entity_dynamic itself (one contiguous range per entity), then this is all you need:

with entity_dynamic  as
( select *
  from (values ('1' ,'1','50','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
              ,('2' ,'1','50','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
              ,('3' ,'1','50','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
              ,('4' ,'1','50','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
              ,('5' ,'2','60','75',' 2018-01-01 00:00:00 ',' 2018-01-01 00:59:59')
              ,('6' ,'2','60','75',' 2018-01-01 01:00:00 ',' 2018-01-01 01:59:59')
              ,('7' ,'2','60','75',' 2018-01-01 02:00:00 ',' 2018-01-01 02:59:59')
              ,('8' ,'2','60','75',' 2018-01-01 03:00:00 ',' 2018-01-01 03:59:59')
       ) a(id , entity_static_id , val1 , val2 , valid_from , valid_to) 
) 
, entity_dynamicPlus as 
( select * 
       , ROW_NUMBER() over (partition by entity_static_id order by valid_to asc ) as rnA 
       , ROW_NUMBER() over (partition by entity_static_id order by valid_to desc) as rnD 
   from entity_dynamic 
) 
select eStart.id, eStart.entity_static_id, eStart.val1, eStart.val2, eStart.valid_from, eEnd.valid_to 
from entity_dynamicPlus as eStart 
join entity_dynamicPlus as eEnd 
  on eStart.entity_static_id = eEnd.entity_static_id 
 and eStart.rnA = 1 
 and eEnd.rnD   = 1
order by eStart.entity_static_id

For example, look at ROW_NUMBER() to find duplicate values you can then process. Finding the duplicate records is not the problem; that is easily done with a subquery that counts similar records. My problem is determining which records can be aggregated (given their chronological order) and how to aggregate them. I'll look into this in depth shortly; for now it looks like exactly what I was looking for, thanks! Thank you, this worked out great. Since I couldn't quite figure out how to correctly handle NULL values in the valid_to column, I only aggregated the records without NULL values and then applied a UNION. Overall this shrank the table by 99.73% (record count), which is entirely plausible given that the average number of duplicates per record was close to 1k.