SQL Server: Transforming raw data into relational data

sql-server stored-procedures

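A raw data dump was loaded straight into a table for me. Now I need to turn this mess into something useful. The dump has duplicates and inconsistencies... good times.

I've been trying various approaches so far :( - hopefully you can help me.

Given this sample data set: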

ExcelDump
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 |      |      | C    |
|  1 |      | B    | C    |
|  1 | A    | B    | D    |
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | A    | B    | C    |
|  4 | G    | B    | C    |
+----+------+------+------+

One possible result could be:

OutputTable
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+

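Nice and tidy: a unique ID key, with the data merged in a sensible way.

How do I pick the right data? You may have noticed that another possible result could be: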

OutputTable
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+

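This is where it gets complicated. I want to be able to choose the set that makes the most sense, based on conditions I can control.

For example, I'd like to set a condition that says: "Pick the most common (non-null) value; if there is no single most common value, take the first non-null value found." This condition should be applied to the selection grouped by ID. The result in this case is: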

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+

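If I later discover that this assumption was wrong, the condition should instead be: "Pick the most common (non-null) value; if there is no single most common value, take the last non-null value found."

So basically I want to select values based on a set of conditions applied to each group of IDs.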
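
To make the two conditions concrete, here is a minimal sketch of the rule outside the database, written in Python rather than T-SQL so it can be run and checked on its own. The names `ROWS`, `pick_value`, and `merge_rows` are hypothetical and only serve the illustration; `None` stands in for NULL or empty cells, and "first/last found" is assumed to mean the order in which the rows are listed.

```python
from collections import Counter

# Sample rows from the ExcelDump table in the question: (ID, Col1, Col2, Col3).
# None stands in for NULL / empty cells.
ROWS = [
    (1, None, None, "C"),
    (1, None, "B", "C"),
    (1, "A", "B", "D"),
    (1, "E", "B", "C"),
    (2, "A", "B", "C"),
    (2, "A", "B", "C"),
    (3, "A", "B", "C"),
    (3, "A", "B", "F"),
    (4, "A", "B", "C"),
    (4, "G", "B", "C"),
]


def pick_value(values, tie_break="first"):
    """Return the most common non-null value; on a tie, the first or last one found."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    counts = Counter(non_null)
    top = max(counts.values())
    # Keep only the values that share the highest count, in row order.
    candidates = [v for v in non_null if counts[v] == top]
    return candidates[0] if tie_break == "first" else candidates[-1]


def merge_rows(rows, tie_break="first"):
    """Group rows by ID (first field) and apply pick_value to each remaining column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row[1:])
    return [
        (key, *(pick_value(col, tie_break) for col in zip(*grp)))
        for key, grp in sorted(groups.items())
    ]


for row in merge_rows(ROWS, "first"):
    print(row)
```

With `tie_break="first"` this yields A, B, C for every ID; switching to `tie_break="last"` gives the other candidate result (E/B/C, A/B/C, A/B/F, G/B/C).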

As stated, you can do this with a simple GROUP BY:

SELECT 
    id, 
    Col1 = MAX(Col1),
    Col2 = MAX(Col2),
    Col3 = MAX(Col3)
FROM
   ExcelDump
GROUP BY
   id

This pattern gives you, for each id, the highest non-null value in each column.
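
As a rough check of what "highest non-null value" means for the sample data, the following Python sketch mimics the `MAX` per `id` aggregation. It is not the T-SQL itself: `ROWS` and `max_per_id` are hypothetical names, `None` stands in for NULL, and plain string comparison is assumed to match the column's collation for these single uppercase letters.

```python
# Sample rows from the ExcelDump table in the question: (ID, Col1, Col2, Col3).
ROWS = [
    (1, None, None, "C"),
    (1, None, "B", "C"),
    (1, "A", "B", "D"),
    (1, "E", "B", "C"),
    (2, "A", "B", "C"),
    (2, "A", "B", "C"),
    (3, "A", "B", "C"),
    (3, "A", "B", "F"),
    (4, "A", "B", "C"),
    (4, "G", "B", "C"),
]


def max_per_id(rows):
    """Per ID, take the highest non-null value of each column (like MAX ... GROUP BY id)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row[1:])
    result = []
    for key, grp in sorted(groups.items()):
        merged = []
        for col in zip(*grp):
            non_null = [v for v in col if v is not None]  # MAX ignores NULLs
            merged.append(max(non_null) if non_null else None)
        result.append((key, *merged))
    return result


for row in max_per_id(ROWS):
    print(row)
```

Under these assumptions ID 1 comes out as E, B, D: `MAX` picks the alphabetically highest value, not the most common one, so Col3 is D even though C appears three times.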

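I've revised my solution to take into account the extra information added to the question. The query below implements the second sort priority you specified. To get the first sort priority, change "max" to "min" in the outer applies and change "sortOrder desc" to "sortOrder asc". Keep in mind that if several values tie for most frequent, for example A, A, B, B, C with A appearing first, the code below will return B, because it has the highest count and comes after the two As.
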
-- setup test table
create table ExcelDump(
    id int
,   Col1 char(1)
,   Col2 char(1)
,   Col3 char(1)
)

insert into ExcelDump values(1,null,null,'C')
insert into ExcelDump values(1,null,'B','C')
insert into ExcelDump values(1,'A','B','D')
insert into ExcelDump values(1,'E','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(3,'A','B','C')
insert into ExcelDump values(3,'A','B','F')
insert into ExcelDump values(4,'A','B','C')
insert into ExcelDump values(4,'G','B','C')

-- create temp tables to make it easier to debug
select distinct
    id
into #distinct
from ExcelDump

-- number order isn't guaranteed but should be sorting them as first come first serve from the original table if no indexes exist
select
    row_number() over(order by (select 1)) as numberOrder
,   ID
,   Col1
,   Col2
,   Col3
into #sorted
from ExcelDump

-- actual query
select
    ui.Id
,   col1.Col1
,   col2.Col2
,   col3.Col3
from #distinct ui
  outer apply (
        select top 1
            ed.Col1
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col1 is not null -- ignore nulls
        group by ed.Col1
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col1
  outer apply (
        select top 1
            ed.Col2
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col2 is not null -- ignore nulls
        group by ed.Col2
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col2
  outer apply (
        select top 1
            ed.Col3
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col3 is not null -- ignore nulls
        group by ed.Col3
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col3
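
The tie caveat can be traced by hand with a small Python sketch that mimics the ordering used inside each outer apply (group by value, order by count descending, then by row number). This is only an illustration: `top_value` and its `last_wins` flag are hypothetical names, and list position is assumed to play the role of `numberOrder`.

```python
def top_value(values, last_wins=True):
    """Mimic 'select top 1 ... group by value order by cnt desc, sortOrder'."""
    stats = {}  # value -> [count, first position, last position]
    for position, value in enumerate(values, start=1):
        if value is None:  # ignore nulls, as the where clause does
            continue
        entry = stats.setdefault(value, [0, position, position])
        entry[0] += 1
        entry[2] = position
    if not stats:
        return None  # outer apply yields NULL when no row qualifies
    if last_wins:
        # order by cnt desc, max(numberOrder) desc
        return max(stats, key=lambda v: (stats[v][0], stats[v][2]))
    # order by cnt desc, min(numberOrder) asc
    return max(stats, key=lambda v: (stats[v][0], -stats[v][1]))


print(top_value(["A", "A", "B", "B", "C"]))                   # last-found tie-break
print(top_value(["A", "A", "B", "B", "C"], last_wins=False))  # first-found tie-break
```

For A, A, B, B, C the last-found variant returns B and the first-found variant returns A, which matches the behaviour described for the query.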

You could also use a cursor to iterate over the temporary ExcelDump table and filter each row. You can store the filtered results in another temporary table, which can have its own constraints such as UNIQUE or NOT NULL if necessary, and by using a cursor you can write dedicated code for every validation you need.

How did you choose the value B for Col1 for ID=3 instead of A?

@Lamak also, why is Col1 A for ID=1 and not NULL?

@ConradFrix - I just assumed that the OP was grouping by Id, and for Id=1 there is only one row with a column value different from NULL or empty. But that's not the case for Id=3.

@Lamak I don't handle that at the moment. I suspect it is an error in the input data that has to be corrected or removed. Not sure why I included it :/

@Lamak you could also assume that the "last row" for a given ID is what's wanted. That gives {3, B, Null, Null}, which indirectly shows that some undefined requirements need to be clarified.

This looks like something I can use. Performance is not an issue, since this will be a transformation step in an ETL. I updated the question with more details; do you still think this is the best approach?

I've revised my answer to take your latest question into account. I wouldn't say this is necessarily the "best" way, since there's more than one way to skin a cat. It's just one way that works.