Grouping all related records in a many-to-many relationship (SQL graph connected components)
Hopefully I'm missing a simple solution.
I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many-to-many. What I would like to do is bundle or group all of the companies in table A that have any relationship to a publisher in table B, and vice versa.
The final result would look like the table below, where GROUPID is the key field. Rows 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because publisher Y was already mapped to company A. Row 4 is in the group because company B was already mapped to group 1 through publisher Y.
Put simply, whenever there is any kind of shared relationship between a company and a publisher, that pair should be assigned to the same group.
ROW GROUPID Company Publisher
1 1 A Y
2 1 A X
3 1 B Y
4 1 B Z
5 2 C W
6 2 C P
7 2 D W
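Update:
My bounty version: given the table of simple company and publisher pairs in the fiddle above, populate the GROUPID field above. Think of it as creating a family ID that covers all related parents/children.
SQL Server 2012
You are trying to find all the connected components of a graph, which can only be done iteratively. If you know the maximum width of any connected component, i.e. the maximum number of links from one company/publisher to another, you could in principle do it like this: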
SELECT
MIN(x2.groupID) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN (
SELECT
MIN(x2.Company) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN Table1 AS x2
ON x1.Publisher = x2.Publisher
GROUP BY
x1.Publisher,
x1.Company
) AS x2
ON x1.Company = x2.Company
GROUP BY
x1.Publisher,
x1.Company;
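You would have to keep nesting the subquery joins, alternating between Company and Publisher, up to the maximum iteration depth, with the deepest subquery using MIN(Company) rather than MIN(groupID).
I don't really recommend this, though; it would be cleaner to do this outside SQL.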
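As a rough illustration of what doing this outside SQL might look like, here is a minimal Python sketch based on a union-find (disjoint-set) structure. It is not part of the answer above: the function name `group_pairs` and the `("C", ...)` / `("P", ...)` node tags are assumptions made for the example, and the rows are the sample data from the question.

```python
def group_pairs(pairs):
    """Return a 1-based group id for each (company, publisher) pair."""
    parent = {}

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for company, publisher in pairs:
        # Tag the two kinds of node so that a company and a publisher
        # with the same name are never confused.
        union(("C", company), ("P", publisher))

    group_of_root = {}
    result = []
    for company, _publisher in pairs:
        root = find(("C", company))
        # Number the groups in order of first appearance.
        group_id = group_of_root.setdefault(root, len(group_of_root) + 1)
        result.append(group_id)
    return result


pairs = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
         ("C", "W"), ("C", "P"), ("D", "W")]
groups = group_pairs(pairs)
print(list(zip(groups, pairs)))
```

Each pair is processed a constant number of times, so the run time stays close to linear in the number of rows, and no maximum link depth has to be known in advance.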
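Disclaimer: I know nothing about SQL Server 2012 (or any other version); it may have some extra scripting capabilities that let you do this iteration dynamically.
Here is my solution.
As I thought, the nature of the relationship requires a loop.
Here is the SQL: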
--drop TABLE Table1
CREATE TABLE Table1
([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;
INSERT INTO Table1
(Company, Publisher)
select
left(newid(), 2), left(newid(), 2)
declare @i int = 1
while @i < 8
begin
;with cte(Company, Publisher) as (
select
left(newid(), 2), left(newid(), 2)
from Table1
)
insert into Table1(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)
set @i = @i + 1
end;
CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)
declare @counter int=0
declare @row int=0
declare @lastnullcount int=0
declare @currentnullcount int=0
WHILE EXISTS (
SELECT *
FROM Table1
where GroupID is null
)
BEGIN
SET @counter=@counter+1
SET @lastnullcount =0
SELECT TOP 1
@row=[row]
FROM Table1
where GroupID is null
order by [row] asc
SELECT @currentnullcount=count(*) from table1 where groupid is null
WHILE @lastnullcount <> @currentnullcount
BEGIN
SELECT @lastnullcount=count(*)
from table1
where groupid is null
UPDATE Table1
SET GroupID=@counter
WHERE [row]=@row
UPDATE t2
SET t2.GroupID=@counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.Company=t2.Company
WHERE t1.GroupID=@counter
AND t2.GroupID IS NULL
UPDATE t2
SET t2.GroupID=@counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.publisher=t2.publisher
WHERE t1.GroupID=@counter
AND t2.GroupID IS NULL
SELECT @currentnullcount=count(*)
from table1
where groupid is null
END
END
SELECT * FROM Table1
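Edit:
Added the indexes I would expect on the real table, which also makes it more consistent with the other dataset Roman is using.
I considered using a recursive CTE but, as far as I know, in SQL Server it is not possible to use UNION to connect the anchor member and the recursive member of a recursive CTE (I think this is possible in PostgreSQL), so the duplicates cannot be eliminated.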
declare @i int;  -- the statement before a WITH must be terminated with a semicolon
with cte as (
select
GroupID,
row_number() over(order by Company) as rn
from Table1
)
update cte set GroupID = rn
select @i = @@rowcount
-- while some rows updated
while @i > 0
begin
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Company, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Company
) as T2 on T2.Company = T1.Company
where T1.GroupID > T2.GroupID
select @i = @@rowcount
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Publisher, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Publisher
) as T2 on T2.Publisher = T1.Publisher
where T1.GroupID > T2.GroupID
-- will be > 0 if any rows updated
select @i = @i + @@rowcount
end
;with cte as (
select
GroupID,
dense_rank() over(order by GroupID) as rn
from Table1
)
update cte set GroupID = rn
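The loop above gives every row its own GroupID, then repeatedly pushes the smallest GroupID across rows that share a company and across rows that share a publisher until nothing changes, and finally renumbers the surviving ids. A minimal Python sketch of that same idea, run on the question's sample rows, might look like this (the function name `propagate_min_labels` is hypothetical, and an in-memory list stands in for Table1):

```python
def propagate_min_labels(pairs):
    """Label each (company, publisher) row by min-label propagation."""
    # Start with a distinct label per row (the row_number() step).
    labels = list(range(1, len(pairs) + 1))
    changed = True
    while changed:
        changed = False
        for column in (0, 1):  # 0 = company, 1 = publisher
            # Smallest label currently held by any row with this key.
            smallest = {}
            for pair, label in zip(pairs, labels):
                key = pair[column]
                smallest[key] = min(label, smallest.get(key, label))
            # Pull every row down to the smallest label of its key.
            for i, pair in enumerate(pairs):
                if labels[i] > smallest[pair[column]]:
                    labels[i] = smallest[pair[column]]
                    changed = True
    # Renumber the surviving labels 1..n (the dense_rank() step).
    rank = {label: r for r, label in enumerate(sorted(set(labels)), start=1)}
    return [rank[label] for label in labels]


pairs = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
         ("C", "W"), ("C", "P"), ("D", "W")]
labels = propagate_min_labels(pairs)
print(labels)
```

The number of passes grows with the length of the longest chain of links inside a group, which is why long, thin groups are the slow case for this approach.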
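I also tried a breadth-first search algorithm. I thought it could be faster, since it is better in terms of complexity, so I'll provide that solution here as well. I found that it is not faster than the SQL approach, though: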
declare @Company nvarchar(2), @Publisher nvarchar(2), @GroupID int
declare @Queue table (
Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
primary key(Company, Publisher)
)
select @GroupID = 0
while 1 = 1
begin
select top 1 @Company = Company, @Publisher = Publisher
from Table1
where GroupID is null
if @@rowcount = 0 break
select @GroupID = @GroupID + 1
insert into @Queue(Company, Publisher)
select @Company, @Publisher
while 1 = 1
begin
select top 1 @Company = Company, @Publisher = Publisher
from @Queue
order by ID asc
if @@rowcount = 0 break
update Table1 set
GroupID = @GroupID
where Company = @Company and Publisher = @Publisher
delete from @Queue where Company = @Company and Publisher = @Publisher
;with cte as (
select Company, Publisher from Table1 where Company = @Company and GroupID is null
union all
select Company, Publisher from Table1 where Publisher = @Publisher and GroupID is null
)
insert into @Queue(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from @Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
end
end
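For comparison, the same breadth-first search can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code that was timed: `bfs_groups` is a hypothetical name, and the two dictionaries play the role that indexes on Company and Publisher would play in the table.

```python
from collections import defaultdict, deque


def bfs_groups(pairs):
    """Return a 1-based group id per row using breadth-first search."""
    # Row numbers by company and by publisher, for quick neighbour lookup.
    by_company = defaultdict(list)
    by_publisher = defaultdict(list)
    for i, (company, publisher) in enumerate(pairs):
        by_company[company].append(i)
        by_publisher[publisher].append(i)

    group = [None] * len(pairs)
    next_id = 0
    for start in range(len(pairs)):
        if group[start] is not None:
            continue  # already reached from an earlier row
        next_id += 1
        group[start] = next_id
        queue = deque([start])
        while queue:
            row = queue.popleft()
            company, publisher = pairs[row]
            # Neighbours are rows sharing this row's company or publisher.
            for neighbour in by_company[company] + by_publisher[publisher]:
                if group[neighbour] is None:
                    group[neighbour] = next_id
                    queue.append(neighbour)
    return group


pairs = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
         ("C", "W"), ("C", "P"), ("D", "W")]
bfs_result = bfs_groups(pairs)
print(bfs_result)
```

Every row is enqueued at most once, which is the complexity advantage mentioned above; in T-SQL that advantage is offset by the per-row overhead of the loop.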
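I've tested my version and Gordon Linoff's to check how they perform. It looks like the CTE is much worse; I couldn't wait for it to finish on more than 1000 rows.
Here is an example with random data. My results were:
128 rows:
my RBAR solution: 190ms
my SQL solution: 27ms
Gordon Linoff's solution: 958ms
256 rows:
my RBAR solution: 560ms
my SQL solution: 1226ms
Gordon Linoff's solution: 45371ms
This is random data, so the results may not be very consistent. I think the timings could be changed by indexes, but I don't think that would change the overall picture.
Old version, using a temporary table and only calculating GroupID without touching the initial table: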
declare @i int
-- creating table to gather all possible GroupID for each row
create table #Temp
(
Company varchar(1), Publisher varchar(1), GroupID varchar(1),
primary key (Company, Publisher, GroupID)
)
-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1
select @i = @@rowcount
-- while some rows inserted into #Temp
while @i > 0
begin
-- expand #Temp in both directions
;with cte as (
select
T2.Company, T1.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Company = T1.Company
union
select
T1.Company, T2.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Publisher = T1.Publisher
), cte2 as (
select
Company, Publisher,
case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
from cte
)
insert into #Temp
select Company, Publisher, GroupID
from cte2
-- don't insert duplicates
except
select Company, Publisher, GroupID
from #Temp
-- will be > 0 if any row inserted
select @i = @@rowcount
end
select
Company, Publisher,
dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher
Here is a recursive solution using XML:
with a as ( -- recursive result, containing shorter subsets and duplicates
select cast('<c>' + company + '</c>' as xml) as companies
,cast('<p>' + publisher + '</p>' as xml) as publishers
from Table1
union all
select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
sql:column("t.company")))
order by $c
return <c>{$c}</c>')
,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
sql:column("t.publisher")))
order by $p
return <p>{$p}</p>')
from a join Table1 t
on ( a.companies.exist('/c[text() = sql:column("t.company")]') = 0
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
and ( a.companies.exist('/c[text() = sql:column("t.company")]') = 1
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
select distinct -- distinct cannot work on xml types, hence cast to nvarchar
cast(companies as nvarchar) as companies
,cast(publishers as nvarchar) as publishers
,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
from a
where not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.companies as varchar)
+ '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
).value('if((count(/s/c) > count(/a/c))
and (some $s in /s/c/text() satisfies
(some $a in /a/c/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
and not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.publishers as nvarchar)
+ '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
).value('if((count(/s/p) > count(/a/p))
and (some $s in /s/p/text() satisfies
(some $a in /a/p/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
), c as ( -- cast back to xml
select cast(companies as xml) as companies
,cast(publishers as xml) as publishers
,groupid
from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
,c.groupid
from c
cross apply companies.nodes('/c') as Co(company)
cross apply publishers.nodes('/p') as Pu(publisher)
where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
where t.company = Co.company.value('(./text())[1]', 'varchar')
and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
)
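In the intermediate steps, the sets of companies and publishers are kept in XML fields, and some casting between XML and nvarchar is needed because of SQL Server limitations (such as not being able to group on, or use distinct with, XML columns).
Your problem is a graph-walking problem of finding connected subgraphs. It is a bit more challenging because your data structure has two types of nodes (companies and publishers) rather than one.
You can solve this with a single recursive CTE. The logic is as follows.
First, convert the problem into a graph with only one type of node. I do this by making the nodes companies and the edges links between companies that share a publisher. This is just a join:
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
For efficiency, you could also add t1.company <> t2.company, but that is not strictly necessary.
Now this is a "simple" graph-walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks the graph using a join. Along the way, it keeps a list of all the nodes it has visited. In SQL Server, this list has to be stored in a string.
The code needs to ensure that it does not visit a node twice on a given path, because that can result in infinite recursion (and an error). If the join above is called edges, the CTE that generates all pairs of connected nodes looks like: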
cte as (
select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
-- skip nodes already on this path; the '|' delimiters keep 'A' from matching inside 'AB'
c.nodes not like '%|'+e.node2+'|%'
)
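Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier of the connected subgraph. That is, all companies connected to each other through publishers will have the same minimum.
The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.
The full (and, I might add, tested) query looks like: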
with edges as (
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
),
cte as (
select e.node1, e.node2,
cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2,
c.nodes+e.node2+'|',
1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
-- skip nodes already on this path; the '|' delimiters prevent partial-name matches
c.nodes not like '%|'+e.node2+'|%'
),
nodes as (
select node1,
(case when min(node2) < node1 then min(node2) else node1 end
) as grp
from cte
group by node1
)
select t.company, t.publisher, grp.GroupId
from table1 t join
(select n.node1, dense_rank() over (order by grp) as GroupId
from nodes n
) grp
on t.company = grp.node1;
Note that this works for finding any connected subgraphs. It does not assume any particular number of levels.
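The three steps of this answer (link companies that share a publisher, take the smallest reachable company as the group representative, then number the representatives) can be sketched outside SQL as follows. This is only an illustration under assumptions: `company_groups` is a hypothetical name, and a plain set of visited nodes replaces the delimited string the recursive CTE has to carry.

```python
from collections import defaultdict


def company_groups(pairs):
    """Map each company to a 1-based group id via company-to-company links."""
    # Step 1: two companies are linked when they share a publisher.
    companies_of = defaultdict(set)
    for company, publisher in pairs:
        companies_of[publisher].add(company)
    neighbours = defaultdict(set)
    for members in companies_of.values():
        for company in members:
            neighbours[company] |= members  # includes the company itself

    # Step 2: the representative is the smallest reachable company name.
    def smallest_reachable(start):
        seen = {start}
        stack = [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return min(seen)

    representative = {c: smallest_reachable(c) for c in list(neighbours)}

    # Step 3: number the representatives 1..n (the dense_rank() step).
    ordered = sorted(set(representative.values()))
    rank = {rep: r for r, rep in enumerate(ordered, start=1)}
    return {c: rank[rep] for c, rep in representative.items()}


pairs = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
         ("C", "W"), ("C", "P"), ("D", "W")]
by_company = company_groups(pairs)
print(by_company)
```

Joining `by_company` back to the pairs on the company name gives the per-row GroupId, which corresponds to the final join in the query.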
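EDIT:
The question of performance here is vexing. At a minimum, the query above will run better with an index on Publisher. Better yet is to take @MikaelEriksson's suggestion and put the edges in a separate table.
Another question is whether you are looking for equivalence classes among the companies or among the publishers. I took the approach of using companies, because I think that is easier to explain. (My inclination to respond at all was based on the numerous comments claiming that this could not be done with CTEs.)
I am guessing that you can get reasonable performance from this, although that requires more knowledge of your data and system than the OP provides. It is quite likely, though, that the best performance will come from a multi-query approach.
A bit late to the challenge, and since SQLFiddle seems to be down I have had to guess at your data structures. Nevertheless, it seemed like a fun challenge, and this is what I made of it:
Setup: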
IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO
CREATE TABLE t_company ( company_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
company_name varchar(100) NOT NULL)
GO
CREATE TABLE t_publisher (publisher_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
publisher_name varchar(100) NOT NULL)
CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
PRIMARY KEY (company_id, publisher_id),
group_id int NULL
)
GO
-- example content
-- ROW GROUPID Company Publisher
--1 1 A Y
--2 1 A X
--3 1 B Y
--4 1 B Z
--5 2 C W
--6 2 C P
--7 2 D W
INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')
INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
FROM t_company, t_publisher
WHERE (company_name = 'A' AND publisher_name = 'Y')
OR (company_name = 'A' AND publisher_name = 'X')
OR (company_name = 'B' AND publisher_name = 'Y')
OR (company_name = 'B' AND publisher_name = 'Z')
OR (company_name = 'C' AND publisher_name = 'W')
OR (company_name = 'C' AND publisher_name = 'P')
OR (company_name = 'D' AND publisher_name = 'W')
GO
/*
-- volume testing
TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher
DECLARE @company_count int = 1000,
@publisher_count int = 450,
@links_count int = 800
INSERT t_company (company_name)
SELECT company_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, @company_count)
UPDATE STATISTICS t_company
INSERT t_publisher (publisher_name)
SELECT publisher_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, @publisher_count)
UPDATE STATISTICS t_publisher
-- Random links between the companies & publishers
DECLARE @count int
SELECT @count = 0
WHILE @count < @links_count
BEGIN
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
INTO #link_A
FROM t_company
ORDER BY NewID()
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
INTO #link_B
FROM t_publisher
ORDER BY NewID()
INSERT TOP (@links_count - @count) t_link (company_id, publisher_id)
SELECT A.company_id,
B.publisher_id
FROM #link_A A
JOIN #link_B B
ON A.row_id = B.row_id
WHERE NOT EXISTS ( SELECT *
FROM t_link old
WHERE old.company_id = A.company_id
AND old.publisher_id = B.publisher_id)
SELECT @count = @count + @@ROWCOUNT
DROP TABLE #link_A
DROP TABLE #link_B
END
*/
Actual grouping:
IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO
-- apply grouping
-- init
SELECT row_id = IDENTITY(int, 1, 1),
company_id,
publisher_id,
group_id = 0
INTO #links
FROM t_link
-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)
--CREATE INDEX idx2 ON #links (publisher_id)
UPDATE #links
SET group_id = row_id
-- start grouping
WHILE @@ROWCOUNT > 0
BEGIN
UPDATE #links
SET group_id = new_group_id
FROM #links upd
CROSS APPLY (SELECT new_group_id = Min(group_id)
FROM #links new
WHERE new.company_id = upd.company_id
OR new.publisher_id = upd.publisher_id
) x
WHERE upd.group_id > new_group_id
-- select * from #links
END
-- remove 'holes'
UPDATE #links
SET group_id = (SELECT COUNT(DISTINCT o.group_id)
FROM #links o
WHERE o.group_id <= upd.group_id)
FROM #links upd
GO
UPDATE t_link
SET group_id = new.group_id
FROM t_link upd
LEFT OUTER JOIN #links new
ON new.company_id = upd.company_id
AND new.publisher_id = upd.publisher_id
GO
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
l.group_id,
c.company_name, -- c.company_id,
p.publisher_name -- , p.publisher_id
from t_link l
JOIN t_company c
ON l.company_id = c.company_id
JOIN t_publisher p
ON p.publisher_id = l.publisher_id
ORDER BY 1
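At first sight this approach hasn't been tried by anyone else yet; it is interesting to see how this can be done in a variety of ways. (I preferred not to read the other answers beforehand, as that would spoil the puzzle =)
As far as I understand the requirements, the results look as expected on the example, and performance isn't too bad either, although there is no real indication of how many records this should handle. I'm not sure how it will scale, but I don't expect too many problems.
Comments:
Which version of SQL Server are you using?
I'm going to undo the edit you just made and redo it in a simpler way, using the "code sample" button.
@GoatCO I've already submitted a rollback.
Oops, thanks. I'll get it right next time :)
@SpectralGhost Yes, this isn't my question, I'm just using it to get some ideas. Child/parent problems usually use a recursive CTE to join consecutive pairs, but I have never seen an example that groups the results from both directions into a "family".
As you can see in my answer, a recursive solution is possible.
I also thought about using a CTE, but SQL Server has a limit on recursion depth; not sure whether the OP would hit it.
You can use OPTION (MAXRECURSION 0). The main problem is how to eliminate the duplicates.
MAXRECURSION 0 is my friend; that part doesn't bother me.
@RomanPekar I don't want to get into a performance war. But your table structure (with the primary key on company and publisher) is optimized for your approach. When I tested on SQLFiddle with an identity primary key instead (so no index on those fields), I saw somewhat different timings. But it is really nice to have the timings put together.
Since you are the one doing the performance tests, I'll post my comment here. @GordonLinoff's solution suffers because the edges CTE is executed for every invocation of the recursive part of the CTE. It would be better to store the edges in a properly indexed temporary table and use that in the recursive CTE. I don't know by how much, though.
@GordonLinoff Check my updated answer with some performance tests; I don't think it can handle more than 10000 rows if the groups are long. But I have no time to test today.
Thanks for the comprehensive answer. I would split the bounty if that were possible, but because of Roman's efforts on performance testing I'm going with his answer.
@GoatCO Unless the question is closed, new answers may appear, and up- and downvotes also change the order. Your comment may then lose its real value. When referring to another answer, you can link to its "share" URL and avoid terms like "other", "above" and "below". In SQL terms, it's like recording "birthday" rather than "age": you refer to something immutable rather than something variable.
@GoatCO FYI, after performance testing, my solution performs significantly better on my local SQL Server instance than any of the CTE examples I've seen in this thread. Let Roman add it to his performance comparison.
I've run all the answers through a set of examples, from 1000 records to 1 million records, but I'll revisit it.
I don't have much time right now; here is a quick check of the two solutions -
@RomanPekar What's really strange is that I get completely different results running it on my local SQL Server instance. Scale up the rows and you really start to see the difference. Not sure why SQL Fiddle is so different.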