SQL Server: how to find whether part of a string occurs more than once
I have a task to do in SQL Server. I have a table with two columns (ID, Sortkey), as shown below:
ID Sortkey
1 00
2 01
3 0101
4 0102
5 02
6 03
7 0301
8 030101
9 04
10 0401
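If any other string in the table starts with the same string, I need to write "+" in front of the string; if not, I need to write "-".
The output should look like this: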
ID Sortkey
1 -00
2 +01
3 -0101
4 -0102
5 -02
6 +03
7 +0301
8 -030101
9 +04
10 -0401
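I tried using count(*), but I don't know how to count whether there are records whose string starts with the same part. I imagine a solution with two cases: check whether any other string begins with the whole string I am currently looking at and, if so, return + in front of the string; otherwise return -.
Thanks a lot.
You can use OUTER APPLY to count the similar rows, then prefix + when the count is greater than 0 and - otherwise: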
DECLARE @T TABLE (ID INT, SortKey VARCHAR(10));
INSERT @T (ID, SortKey)
VALUES
(1, '00'), (2, '01'), (3, '0101'), (4, '0102'), (5, '02'),
(6, '03'), (7, '0301'), (8, '030101'), (9, '04'), (10, '0401');
SELECT T1.ID, SortKey = CASE WHEN d.SimilarKeys > 0 THEN '+' ELSE '-' END + T1.SortKey
FROM @T AS T1
OUTER APPLY
( SELECT COUNT(*)
FROM @T AS T2
WHERE T2.SortKey LIKE T1.SortKey + '%'
AND T2.ID != T1.ID
) AS d (SimilarKeys);
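Since the query above only tests whether the count is greater than 0, the same check can also be written with EXISTS. This is a minimal sketch of that variant, not part of the original answer; it reuses the sample data from the question and assumes the keys contain only digits, so nothing in T1.SortKey is interpreted as a LIKE wildcard.

```sql
-- Sketch: EXISTS variant of the OUTER APPLY query, on the question's sample data.
DECLARE @T TABLE (ID INT, SortKey VARCHAR(10));
INSERT @T (ID, SortKey)
VALUES
    (1, '00'), (2, '01'), (3, '0101'), (4, '0102'), (5, '02'),
    (6, '03'), (7, '0301'), (8, '030101'), (9, '04'), (10, '0401');

SELECT T1.ID,
       SortKey = CASE WHEN EXISTS
                      (   -- any other row whose key starts with this row's key
                          SELECT 1
                          FROM @T AS T2
                          WHERE T2.SortKey LIKE T1.SortKey + '%'
                          AND T2.ID <> T1.ID
                      )
                      THEN '+' ELSE '-' END + T1.SortKey
FROM @T AS T1
ORDER BY T1.ID;
```

On this data the result should match the expected output in the question (-00, +01, -0101, -0102, -02, +03, +0301, -030101, +04, -0401).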
Performance testing
I have already commented on performance under another answer, so I thought it best to at least include how I tested it:
IF OBJECT_ID(N'dbo.T', 'U') IS NOT NULL DROP TABLE dbo.T;
CREATE TABLE dbo.T(ID INT NOT NULL PRIMARY KEY, SortKey VARCHAR(10));
INSERT dbo.T (ID, SortKey)
SELECT TOP 100000
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),
RIGHT('0000' + CONVERT(VARCHAR(10), FLOOR(RAND(CHECKSUM(NEWID())) * 10000)),
CEILING(RAND(CHECKSUM(NEWID())) * 8))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;
The queries I used for testing were:
Query 1
SELECT COUNT(CASE WHEN d.SimilarKeys > 0 THEN '+' ELSE '-' END + T1.SortKey)
FROM dbo.T AS T1
OUTER APPLY
( SELECT COUNT(*)
FROM dbo.T AS T2
WHERE T2.SortKey LIKE T1.SortKey + '%'
AND T2.ID != T1.ID
) AS d (SimilarKeys);
Query 2
WITH cte AS
(
SELECT t1.ID, t1.Sortkey,
SUM(CASE WHEN t2.sortkey like t1.sortkey + '%' THEN 1 ELSE 0 END)
OVER (PARTITION BY t1.ID) AS ContainsCount,
ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.id) AS rnr
FROM dbo.T AS t1
LEFT JOIN dbo.T AS t2
ON t1.ID <> t2.ID
)
SELECT COUNT(CASE WHEN ContainsCount > 0 THEN '+' ELSE '-' END + Sortkey) AS Sortkey
FROM cte
WHERE rnr = 1;
In this case Query 1 took about 14 seconds on each run, whereas I gave up on Query 2 after 10 minutes. I reduced the table to 1,000 rows and Query 2 finally ran to completion (in 8 seconds). Looking at the IO for the same two queries makes it obvious why it performs so badly:
Query 1
Table 'T'. Scan count 1001, logical reads 2031
Query 2
Table 'Worktable'. Scan count 15, logical reads 2061426
Table 'Worktable'. Scan count 0, logical reads 0
Table 'T'. Scan count 8, logical reads 25
So Query 2 needs about 2 million reads for just 1,000 records, which explains the slow performance.
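The read counts quoted above have the shape of the messages SQL Server prints when SET STATISTICS IO is enabled. The answer does not show the exact commands it used, so the following is only a sketch of how such figures can be collected, assuming the dbo.T test table from the setup script already exists.

```sql
-- Assumption: dbo.T was created and populated by the test setup script.
SET STATISTICS IO ON;    -- report scan count and logical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time per statement

SELECT COUNT(CASE WHEN d.SimilarKeys > 0 THEN '+' ELSE '-' END + T1.SortKey)
FROM dbo.T AS T1
OUTER APPLY
(   SELECT COUNT(*)
    FROM dbo.T AS T2
    WHERE T2.SortKey LIKE T1.SortKey + '%'
    AND T2.ID != T1.ID
) AS d (SimilarKeys);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

In SQL Server Management Studio the figures appear on the Messages tab, not in the result grid.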
This SQL may not be the most efficient on large tables, but it produces the desired result:
-- The count includes the row itself, so > 1 means another key starts with this one.
SELECT CASE WHEN (SELECT COUNT(*) FROM Table1 T2 WHERE T2.SortKey LIKE T1.SortKey + '%') > 1 THEN '+' ELSE '-' END + T1.SortKey AS SortKey
FROM Table1 T1
You can try using a subselect:
Update t1
set sortkey = CONCAT(
(CASE WHEN (
SELECT count(*)
from @table t2
where t2.SortKey like Concat(t1.SortKey, '%')
) > 1
THEN '+'
ELSE '-'
END)
, sortKey)
from @table t1
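The basic idea is to count all rows whose SortKey matches the current row's SortKey followed by %. The row always matches itself, which is why the count is compared with > 1.
This means that if two rows have the same sort key, both of them get +.
If you want to avoid that, add
and t2.sortkey <> t1.sortkey
at the end of the WHERE clause of the subselect. The row then no longer counts itself, so compare the count with > 0 instead of > 1.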
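To make the remark about duplicate sort keys concrete, here is a small sketch that runs both variants side by side. The table variable @D, its rows, and the column aliases CountAll and CountOthers are hypothetical and chosen only for illustration: the key '05' appears twice and no longer key starts with it.

```sql
-- Hypothetical sample data: '05' is duplicated, '06' has a longer key below it.
DECLARE @D TABLE (ID INT, SortKey VARCHAR(10));
INSERT @D (ID, SortKey)
VALUES (1, '05'), (2, '05'), (3, '06'), (4, '0601');

SELECT d1.ID,
       -- counts every match, including the row itself and exact duplicates
       CountAll = CASE WHEN (SELECT COUNT(*)
                             FROM @D AS d2
                             WHERE d2.SortKey LIKE d1.SortKey + '%') > 1
                       THEN '+' ELSE '-' END + d1.SortKey,
       -- counts only rows whose key differs from the current one
       CountOthers = CASE WHEN (SELECT COUNT(*)
                                FROM @D AS d2
                                WHERE d2.SortKey LIKE d1.SortKey + '%'
                                AND d2.SortKey <> d1.SortKey) > 0
                          THEN '+' ELSE '-' END + d1.SortKey
FROM @D AS d1
ORDER BY d1.ID;
```

With this data, CountAll marks both '05' rows with + even though no longer key starts with '05', while CountOthers marks them with -. Both columns give +06 and -0601.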