Counting similar words between two strings in SQL


I'm planning to write a T-SQL function that takes two input strings and outputs a word-similarity percentage, for example:

SELECT [dbo].[FN_CalcSimilarWords]('Golden horses hotel','Hotel Golden Horses')
Returns:

3/3
2/3

I'm thinking of parsing the strings, then looping over the words and comparing them. Is there a better way?

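If you want to do this in SQL, I would take the following approach.

Use a split routine to create two temp tables, called #Words1 and #Words2.

Now join the tables and get the count, i.e.

select count(*) 
from #Words1 w1 
join #Words2 w2 on w1.word=w2.word

Let SQL do this the way it is optimized to do it.

Here is how to get the counts from both tables: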

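-- FULL OUTER JOIN keeps unmatched words from both sides, so the three counts are independent.
-- Matches counts only words that found a partner in #Words2, so it no longer always equals FromW1.
select count(distinct case when w2.word is not null then w1.word end) as Matches,
       count(distinct w1.word) as FromW1,
       count(distinct w2.word) as FromW2
    from #Words1 w1 
    full outer join #Words2 w2 on w1.word=w2.word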
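
The answer does not show the split routine. As a rough sketch only, assuming SQL Server 2016 or later (where the built-in `STRING_SPLIT` exists) and using a hypothetical second input string, the two temp tables could be filled and compared like this:

```sql
-- Assumption: SQL Server 2016+ with compatibility level 130 or higher (STRING_SPLIT).
-- On older versions, substitute your own split routine.
DECLARE @s1 varchar(2000) = 'Golden horses hotel',
        @s2 varchar(2000) = 'Golden Horses Malaysia';  -- hypothetical second string

-- Drop leftovers from a previous run in the same session
IF OBJECT_ID('tempdb..#Words1') IS NOT NULL DROP TABLE #Words1;
IF OBJECT_ID('tempdb..#Words2') IS NOT NULL DROP TABLE #Words2;

-- One row per word; empty tokens produced by repeated spaces are skipped
SELECT value AS word INTO #Words1 FROM STRING_SPLIT(@s1, ' ') WHERE value <> '';
SELECT value AS word INTO #Words2 FROM STRING_SPLIT(@s2, ' ') WHERE value <> '';

-- Distinct words present in both strings
SELECT COUNT(*) AS Matches
FROM (SELECT word FROM #Words1
      INTERSECT
      SELECT word FROM #Words2) AS common;
```

Whether 'horses' and 'Horses' count as the same word depends on the collation in effect; a case-insensitive collation treats them as equal.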

If there are no restrictions on deploying CLR assemblies, you could try that route and compare the performance.

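If you are not concerned about an exact count of the common items, you can use SQL Server's full-text search feature. The CONTAINSTABLE and FREETEXTTABLE functions (the table-valued counterparts of CONTAINS and FREETEXT) both return a rank. See the SQL Server full-text search documentation for details.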
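
As an illustration only, a ranked lookup might look like the sketch below. The table `dbo.Hotels`, its columns, and the full-text index on `Name` are assumptions made for the example and do not come from the question.

```sql
-- Assumption: a hypothetical table dbo.Hotels(Id int PRIMARY KEY, Name nvarchar(200))
-- with a full-text index defined on the Name column.
SELECT h.Id,
       h.Name,
       ft.[RANK]              -- relative relevance score, not a percentage
FROM FREETEXTTABLE(dbo.Hotels, Name, N'Golden horses hotel') AS ft
JOIN dbo.Hotels AS h
    ON h.Id = ft.[KEY]        -- KEY holds the full-text key column of the matching row
ORDER BY ft.[RANK] DESC;
```

The rank is only meaningful for ordering rows within one query, so it suits "find the closest names" better than "report 2/3".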

Original answer:

I came across this technique at school.

Edit:

Modified to address the issue raised in @t-clausen.dk's comment:

MS SQL Server 2012 Schema Setup:

CREATE TABLE StringTable 
(
    Id INT IDentity,
    String varchar(max)
)

INSERT INTO StringTable
VALUES ('xx xx Golden horses Malaysia'),
        ('xx xx xx xx xx');

Query 1:

WITH StringsCTE 
AS
(
    -- Anchor member: split off the first word of each string
    SELECT ID,String As StringValue, 
            CASE CHARINDEX(' ', String)
                WHEN 0 THEN String
                ELSE LEFT(String, CHARINDEX(' ',String) -1)
            END AS Word,
            1 as Position,
            CASE CHARINDEX(' ',String)
                WHEN 0 THEN ''
                ELSE RIGHT(String, LEN(String) - CHARINDEX(' ',String))
            END AS RestOfLine
    FROM StringTable
    UNION ALL

    -- Recursive member: split off the next word from RestOfLine
    SELECT Id,S.StringValue, 
            CASE CHARINDEX(' ',RestOfLine)
                WHEN 0 THEN RestOfLine
                ELSE LEFT(RestOfLine, CHARINDEX(' ',RestOfLine) -1)
            END, 
            Position + 1, 
            CASE CHARINDEX(' ',RestOfLine)
                WHEN 0 THEN ''
                ELSE RIGHT(RestOfLine, LEN(RestOfLine) - CHARINDEX(' ',RestOfLine))
            END
    FROM StringsCTE S
    WHERE s.RestOfLine != ''
),
WordsPerString
As
(
    SELECT S.Id, COUNT(s.Word) As NumberOfWords
    FROM StringsCTE S
    GROUP BY S.Id
)
SELECT COUNT(*) As Matches, (SELECT MAX(NumberOfWords) FROM WordsPerString) as Total
FROM StringsCTE S1
INNER JOIN StringsCTE S2
    ON S1.Word = S2.Word AND S1.Id <> S2.Id
WHERE S1.Id = 1 AND 
    NOT EXISTS -- Not already matched
  (SELECT * FROM StringsCTE S3 WHERE S3.Word = S2.Word AND S3.Id <> S1.ID AND S3.Position < S2.Position)

Results:

| MATCHES | TOTAL |
|---------|-------|
|       2 |     5 |

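With this solution I assume you want duplicates removed. Swapping the first and second parameters has no effect on the result.

It returns a single value between 0 and 1 rather than a formatted percentage, because a function can return only one value or a table. Multiply by 100 to get a percentage: 2/3 = 0.67, or 67%.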

CREATE function f_functionx
(
  @str1 varchar(2000),
  @str2 varchar(2000)
)
returns decimal(5,2)
as
BEGIN
DECLARE @returnvalue decimal(5,2)
-- Column width matches the VARCHAR(2000) extracted below, so long words are not truncated
DECLARE @list1 table(value varchar(2000))
-- Split @str1 on spaces by turning it into <t>word</t> XML nodes
INSERT @list1
SELECT t.c.value('.', 'VARCHAR(2000)')
FROM (
    SELECT x = CAST('<t>' + 
        REPLACE(@str1, ' ', '</t><t>') + '</t>' AS XML)
) a
CROSS APPLY x.nodes('/t') t(c)

DECLARE @list2 table(value varchar(2000))
-- Same split for @str2
INSERT @list2
SELECT t.c.value('.', 'VARCHAR(2000)')
FROM (
    SELECT x = CAST('<t>' + 
        REPLACE(@str2, ' ', '</t><t>') + '</t>' AS XML)
) a
CROSS APPLY x.nodes('/t') t(c)


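-- isect = number of distinct words common to both lists;
-- total = the larger of the two distinct word counts
;WITH isect as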
(
  SELECT count(*) match FROM
  (
    SELECT value FROM @list1
    INTERSECT
    SELECT value FROM @list2
  ) x
), total as
(
  SELECT max(cnt) cnt
  FROM
  (
    SELECT count(distinct value) cnt FROM @list1
    UNION ALL
    SELECT count(distinct value) FROM @list2
  ) x
)
SELECT 
  @returnvalue = cast(isect.match as decimal(9,2)) / total.cnt 
FROM total
CROSS JOIN isect

RETURN @returnvalue
END

GO
Returns:

1
0.67
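
For reference, a call might look like the sketch below. It assumes the function ended up in the `dbo` schema (scalar functions must be called with a schema prefix) and reuses the example strings from the question.

```sql
-- Assumption: f_functionx was created in the dbo schema.
SELECT dbo.f_functionx('Golden horses hotel', 'Hotel Golden Horses') AS Similarity,
       CAST(dbo.f_functionx('Golden horses hotel', 'Hotel Golden Horses') * 100 AS int) AS SimilarityPct;

-- Why the function casts before dividing:
SELECT 2 / 3                        AS IntegerDivision,   -- integer division yields 0
       CAST(2 AS decimal(9,2)) / 3  AS DecimalDivision;   -- roughly 0.67
```

Word comparison inside the function follows the collation in effect, so 'hotel' and 'Hotel' only match under a case-insensitive collation.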

SQL is not very strong at string manipulation. If you are allowed to install a CLR routine on your server, you might want to consider one.

@Spky, yes, I can install one. Any ideas or useful links for implementing a solution?

There are some links here on Stack Overflow; in C# this is really easy. Here is one such link.

Unfortunately I am limited to SQL; for certain reasons I don't want to do this in C#.

If you are limited to SQL, see my answer below. Use a split function to create two temp tables, then join them. The slowest part will be the split function, but unless you have a very large word list you probably won't notice.

Thanks for the answer. This only returns the count of similar words; is there any way to also return the counts for Words1 and Words2 without calling the split function again?

Thanks for updating the answer, but it returns incorrect results: Matches and FromW1 are always equal.

Matches will always be equal, but the code assumes #Words1 is larger than #Words2. I will adjust the code shortly to return the larger of the two table counts.

Try these two values in StringTable: 'xx xx Golden horses Malaysia' and 'xx xx xx xx xx'. It returns a total of 5 with 10 matches, so it reports a 200% match.

@t-clausen.dk Updated the answer to address your issue.

This looks like a genius answer. Unfortunately I can't compare the performance, because most of my records contain &. SQL throws "XML parsing: ... illegal name character"; I believe I have to replace it with &amp;
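
One way to handle the `&` problem from the last comment is to escape the XML special characters before the cast. This is a sketch with a hypothetical input string, not part of the original answer:

```sql
DECLARE @str varchar(2000) = 'Golden & Horses hotel';  -- hypothetical input containing '&'

-- Escape '&' first, then '<' and '>', so the text is valid XML content
DECLARE @safe varchar(4000) =
    REPLACE(REPLACE(REPLACE(@str, '&', '&amp;'), '<', '&lt;'), '>', '&gt;');

-- Same XML split as in the function; value() returns the unescaped text again
SELECT t.c.value('.', 'VARCHAR(2000)') AS word
FROM (
    SELECT x = CAST('<t>' + REPLACE(@safe, ' ', '</t><t>') + '</t>' AS XML)
) a
CROSS APPLY x.nodes('/t') t(c);
```

The `&` replacement has to come first; otherwise the ampersands introduced by `&lt;` and `&gt;` would be escaped a second time.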