Is there a better/faster way than this to fuzzy match customer data in SQL?
My customer table has about 1.3 million rows. The program that tries to match customers has not been working as intended, which left a large number of customers unmatched to existing customers.
One of the biggest problems with the program is that it tries to use the billing email as the match key. Our customers come from many different marketplaces, including ones that encrypt customer emails, such as Amazon.com. Email matching is therefore only an option on certain sites and marketplaces.
I would like to know whether there is a better way (than what I have attempted below) to do a "4-way match" on my customer table, so that I can see whether there are matches across several different columns.
For example:
ListAEmail | ListAFullname | ListAAddress | ListAPhone
------------------------------------------------------------------------------
2b************@marketplace.amazon.com | jeff neal | 49 willow | 4*******7
------------------------------------------------------------------------------
would be a 2-out-of-4 match with this:
ListBEmail | ListBFullname | ListBAddress | ListBPhone
------------------------------------------------------------------------------
1********@gmail.com | jeff neal | 7-49 willow | 4*******1
------------------------------------------------------------------------------
It matches 100% on [ListBFullname] and 83.33% on [ListBoTe], so I consider this a matched customer, and I want to assign the same customer ID to both of his orders.
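To make the "2 out of 4" rule concrete, here is a minimal sketch of how the decision could be expressed once the four per-column scores are known. The scores are hard-coded for illustration: the 1.0 and 0.8333 figures come from the example above, while placing the 0.8333 on the address column and using 0.0 for the other two columns are assumptions, not values from my data.

```sql
-- Threshold taken from the procedure's default @MatchScore.
DECLARE @MatchScore float = 0.8;

-- Hypothetical scores for the example pair. Only 1.0 (name) and 0.8333 appear
-- in the post; putting 0.8333 on the address and 0.0 elsewhere is assumed.
SELECT COUNT(*) AS ColumnsMatched,
       CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS IsSameCustomer
FROM (VALUES ('Email',   CAST(0.0 AS float)),
             ('Name',    1.0),
             ('Address', 0.8333),
             ('Phone',   0.0)) AS s(ColumnName, Score)
WHERE s.Score >= @MatchScore;   -- keep only columns that clear the threshold
```

With these placeholder values the query returns ColumnsMatched = 2 and IsSameCustomer = 1. Counting columns above the threshold is one possible reading of the rule; the later queries in this post sum the scores instead.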
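I think my stored procedure below (which I adapted from existing code) could be optimized, but I am not seeing how. It has already been running for more than 15 hours. Any help would be greatly appreciated.
SET ANSI_NULLS ON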
GO
SET QUOTED_IDENTIFIER ON
GO
Create procedure [dbo].[FuzzyMatchBigData] (
@MatchScore float = .8
) AS
TRUNCATE TABLE [dbo].[TempMatch]
INSERT INTO [dbo].[TempMatch]
SELECT ListA.BillEmail as ListAEmail
,ListA.FirstName+' '+ListA.LastName AS ListAFullname
,ListA.Numbers AS ListAAddress
,ListA.Phone as ListAPhone
,ListB.BillEmail as ListBEmail
,ListB.FirstName+' '+ListB.LastName AS ListBFullname
,ListB.Numbers AS ListBAddress
,ListB.Phone as ListBPhone
,CAST(0 AS float) as Matchscore0
,CAST(0 AS float) as Matchscore1
,CAST(0 AS float) as Matchscore2
,CAST(0 AS float) as Matchscore3
FROM (
SELECT CASE WHEN LOWER(BillEmail) = '' OR LOWER(BillEmail) = 'N/A' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE LOWER(BillEmail)
END as BillEmail
,CASE WHEN LOWER([BillFirstName] + ' ' + [BillLastName]) = '' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE LOWER([BillFirstName] + ' ' + [BillLastName])
END as BillName
,LOWER([BillFirstName]) as FirstName
,LOWER([BillLastName]) as LastName
,LOWER(BillCity) as BillCity
,LOWER(BillCompany) as BillCompany
,LOWER(BillCountryCode) as BillCountryCode
,CASE WHEN replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
LOWER(BillPhone),'-',''),')',''),'(',''),' ',''),'*',''),',',''),'+',''),'&',''),'.',''),'=',''),'/','') = '' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
LOWER(BillPhone),'-',''),')',''),'(',''),' ',''),'*',''),',',''),'+',''),'&',''),'.',''),'=',''),'/','')
END as Phone
,LOWER(BillPostalCode) as BillPostalCode
,LOWER(BillStateProvCode) as BillStateProvCode
,replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
LOWER([BillStreet1] + ' ' + [BillStreet2]),'-',''),')',''),'(',''),'*',''),',',''),'+',''),'&',''),'.',''),'=',''),'/',''),'#','') as 'Address'
,CASE WHEN LOWER(REPLACE(REPLACE(BillNumbers, CHAR(13), ' '), CHAR(10), ' ')) = '' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE LOWER(REPLACE(REPLACE(BillNumbers, CHAR(13), ' '), CHAR(10), ' '))
END as Numbers
FROM (
SELECT [CustomerID]
,CASE WHEN BillEmail = '' OR BillEmail = 'N/A' THEN ShipEmail
ELSE BillEmail
END as BillEmail
,CASE WHEN BillFirstName = '' THEN REPLACE(REPLACE([ShipFirstName], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillFirstName], CHAR(13), ' '), CHAR(10), ' ')
END AS BillFirstName
,CASE WHEN BillLastName = '' THEN REPLACE(REPLACE([ShipLastName], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillLastName], CHAR(13), ' '), CHAR(10), ' ')
END AS BillLastName
,CASE WHEN BillCity = '' THEN REPLACE(REPLACE([ShipCity], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillCity], CHAR(13), ' '), CHAR(10), ' ')
END AS BillCity
,CASE WHEN BillCompany = '' THEN REPLACE(REPLACE([ShipCompany], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillCompany], CHAR(13), ' '), CHAR(10), ' ')
END AS BillCompany
,CASE WHEN BillCountryCode = ''THEN REPLACE(REPLACE([ShipCountryCode], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillCountryCode], CHAR(13), ' '), CHAR(10), ' ')
END as BillCountryCode
,CASE WHEN BillPhone = '' THEN REPLACE(REPLACE([ShipPhone], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillPhone], CHAR(13), ' '), CHAR(10), ' ')
END AS BillPhone
,CASE WHEN BillPostalCode = '' THEN REPLACE(REPLACE([ShipPostalCode], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillPostalCode], CHAR(13), ' '), CHAR(10), ' ')
END AS BillPostalCode
,CASE WHEN BillStateProvCode = '' THEN REPLACE(REPLACE([ShipStateProvCode], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStateProvCode], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStateProvCode
,CASE WHEN BillStreet1 = '' THEN REPLACE(REPLACE([ShipStreet1], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStreet1], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStreet1
,CASE WHEN SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1)) = ''
THEN SUBSTRING([ShipStreet1], 1, CHARINDEX(' ', [ShipStreet1],CHARINDEX(' ', ShipStreet1) + 1))
ELSE SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1))
END as BillNumbers
,CASE WHEN BillStreet2 = '' THEN REPLACE(REPLACE([ShipStreet2], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStreet2], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStreet2
,CASE WHEN BillStreet3 = '' THEN REPLACE(REPLACE([ShipStreet3], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStreet3], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStreet3
FROM [Customer]
) AS Data2
) ListA
JOIN (
SELECT CASE WHEN LOWER(BillEmail) = '' OR LOWER(BillEmail) = 'N/A' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE LOWER(BillEmail)
END as BillEmail
,CASE WHEN LOWER([BillFirstName] + ' ' + [BillLastName]) = '' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE LOWER([BillFirstName] + ' ' + [BillLastName])
END as BillName
,LOWER([BillFirstName]) as FirstName
,LOWER([BillLastName]) as LastName
,LOWER(BillCity) as BillCity
,LOWER(BillCompany) as BillCompany
,LOWER(BillCountryCode) as BillCountryCode
,CASE WHEN replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
LOWER(BillPhone),'-',''),')',''),'(',''),' ',''),'*',''),',',''),'+',''),'&',''),'.',''),'=',''),'/','') = '' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
LOWER(BillPhone),'-',''),')',''),'(',''),' ',''),'*',''),',',''),'+',''),'&',''),'.',''),'=',''),'/','')
END as Phone
,LOWER(BillPostalCode) as BillPostalCode
,LOWER(BillStateProvCode) as BillStateProvCode
,replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
LOWER([BillStreet1] + ' ' + [BillStreet2]),'-',''),')',''),'(',''),'*',''),',',''),'+',''),'&',''),'.',''),'=',''),'/',''),'#','') as 'Address'
,CASE WHEN LOWER(REPLACE(REPLACE(BillNumbers, CHAR(13), ' '), CHAR(10), ' ')) = '' THEN CONVERT(NVARCHAR,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
ELSE LOWER(REPLACE(REPLACE(BillNumbers, CHAR(13), ' '), CHAR(10), ' '))
END as Numbers
FROM (
SELECT [CustomerID]
,CASE WHEN BillEmail = '' OR BillEmail = 'N/A' THEN ShipEmail
ELSE BillEmail
END as BillEmail
,CASE WHEN BillFirstName = '' THEN REPLACE(REPLACE([ShipFirstName], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillFirstName], CHAR(13), ' '), CHAR(10), ' ')
END AS BillFirstName
,CASE WHEN BillLastName = '' THEN REPLACE(REPLACE([ShipLastName], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillLastName], CHAR(13), ' '), CHAR(10), ' ')
END AS BillLastName
,CASE WHEN BillCity = '' THEN REPLACE(REPLACE([ShipCity], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillCity], CHAR(13), ' '), CHAR(10), ' ')
END AS BillCity
,CASE WHEN BillCompany = '' THEN REPLACE(REPLACE([ShipCompany], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillCompany], CHAR(13), ' '), CHAR(10), ' ')
END AS BillCompany
,CASE WHEN BillCountryCode = ''THEN REPLACE(REPLACE([ShipCountryCode], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillCountryCode], CHAR(13), ' '), CHAR(10), ' ')
END as BillCountryCode
,CASE WHEN BillPhone = '' THEN REPLACE(REPLACE([ShipPhone], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillPhone], CHAR(13), ' '), CHAR(10), ' ')
END AS BillPhone
,CASE WHEN BillPostalCode = '' THEN REPLACE(REPLACE([ShipPostalCode], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillPostalCode], CHAR(13), ' '), CHAR(10), ' ')
END AS BillPostalCode
,CASE WHEN BillStateProvCode = '' THEN REPLACE(REPLACE([ShipStateProvCode], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStateProvCode], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStateProvCode
,CASE WHEN BillStreet1 = '' THEN REPLACE(REPLACE([ShipStreet1], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStreet1], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStreet1
,CASE WHEN SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1)) = ''
THEN SUBSTRING([ShipStreet1], 1, CHARINDEX(' ', [ShipStreet1],CHARINDEX(' ', ShipStreet1) + 1))
ELSE SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1))
END as BillNumbers
,CASE WHEN BillStreet2 = '' THEN REPLACE(REPLACE([ShipStreet2], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStreet2], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStreet2
,CASE WHEN BillStreet3 = '' THEN REPLACE(REPLACE([ShipStreet3], CHAR(13), ' '), CHAR(10), ' ')
ELSE REPLACE(REPLACE([BillStreet3], CHAR(13), ' '), CHAR(10), ' ')
END AS BillStreet3
FROM [Customer]
)as Data3
) ListB
ON MDS1.mdq.Similarity(ListA.FirstName+' '+ListA.LastName, ListB.FirstName+' '+ListB.LastName, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.BillEmail,ListB.BillEmail, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.Numbers,ListB.Numbers, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.Phone,ListB.Phone, 3, 1.0, @MatchScore) >= @MatchScore
UPDATE [TempMatch]
SET MatchScore0 = MDS1.mdq.Similarity(ListAEmail,ListBEmail, 3, 1.0, @MatchScore)
UPDATE [TempMatch]
SET MatchScore1 = MDS1.mdq.Similarity(ListAFullname, ListBFullname, 3, 1.0, @MatchScore)
UPDATE [TempMatch]
SET MatchScore2 = MDS1.mdq.Similarity(ListAAddress, ListBAddress, 3, 1.0, @MatchScore)
UPDATE [TempMatch]
SET MatchScore3 = MDS1.mdq.Similarity(ListAPhone, ListBPhone, 3, 1.0, @MatchScore)
Edit
Following @John Pasquet's suggestion, I created a new table along the lines he proposed and was able to shorten my query to this:
DECLARE @MatchScore float = .8
SELECT *
FROM (
SELECT ListA.CustomerID as ListACustomerID
,ListA.BillEmail as ListAEmail
,ListA.FirstName+' '+ListA.LastName AS ListAFullname
,ListA.Numbers AS ListAAddress
,ListA.Phone as ListAPhone
,ListB.CustomerID as ListBCustomerID
,ListB.BillEmail as ListBEmail
,ListB.FirstName+' '+ListB.LastName AS ListBFullname
,ListB.Numbers AS ListBAddress
,ListB.Phone as ListBPhone
,MDS1.mdq.Similarity(ListA.FirstName+' '+ListA.LastName, ListB.FirstName+' '+ListB.LastName, 3, 1.0, @MatchScore) as NameScore
,MDS1.mdq.Similarity(ListA.BillEmail, ListB.BillEmail, 3, 1.0, @MatchScore) as EmailScore
,MDS1.mdq.Similarity(ListA.Numbers,ListB.Numbers,3,1.0, @MatchScore) as NumberScore
,MDS1.mdq.Similarity(ListA.Phone,ListB.Phone,3,1.0, @MatchScore) as PhoneScore
FROM (
SELECT [CustomerID]
,BillEmail
,BillFirstName AS FirstName
,BillLastName AS LastName
,BillPhone AS Phone
,BillStreet1
,CASE WHEN SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1)) = ''
THEN SUBSTRING([ShipStreet1], 1, CHARINDEX(' ', [ShipStreet1],CHARINDEX(' ', ShipStreet1) + 1))
ELSE SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1))
END as Numbers
FROM [CustomerFix]
) ListA
JOIN (
SELECT [CustomerID]
,BillEmail
,BillFirstName AS FirstName
,BillLastName AS LastName
,BillPhone AS Phone
,BillStreet1
,CASE WHEN SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1)) = ''
THEN SUBSTRING([ShipStreet1], 1, CHARINDEX(' ', [ShipStreet1],CHARINDEX(' ', ShipStreet1) + 1))
ELSE SUBSTRING([BillStreet1], 1, CHARINDEX(' ', [BillStreet1],CHARINDEX(' ', BillStreet1) + 1))
END as Numbers
FROM [CustomerFix]
) ListB
ON MDS1.mdq.Similarity(ListA.FirstName+' '+ListA.LastName, ListB.FirstName+' '+ListB.LastName, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.BillEmail,ListB.BillEmail, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.Numbers,ListB.Numbers, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.Phone,ListB.Phone, 3, 1.0, @MatchScore) >= @MatchScore
) as Data5
WHERE (NameScore+EmailScore+NumberScore+PhoneScore) > 1
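I am running this as a single query for testing, so I am getting results slowly but surely. It is still very CPU intensive with 1.3 million records. I am hoping it can be optimized further before I create the rest of the stored procedure that will clean up and update the Customer table.
Once this initial data cleanup is done, I will write an SP that cleans up new customers as they come in.
Edit #2
I added extra columns and reindexed the table, so that I can cut CPU usage by not comparing concatenated strings. Just adding 2 extra columns and reindexing gave at least a 10x speedup.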
DECLARE @MatchScore float = .8
SELECT *
FROM (
SELECT ListA.CustomerID as ListACustomerID
,ListA.BillEmail as ListAEmail
,ListA.[Name] AS ListAFullname
,ListA.Numbers AS ListAAddress
,ListA.Phone as ListAPhone
,ListB.CustomerID as ListBCustomerID
,ListB.BillEmail as ListBEmail
,ListB.[Name] AS ListBFullname
,ListB.Numbers AS ListBAddress
,ListB.Phone as ListBPhone
,MDS1.mdq.Similarity(ListA.[Name], ListB.[Name], 3, 1.0, @MatchScore) as NameScore
,MDS1.mdq.Similarity(ListA.BillEmail, ListB.BillEmail, 3, 1.0, @MatchScore) as EmailScore
,MDS1.mdq.Similarity(ListA.Numbers,ListB.Numbers,3,1.0, @MatchScore) as NumberScore
,MDS1.mdq.Similarity(ListA.Phone,ListB.Phone,3,1.0, @MatchScore) as PhoneScore
FROM (
SELECT [CustomerID]
,BillEmail
,BillFullName as [Name]
,BillPhone AS Phone
,BillNumbers as Numbers
FROM [CustomerFix]
) ListA
JOIN (
SELECT [CustomerID]
,BillEmail
,BillFullName as [Name]
,BillPhone AS Phone
,BillNumbers as Numbers
FROM [CustomerFix]
) ListB
ON MDS1.mdq.Similarity(ListA.[Name], ListB.[Name], 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.BillEmail,ListB.BillEmail, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.Numbers,ListB.Numbers, 3, 1.0, @MatchScore) >= @MatchScore
OR MDS1.mdq.Similarity(ListA.Phone,ListB.Phone, 3, 1.0, @MatchScore) >= @MatchScore
) as Data5
WHERE (NameScore+EmailScore+NumberScore+PhoneScore) > 1
That is a huge amount of CPU. When you do comparisons after all of those function calls, you lose all the power of your indexes. A better approach would be to create additional columns that hold the results of all the REPLACE and LOWER statements, then create indexes on those columns, and then run the query. You could also run the SELECT with a TOP 10 or something similar. I would profile it to see how inefficient it is, but I doubt there is much you can do to optimize it.
The data looks like it is masked, not encrypted. Is that the case?
@JohnPasquet I will create the additional columns as you suggest. I will post my results. Thank you.
@scsimon The data marked with ******* is just me protecting the customers. The data is actually stored as nvarchar.
@JohnPasquet & @SqlZim I cleaned the incomplete records out of the table last night. That brought the record count down to 1.1 million. I took the suggestion and created an additional column with the concatenated name and added it to the indexes. I also went ahead and created a new column for the column aliased as "Numbers". Running my initial query as a test, I am already seeing at least a 10-15x speedup. We are heading in the right direction. I will keep you posted on my progress today. Thanks guys.
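For anyone following along, the "extra columns plus indexes" idea from the comments can be sketched roughly as below. This is only an illustration under assumptions: dbo.CustomerDemo, PhoneClean, the index names and the column sizes are hypothetical, and using persisted computed columns is one possible way to store the cleaned values, not necessarily how the CustomerFix table was built.

```sql
-- Hypothetical demo table; column names mirror the post, sizes are assumed.
CREATE TABLE dbo.CustomerDemo (
    CustomerID    int IDENTITY(1,1) PRIMARY KEY,
    BillFirstName nvarchar(50) NOT NULL,
    BillLastName  nvarchar(50) NOT NULL,
    BillPhone     nvarchar(30) NOT NULL,
    -- Cleaned values are computed once on write and stored with the row,
    -- so the matching query does not rebuild them for every comparison.
    BillFullName  AS LOWER(BillFirstName + N' ' + BillLastName) PERSISTED,
    PhoneClean    AS CAST(REPLACE(REPLACE(REPLACE(REPLACE(
                         BillPhone, N'-', N''), N'(', N''), N')', N''), N' ', N'')
                     AS nvarchar(30)) PERSISTED
);

-- Narrow indexes on the cleaned columns (index names are hypothetical).
CREATE NONCLUSTERED INDEX IX_CustomerDemo_BillFullName
    ON dbo.CustomerDemo (BillFullName);
CREATE NONCLUSTERED INDEX IX_CustomerDemo_PhoneClean
    ON dbo.CustomerDemo (PhoneClean);

-- Made-up sample row, not real customer data.
INSERT INTO dbo.CustomerDemo (BillFirstName, BillLastName, BillPhone)
VALUES (N'Jeff', N'Neal', N'(555) 010-0199');

-- Returns 'jeff neal' and '5550100199' for the sample row.
SELECT BillFullName, PhoneClean
FROM dbo.CustomerDemo;
```

The sketch only strips four characters from the phone number to stay short; the full list of characters from the stored procedure would be nested the same way.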