SQL: Finding an exemplar based on address similarity


I have a data set with three fields: the address, the digits in the address, and the letters in the address.

IF OBJECT_ID ('tempdb..#addresses') IS NOT NULL
DROP TABLE #addresses

create table #addresses (
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100)
)

insert into #addresses
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY')
I want to find groups of addresses by similarity within the same number. I know how to compute the similarity between two text strings:

select *
from #addresses a
left outer join #addresses b on a.address_numbers = b.address_numbers and MDS_DB.MDQ.SIMILARITY(a.address_letters ,b.address_letters , 2, 0, .90) >= .90
...but I don't know how to assign an exemplar/group code to each address in the original data. The expected result looks like this:

IF OBJECT_ID ('tempdb..#addresses_desired_result') IS NOT NULL
DROP TABLE #addresses_desired_result

create table #addresses_desired_result (
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100),
    address_group varchar(100)
)

insert into #addresses_desired_result
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY', '123 Something Rd, Somewhere NY 45678'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY', '123 Something Rd, Somewhere NY 45678'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY', '234 Something Road, Somewhere NY 45678')

select *
from #addresses_desired_result
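address_group can be one of the addresses in the group, or it can be an integer. The goal is to join this distinct list of addresses and exemplars back to a larger transaction table and group the records by exemplar/group number.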

How can I assign an exemplar address/group number to each set of similar addresses within the same number?

To clarify:

IF OBJECT_ID ('tempdb..#addresses') IS NOT NULL
DROP TABLE #addresses

create table #addresses (
    id int identity(1,1),
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100)
)

insert into #addresses
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY')


select A.address_numbers, A.address_all, A.address_letters, 
  isnull(B.address_all, A.address_all) as address_group
from #addresses A
left join
(
select A.id, B.address_all,
  row_number() over(order by case when B.address_all + ' ' like '% rd %' then 1 when B.address_all + ' ' like '% road %' then 2 end,
    case when B.address_all + ' ' like '% st %' then 1 when B.address_all + ' ' like '% street %' then 2 end) AS RowNr
from #addresses A
  cross join #addresses B
  where left(A.address_all, 5) = left(b.address_all, 5)  --place similarity function here
   and A.id <> B.id
) B on A.id = B.id and B.RowNr = 1

I used left(address_all, 5) in place of the similarity function, but you can substitute any calculation you like.

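Address standardization is a slippery slope. That's why I use the Google API. Depending on the volume of data, the initial batch can be very time-consuming, but the results are worth it. Take a look at [link].

Yes, I'm aware of the other address standardization methods. For this purpose, I'm fine with the lower-quality result of grouping only addresses that are 90% or more similar within the same stripped address number. Any thoughts on the join logic?

Obviously the address result has the full address in it. Could you do a Min() OVER() on the parsed-out data (partitioned by address_numbers and address_letters) to get the group value?

You could use an order by to set the format preference, but then you would need to normalize the words. In the past I have created a table where, for example, ROAD becomes RD and STREET becomes ST. I also use a ZIP code database to validate and normalize city, state and ZIP. As you know, cities and towns may have more than one name. For example, East Providence is also known as Riverside.

@JJ32 Only if it were max() instead of min(). Also, the address number is risky. Zip+4 is very inconsistent and would drastically change the key.

Thanks. I'm looking for any addresses that are 90% or more similar to each other and have the same address number to be put into one group. I think this is more in the right direction than where I was heading, but I'm not sure how to add the similarity part, or how to avoid hard-coding some of the normalization you put into the CASE statement.

I see why this is hard. I think you first have to group the addresses by whether they are similar, then assign each group an arbitrary code (maybe min(address_all)), and then partition. Sorry if my understanding of the problem is unclear.

Yes, that describes it well. I've been stuck on how to do the initial grouping.

Bear with me. I think part of the problem is that not all addresses have a similar counterpart, and so they have nothing to "group" on. I tried separating out the similarity test and then returning a value (as the group) based on a sort preference.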
address_numbers address_all                             address_letters             address_group
12345678        123 Something Rd, Somewhere NY 45678    SOMETHINGRDSOMEWHERENY      123 Something Rd, Somewhere NY 45678
12345678        123 Something Road, Somewhere NY 45678  SOMETHINGROADSOMEWHERENY    123 Something Rd, Somewhere NY 45678
23445678        234 Something Road, Somewhere NY 45678  SOMETHINGROADSOMEWHERENY    234 Something Road, Somewhere NY 45678
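
The idea raised in the comments (group similar addresses first, then use something like min(address_all) as the group code) can be sketched roughly as follows. This is only a sketch under assumptions: it reuses the MDS_DB.MDQ.SIMILARITY call exactly as written in the question, it assumes the #addresses table with the id column from the answer, and it assumes an address always scores at least .90 against itself so that every row matches at least its own row.

```sql
-- Sketch: for each address, take the alphabetically first address among
-- all addresses that share its address_numbers and pass the similarity test.
-- Assumes MDS_DB.MDQ.SIMILARITY is available, as used in the question.
select a.id,
       a.address_numbers,
       a.address_all,
       a.address_letters,
       min(b.address_all) as address_group   -- arbitrary but deterministic exemplar
from #addresses a
join #addresses b
  on a.address_numbers = b.address_numbers
 and MDS_DB.MDQ.SIMILARITY(a.address_letters, b.address_letters, 2, 0, .90) >= .90
group by a.id, a.address_numbers, a.address_all, a.address_letters
order by a.id;
```

One caveat with this design: similarity is not transitive. If A is similar to B and B is similar to C, but A is not similar to C, the rows can end up with different exemplars, so a chain of near-matches may need an extra pass to collapse into one group.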