Oracle MERGE INTO with text similarity

I have two tables: from_country and to_country. I want to bring new records and updated records into to_country.

Definition and data

-- Source table: incoming country values
CREATE TABLE from_country
(
  country_code varchar2(255) not null
);

-- Target table: existing country values to be merged into
CREATE TABLE to_country
(
  country_code varchar2(255) not null
);

-- Meaning match
INSERT INTO from_country
(country_code)
VALUES
('United States of America');

-- Match 100%
INSERT INTO from_country
(country_code)
VALUES
('UGANDA');

-- Meaning match, but with domain knowledge
INSERT INTO from_country
(country_code)
VALUES
('CON CORRECT');

-- Brand new country
INSERT INTO from_country
(country_code)
VALUES
('NEW');


-- Meaning match: short form of 'United States of America'
INSERT INTO to_country
(country_code)
VALUES
('USA');

-- Match 100%
INSERT INTO to_country
(country_code)
VALUES
('UGANDA');

-- Meaning match, but with domain knowledge
INSERT INTO to_country
(country_code)
VALUES
('CON');
I need to run a MERGE INTO so that the data is brought from from_country into to_country.

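This is my first attempt, but it only does an equality match, which is not good enough. I need something smarter so that it can match on meaning. If anyone knows how to do this, please provide your solution.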

merge into 
  to_country to_t
using
  from_country from_t
on
  (to_t.country_code = from_t.country_code)
when not matched then insert (
  country_code
)
values (
  from_t.country_code
);
In short, this is what I want:

from_table:
United States of America
UGANDA
CON CORRECT
NEW


to_table:
USA
UGANDA
CON
Oracle MERGE INTO

the new to_country table:
United States of America
UGANDA
CON CORRECT
NEW

Please note that this is a simplified example. My real data set is much larger.

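Because a match is not guaranteed to be unique, you have to write a query that applies some kind of decision rule so that only one match is returned for each target row.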

Here is a simplified case that uses a naive prefix match and then picks just one value when there is more than one match:

merge into to_country t
using (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          ,case when t.country_code is null then 1 else
             row_number()
             over (partition by t.country_code
                   order by f.country_code)
           end as match_no
    from from_country f
    left join to_country t
    on f.country_code like t.country_code || '%'
  ) where match_no = 1
  ) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
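This leaves the following rows in to_country. (The outputs in this answer appear to come from a variant of the sample data in which the fourth from_country row is 'CON NEW' rather than 'NEW'; the score table further down lists that value. Under that assumption 'CON NEW' is missing here because 'CON CORRECT' and 'CON NEW' both match 'CON', and only the row with match_no = 1 is kept.)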

USA
UGANDA
CON CORRECT
United States of America
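
Before running a MERGE like the one above, it can help to see which target rows have more than one candidate, since those are the rows where the row_number() tie-break decides the outcome. The query below is a minimal sketch that reuses the same prefix rule; the column aliases candidate_count and candidates are illustrative, and LISTAGG assumes Oracle 11g Release 2 or later.

```sql
-- List every to_country value that is matched by more than one
-- from_country row under the prefix rule used in the MERGE above.
select t.country_code as tcode
      ,count(*) as candidate_count
      ,listagg(f.country_code, '; ')
         within group (order by f.country_code) as candidates
from to_country t
join from_country f
on f.country_code like t.country_code || '%'
group by t.country_code
having count(*) > 1
order by t.country_code;
```

Any row returned here is a case where only the first candidate in the ORDER BY is merged and the others are silently dropped, so it is worth reviewing them by hand.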
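Now that this part is solved, you only need to make the matching algorithm smarter. This is where you need to look at your whole data set to see what kinds of errors it contains, such as typos.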

For this you can try some of the functions in the Oracle-supplied UTL_MATCH package, for example EDIT_DISTANCE or JARO_WINKLER.
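
As a quick way to get a feel for these functions, the sketch below calls the four UTL_MATCH functions on a single pair of literals. The column aliases are illustrative only.

```sql
-- Compare one pair of strings with each UTL_MATCH function.
-- edit_distance            : number of single-character edits (lower is closer)
-- edit_distance_similarity : integer score from 0 to 100 (higher is closer)
-- jaro_winkler             : score between 0 and 1 (higher is closer)
-- jaro_winkler_similarity  : integer score from 0 to 100 (higher is closer)
select utl_match.edit_distance('CON NEW', 'CON')            as ed_raw
      ,utl_match.edit_distance_similarity('CON NEW', 'CON') as ed_sim
      ,utl_match.jaro_winkler('CON NEW', 'CON')             as jw_raw
      ,utl_match.jaro_winkler_similarity('CON NEW', 'CON')  as jw_sim
from dual;
```

For this pair the raw edit distance is 4 (the four characters ' NEW' have to be removed), and the score table in this answer reports 43 and 86 for the two similarity functions.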

Here is an example using the Jaro-Winkler algorithm:

merge into to_country t
using (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          ,case when t.country_code is null then 1
           else row_number() over (
                partition by t.country_code
                order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
           end as match_no
    from from_country f
    left join to_country t
    on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
  ) where match_no = 1
  ) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);

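Note that I chose an arbitrary cut-off of greater than 70. This is because 'UGANDA' and 'USA' have a Jaro-Winkler similarity of 70, so a threshold of > 70 keeps that pair from being treated as a match.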

The result is as follows:

United States of America
USA
UGANDA
CON NEW
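
Similarity scores alone do not connect 'United States of America' with 'USA' (the score table in this answer gives that pair a Jaro-Winkler similarity of 62, below the cut-off), so the long form ends up as a new row. For matches that depend on domain knowledge, one option is an explicit lookup table. The following is a minimal sketch and not part of the original answer: the table country_alias and its columns short_code and long_code are hypothetical names, and it assumes each to_country row has at most one alias present in from_country (otherwise Oracle raises ORA-30926).

```sql
-- Hypothetical lookup table holding known equivalences that no
-- string-similarity function would find on its own.
create table country_alias
(
  short_code varchar2(255) not null,
  long_code  varchar2(255) not null
);

insert into country_alias (short_code, long_code)
values ('USA', 'United States of America');

-- Update target rows whose value has a known alias in from_country.
-- The join on rowid follows the MERGE statements above; it also avoids
-- ORA-38104, because the updated column is not referenced in the ON clause.
merge into to_country t
using (
  select t2.rowid as trowid
        ,f.country_code as fcode
  from from_country f
  join country_alias a
  on a.long_code = f.country_code
  join to_country t2
  on t2.country_code = a.short_code
  ) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode;
```

If this runs before the similarity-based MERGE, 'USA' has already become 'United States of America' by the time the fuzzy match runs, so that value should then match exactly instead of being inserted as a duplicate.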
To see how the UTL_MATCH algorithms score each pair, run the following:

select f.country_code as fcode
      ,t.country_code as tcode
      ,utl_match.edit_distance_similarity(f.country_code,t.country_code) as ed
      ,utl_match.jaro_winkler_similarity(f.country_code,t.country_code) as jw
from from_country f
cross join to_country t
order by 2, 4 desc;

FCODE                     TCODE    ED   JW
========================  ======  ===  ===
CON NEW                   CON      43   86
CON CORRECT               CON      28   83
UGANDA                    CON      17   50
United States of America  CON       0    0

UGANDA                    UGANDA  100  100
United States of America  UGANDA    9   46
CON NEW                   UGANDA   15   43
CON CORRECT               UGANDA    0   41

UGANDA                    USA      34   70
United States of America  USA      13   62
CON CORRECT               USA       0    0
CON NEW                   USA       0    0
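
The zero scores for 'United States of America' against 'CON' suggest that the comparison is case-sensitive, since the long name contains a lowercase 'o' and 'n' but no uppercase 'C', 'O' or 'N'. If mixed case is common in the real data, normalising both sides before scoring may give more useful numbers. The query below is a sketch of that idea; the alias jw_upper is illustrative, and whether UPPER and TRIM are the right normalisation depends on the data.

```sql
-- Score every pair twice: once on the raw values and once after
-- trimming whitespace and upper-casing both sides.
select f.country_code as fcode
      ,t.country_code as tcode
      ,utl_match.jaro_winkler_similarity(f.country_code, t.country_code) as jw
      ,utl_match.jaro_winkler_similarity(
         upper(trim(f.country_code)), upper(trim(t.country_code))) as jw_upper
from from_country f
cross join to_country t
order by 2, 4 desc;
```

If the normalised scores look better, the same UPPER(TRIM(...)) expressions can be used in the join condition and ORDER BY of the MERGE, keeping in mind that the cut-off of 70 was chosen for the raw values and may need revisiting.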