Oracle MERGE with text similarity

oracle oracle11g

I have two tables: from_country and to_country. I want to bring new records and updated records into to_country.

Definitions and data:

-- Source table: incoming country names
CREATE TABLE from_country
(
  country_code varchar2(255) not null
);

-- Target table: existing country names
CREATE TABLE to_country
(
  country_code varchar2(255) not null
);

-- Meaning match
INSERT INTO from_country
(country_code)
VALUES
('United States of America');

-- Match 100%
INSERT INTO from_country
(country_code)
VALUES
('UGANDA');

-- Meaning match, but with domain knowledge
INSERT INTO from_country
(country_code)
VALUES
('CON CORRECT');

-- Brand new country
INSERT INTO from_country
(country_code)
VALUES
('NEW');


-- Meaning match
INSERT INTO to_country
(country_code)
VALUES
('USA');

-- Match 100%
INSERT INTO to_country
(country_code)
VALUES
('UGANDA');

-- Meaning match, but with domain knowledge
INSERT INTO to_country
(country_code)
VALUES
('CON');

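I need to run a MERGE INTO to bring the data from from_country into to_country.

This was my first attempt, but it only matches on equality, which is not good enough. I need something smarter so that it can match on meaning.
If anyone knows how to do this, please share your solution.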
merge into 
  to_country to_t
using
  from_country from_t
on
  (to_t.country_code = from_t.country_code)
when not matched then insert (
  country_code
)
values (
  from_t.country_code
);

In a nutshell, this is what I want:
from_table:
United States of America
UGANDA
CON CORRECT
NEW


to_table:
USA
UGANDA
CON

After the Oracle MERGE INTO, the new to_country table:
United States of America
UGANDA
CON CORRECT
NEW

Please note that this is a simplified example; my real data set is much larger.
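Because the match is not guaranteed to be unique, you have to write a query that uses some kind of decision rule to return only one match per target row.
Here is a simplified case that uses a naive prefix match and then picks just one value when there is more than one match: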
merge into to_country t
using (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          ,case when t.country_code is null then 1 else
             row_number()
             over (partition by t.country_code
                   order by f.country_code)
           end as match_no
    from from_country f
    left join to_country t
    on f.country_code like t.country_code || '%'
  ) where match_no = 1
  ) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);

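This results in the following in to_country (the outputs in this answer are consistent with a from_country whose fourth row is 'CON NEW' rather than 'NEW', as in the score table at the end):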
USA
UGANDA
CON CORRECT
United States of America
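
One consequence of keeping only match_no = 1 is worth checking: a from_country row that matches a target but loses the tie-break is neither used for the update nor inserted. If 'CON CORRECT' and 'CON NEW' both match 'CON', only one of them survives. The query below is a minimal sketch, reusing the same prefix match and ordering as the MERGE above, that lists the rows that would be discarded this way; it assumes the two tables from the question and changes nothing.

```sql
-- Lists from_country rows that match a to_country row but are not
-- the chosen match (match_no > 1), so the MERGE above would skip them.
select fcode, tcode, match_no
from (
  select f.country_code as fcode
        ,t.country_code as tcode
        ,row_number()
         over (partition by t.country_code
               order by f.country_code) as match_no
  from from_country f
  join to_country t
  on f.country_code like t.country_code || '%'
)
where match_no > 1;
```

An empty result means every matching source row was used. Any rows returned need a decision: a better tie-break, a manual fix, or a separate insert.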

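Now that this is solved, you just need to make the matching algorithm smarter. This is where you need to look at your whole data set to see what kinds of errors it contains, e.g. typos.
For this you can try some of the functions in the Oracle-supplied UTL_MATCH package, for example EDIT_DISTANCE or JARO_WINKLER.
Here is an example using the Jaro-Winkler algorithm: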
merge into to_country t
using (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          ,case when t.country_code is null then 1
           else row_number() over (
                partition by t.country_code
                order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
           end as match_no
    from from_country f
    left join to_country t
    on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
  ) where match_no = 1
  ) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);

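Note that I chose an arbitrary cut-off of greater than 70. This is because the Jaro-Winkler similarity of 'UGANDA' and 'USA' is 70, and those two must not match.
The result is as follows: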
United States of America
USA
UGANDA
CON NEW
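
Since the cut-off is arbitrary, it can help to preview what the MERGE would do before running it. The following is a sketch only: it wraps the same USING subquery as the Jaro-Winkler MERGE above in a plain SELECT and labels each row by the branch it would take. The planned_action column name is made up for this illustration.

```sql
-- Dry run: shows which source row is paired with which target row,
-- and whether the MERGE would update (target found) or insert (no target).
select s.fcode
      ,s.tcode
      ,case when s.trowid is null then 'INSERT' else 'UPDATE' end as planned_action
from (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          ,case when t.country_code is null then 1
           else row_number() over (
                partition by t.country_code
                order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
           end as match_no
    from from_country f
    left join to_country t
    on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
  ) where match_no = 1
) s
order by planned_action, s.fcode;
```

Changing the 70 here and re-running is a cheap way to see how sensitive the pairing is to the threshold before any data is modified.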

To see these algorithms in action, run the following:
select f.country_code as fcode
      ,t.country_code as tcode
      ,utl_match.edit_distance_similarity(f.country_code,t.country_code) as ed
      ,utl_match.jaro_winkler_similarity(f.country_code,t.country_code) as jw
from from_country f
cross join to_country t
order by 2, 4 desc;

FCODE                     TCODE    ED   JW
========================  ======  ===  ===
CON NEW                   CON      43   86
CON CORRECT               CON      28   83
UGANDA                    CON      17   50
United States of America  CON       0    0

UGANDA                    UGANDA  100  100
United States of America  UGANDA    9   46
CON NEW                   UGANDA   15   43
CON CORRECT               UGANDA    0   41

UGANDA                    USA      34   70
United States of America  USA      13   62
CON CORRECT               USA       0    0
CON NEW                   USA       0    0
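
The text above names EDIT_DISTANCE and JARO_WINKLER, while the queries call the _SIMILARITY variants. Both kinds exist in UTL_MATCH: the plain functions return the raw measure (a count of edits, or a score between 0 and 1), and the _SIMILARITY functions return a normalized integer from 0 to 100. As a small illustration, the following compares all four on one pair taken from the table:

```sql
-- Raw measures versus their 0-100 normalized counterparts for one pair.
select utl_match.edit_distance('UGANDA', 'USA')            as ed_raw
      ,utl_match.edit_distance_similarity('UGANDA', 'USA') as ed_pct
      ,utl_match.jaro_winkler('UGANDA', 'USA')             as jw_raw
      ,utl_match.jaro_winkler_similarity('UGANDA', 'USA')  as jw_pct
from dual;
```

Going by the table above, ed_pct and jw_pct should come back as 34 and 70 for this pair. The normalized variants are usually the easier ones to put a single cut-off on, because the raw edit distance grows with string length.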
