Oracle MERGE with text similarity
I have two tables: from_country and to_country. I want to bring new and updated records into to_country.
Definitions and data:
--
CREATE TABLE from_country
(
country_code varchar2(255) not null
);
--
CREATE TABLE to_country
(
country_code varchar2(255) not null
);
-- Meaning match
INSERT INTO from_country
(country_code)
VALUES
('United States of America');
-- Match 100%
INSERT INTO from_country
(country_code)
VALUES
('UGANDA');
-- Meaning match, but with domain knowledge
INSERT INTO from_country
(country_code)
VALUES
('CON CORRECT');
-- Brand new country
INSERT INTO from_country
(country_code)
VALUES
('NEW');
--
INSERT INTO to_country
(country_code)
VALUES
('USA');
-- Match 100%
INSERT INTO to_country
(country_code)
VALUES
('UGANDA');
-- Meaning match, but with domain knowledge
INSERT INTO to_country
(country_code)
VALUES
('CON');
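I need to run a MERGE INTO so that the data is brought from from_country into to_country.
This is my first attempt, but it only does an equality match, which is not good enough. I need something smarter so that it can match on meaning.
If anyone knows how to do this, please post your solution.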
merge into
to_country to_t
using
from_country from_t
on
(to_t.country_code = from_t.country_code)
when not matched then insert (
country_code
)
values (
from_t.country_code
);
In short, this is what I want:
from_table:
United States of America
UGANDA
CON CORRECT
NEW
to_table:
USA
UGANDA
CON
Oracle MERGE INTO
the new to_country table:
United States of America
UGANDA
CON CORRECT
NEW
sql fiddle:
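Note that this is my simplified example. I have a much larger dataset.

Because a match is not guaranteed to be unique, you have to write a query that applies some decision rule so that it returns only one match per target row.
Here is a simplified case that uses a naive prefix match and then picks just one value when there are multiple matches: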
merge into to_country t
using (
select * from (
select t.rowid as trowid
,f.country_code as fcode
,t.country_code as tcode
,case when t.country_code is null then 1 else
row_number()
over (partition by t.country_code
order by f.country_code)
end as match_no
from from_country f
left join to_country t
on f.country_code like t.country_code || '%'
) where match_no = 1
) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
This results in the following in to_country:
USA
UGANDA
CON CORRECT
United States of America
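Before running a MERGE like this on a larger dataset, it may help to see which to_country values attract more than one candidate, since those are the rows where the row_number() tie-break silently discards something. The query below is only a sketch: it reuses the same prefix-match condition as the MERGE, and it assumes Oracle 11g Release 2 or later for LISTAGG.

```sql
-- List every to_country value that more than one from_country row
-- would match under the same LIKE condition used in the MERGE.
select t.country_code as tcode
      ,count(*)       as candidates
      ,listagg(f.country_code, ' | ')
         within group (order by f.country_code) as fcodes
from from_country f
join to_country t
  on f.country_code like t.country_code || '%'
group by t.country_code
having count(*) > 1;
```

Any row returned here is a case where only the first candidate (in ORDER BY order) is kept by the MERGE and the others are dropped rather than inserted.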
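With that solved, you just need to make the matching algorithm smarter. This is where you need to look at your whole dataset to see what kinds of discrepancies it contains, e.g. typos.
For this, you can try some of the functions in Oracle's UTL_MATCH package, e.g. EDIT_DISTANCE or JARO_WINKLER.
Here is an example using the Jaro-Winkler algorithm: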
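To get a feel for these functions before wiring them into a MERGE, they can be called directly on literals. A minimal sketch, using two values from the sample data:

```sql
-- Compare two literals with each UTL_MATCH function.
-- EDIT_DISTANCE returns the raw edit distance (lower = closer).
-- JARO_WINKLER returns a score between 0 and 1 (higher = closer).
-- The *_SIMILARITY variants return an integer from 0 to 100 (higher = closer).
select utl_match.edit_distance('UGANDA', 'USA')            as ed_raw
      ,utl_match.edit_distance_similarity('UGANDA', 'USA') as ed_sim
      ,utl_match.jaro_winkler('UGANDA', 'USA')             as jw_raw
      ,utl_match.jaro_winkler_similarity('UGANDA', 'USA')  as jw_sim
from dual;
```

The *_SIMILARITY variants are usually the easier ones to use with a cutoff, because they are already normalized to a 0 to 100 scale regardless of string length.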
merge into to_country t
using (
select * from (
select t.rowid as trowid
,f.country_code as fcode
,t.country_code as tcode
,case when t.country_code is null then 1
else row_number() over (
partition by t.country_code
order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
end as match_no
from from_country f
left join to_country t
on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
) where match_no = 1
) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
SQL Fiddle:
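Note that I chose an arbitrary cutoff of greater than 70. This is because the Jaro-Winkler similarity of UGANDA and USA is 70, and those two must not match.
This results in the following: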
United States of America
USA
UGANDA
CON NEW
To see these algorithms in action, run the following:
select f.country_code as fcode
,t.country_code as tcode
,utl_match.edit_distance_similarity(f.country_code,t.country_code) as ed
,utl_match.jaro_winkler_similarity(f.country_code,t.country_code) as jw
from from_country f
cross join to_country t
order by 2, 4 desc;
FCODE                    TCODE   ED  JW
======================== ====== === ===
CON NEW                  CON     43  86
CON CORRECT              CON     28  83
UGANDA                   CON     17  50
United States of America CON      0   0
UGANDA                   UGANDA 100 100
United States of America UGANDA   9  46
CON NEW                  UGANDA  15  43
CON CORRECT              UGANDA   0  41
UGANDA                   USA     34  70
United States of America USA     13  62
CON CORRECT              USA      0   0
CON NEW                  USA      0   0
SQL Fiddle:
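In the scores above, United States of America against USA reaches only 62, which is lower than UGANDA against USA at 70, so no cutoff on these scores can accept the first pair while rejecting the second. For matches like that, which need domain knowledge, one option is a hand-maintained alias table that is applied before the similarity-based MERGE. The following is only a sketch: the country_alias table and its columns are hypothetical and not part of the original schema.

```sql
-- Hypothetical lookup table: maps a from_country spelling to the
-- to_country value it should replace.
create table country_alias
(
  from_code varchar2(255) not null,
  to_code   varchar2(255) not null
);

insert into country_alias (from_code, to_code)
values ('United States of America', 'USA');

-- Apply the known aliases first. The join goes through ROWID, as in the
-- MERGE statements above, because Oracle does not allow updating a
-- column that is referenced in the ON clause.
merge into to_country t
using (
  select t2.rowid as trowid
        ,a.from_code
  from country_alias a
  join from_country f  on f.country_code  = a.from_code
  join to_country   t2 on t2.country_code = a.to_code
) s
on (s.trowid = t.rowid)
when matched then update set t.country_code = s.from_code;
```

This assumes each to_country row is targeted by at most one alias that is present in from_country; otherwise the MERGE fails because the source does not yield a stable set of rows. After this step, the similarity-based MERGE can handle whatever is left.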