SQL: Delete duplicates from a table in priority order
I have a table with the following sample data:
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id | url | description | description_hash | city | latitude | longitude | service | sid |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| 1 | www.website.com/sdadsd12d1 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | website | 1 |
| 2 | www.page.com/gfdg3df2f2 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | page | 2 |
| 3 | www.site.com/sdjbhsjhd17 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | site | 3 |
| 4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886 | site | 3 |
| 5 | www.page.com/dgdg2wg3 | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | page | 2 |
| 6 | www.webpage.com/8f8fj2h | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | webpage | 4 |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
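My task is to delete duplicate rows. I want unique combinations of description (hash), service, and latitude (city). Until today I have been using the following queries: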
update my_table_data
set description_hash = md5(description::text)
where description_hash is null;
DROP VIEW IF EXISTS temp_view_duplicates;
CREATE VIEW temp_view_duplicates AS WITH A
AS (
SELECT Distinct
description_hash
, service
FROM my_table_data
)
, B
AS (
SELECT description_hash
FROM A
GROUP BY
description_hash
HAVING COUNT(*) > 1
), C
AS (
SELECT A.description_hash,
A.service
FROM A
JOIN B
ON A.description_hash = B.description_hash
order by description_hash
), D AS
(
select distinct latitude, description_hash, service
from my_table_data
where description_hash in (SELECT description_hash FROM C)
order by description_hash
), E AS
(SELECT description_hash, latitude
FROM D
GROUP BY
description_hash, latitude
HAVING COUNT(*) > 1)
SELECT min(ctid) as min_ctid, description_hash, latitude
FROM my_table_data
WHERE description_hash in (SELECT description_hash FROM E)
group by description_hash, latitude
order by description_hash;
DELETE FROM my_table_data a USING (
SELECT min_ctid, description_hash, latitude
FROM temp_view_duplicates
) b
WHERE a.description_hash = b.description_hash AND a.latitude = b.latitude
AND a.ctid <> b.min_ctid;
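Now I want to change the query so that the deletion takes the service order (priority, by sid) into account.
Two example results with priority lists:
Priority:
- I am using PostgreSQL
- I use md5 hashes because the descriptions are very long and comparing them directly takes too much time
- I expect to run this query daily on 1M rows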
create table my_table_data(
id int,
url text,
description text,
description_hash text,
city text,
latitude double precision,
longitude double precision,
service text,
sid int
);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(1, 'www.website.com/sdadsd12d1', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris', 48.85670, 2.35146, 'website', 1);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(2, 'www.page.com/gfdg3df2f2', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris', 48.85670, 2.35146, 'page', 2);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(3, 'www.site.com/sdjbhsjhd17', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Paris', 48.85670, 2.35146, 'site', 3);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(4, 'www.site.com/sdsdadqwd12', 'Some description here version 1', '94b35433ecd64545db9c9129b877ea49', 'Berlin', 52.51704, 13.38886, 'site', 3);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(5, 'www.page.com/dgdg2wg3', 'Some description here version 2', '764ed2b4f0d28e45332816c7beedb706', 'Berlin', 52.51704, 13.38886, 'page', 2);
insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid)
values(6, 'www.webpage.com/8f8fj2h', 'Some description here version 2', '764ed2b4f0d28e45332816c7beedb706', 'Berlin', 52.51704, 13.38886, 'webpage', 4);
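I'm not sure I fully understand the question, but let me try. My best guess is that, by your definition, a "duplicate" is the same description (or description hash) at the same location (latitude/longitude or city). If so, you should be able to delete those records, while controlling which one is kept, by ranking them with the row_number analytic function.
For example, ordering by sid within each partition favors the lowest sid: any row with row number 1 is kept and everything else is deleted. If you need something other than sid, or secondary/tertiary criteria, just add them to the "order by".
In that case, these records can be deleted in a single step: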
with dupes as (
select
id, row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
from my_table_data
)
delete from my_table_data m
where exists (
select null
from dupes d
where
d.id = m.id and
d.rn > 1
)
Analytic functions and semi-joins are both very efficient, so this should run quickly on 1M records.
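Before running the delete for real, it can help to try it inside a transaction and look at what would be removed. The following is only a sketch under some assumptions not stated in the thread: that id is unique per row, and that adding id as a tie-breaker is acceptable when two rows in a group share the same sid. With the six sample rows from the question, the delete should return ids 2, 3 and 6.

```sql
-- Dry run: delete inside a transaction, inspect the removed rows, then roll back.
begin;

with dupes as (
    select
        id,
        row_number() over (
            partition by description_hash, latitude, longitude
            order by sid, id   -- id is an assumed tie-breaker for equal sid values
        ) as rn
    from my_table_data
)
delete from my_table_data m
where exists (
    select null
    from dupes d
    where d.id = m.id
      and d.rn > 1
)
returning m.id, m.service, m.sid;   -- shows exactly which rows were removed

-- Sanity check: any group listed here still has more than one row.
select description_hash, latitude, longitude, count(*) as remaining
from my_table_data
group by description_hash, latitude, longitude
having count(*) > 1;

rollback;   -- replace with commit once the output looks right
```

The sanity-check query should return no rows after the delete, because only the row numbered 1 survives in each partition.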
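Hope this gives you the building blocks you need to finish the task.
Sample data in readable text format would help. So would an explanation of the priority order you want. There is nothing called "priority" in the sample data, so your description is unclear.
The priority list is in the example .jpg results. I added them one after another.
Don't post images. See the reference on images of "code/errors"; the same applies to sample data and results. See the help on creating formatted text, then copy/paste the results into the question between lines containing only "```". Adding a "priority list" to an image doesn't work. You haven't explained what it is or what its items are. I'm sure you understand it perfectly, but only you do. A list of 4 items, however it is rearranged, is still just a list of items.
Sorry everyone, my mistake. I added a description of the priority and of what happens, removed the images, and added SQL for testing. I hope it is clear now; thanks for the tips.
@Hambone!!! I had thought of similar logic but didn't know the "row_number() over (partition by ...)" construct. I used your query plus a custom sort function that orders the data the way I specify, not just ASC/DESC. It works well and is much faster than my previous query. Custom sort function ->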
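The custom sort function mentioned in the last comment is not shown in the thread. One way such an ordering could be expressed without writing a function is PostgreSQL's built-in array_position (available since 9.5), which returns the position of a value in an array. The priority list below ('site', 'page', 'website', 'webpage') is purely hypothetical, as is the use of id as a tie-breaker; substitute the real order. With the six sample rows and this assumed list, the delete keeps ids 3, 4 and 5.

```sql
with dupes as (
    select
        id,
        row_number() over (
            partition by description_hash, latitude, longitude
            order by
                -- Hypothetical priority list: services earlier in the array win.
                -- Services missing from the array yield NULL, which sorts last
                -- in ascending order, so they are kept only if nothing else matches.
                array_position(array['site', 'page', 'website', 'webpage'], service),
                id   -- assumed tie-breaker so the result is deterministic
        ) as rn
    from my_table_data
)
delete from my_table_data m
where exists (
    select null
    from dupes d
    where d.id = m.id
      and d.rn > 1
);
```

If the priorities change often, keeping them in a small lookup table and joining to it inside the CTE may be easier to maintain than an inline array.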
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id | url | description | description_hash | city | latitude | longitude | service | sid |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| 3 | www.site.com/sdjbhsjhd17 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris | 48.85670 | 2.35146 | site | 3 |
| 4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886 | site | 3 |
| 5 | www.page.com/dgdg2wg3 | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886 | page | 2 |
+----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
select
id, url, description, description_hash, city, latitude, longitude, service, sid,
row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
from my_table_data