SQL: delete duplicates from a table in priority order


I have a table with some sample data:

+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
| id |            url             |           description           |         description_hash         |  city  | latitude | longitude | service | sid |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
|  1 | www.website.com/sdadsd12d1 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | website |   1 |
|  2 | www.page.com/gfdg3df2f2    | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | page    |   2 |
|  3 | www.site.com/sdjbhsjhd17   | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | site    |   3 |
|  4 | www.site.com/sdsdadqwd12   | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886  | site    |   3 |
|  5 | www.page.com/dgdg2wg3      | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | page    |   2 |
|  6 | www.webpage.com/8f8fj2h    | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | webpage |   4 |
+----+----------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
My task is to remove duplicate rows. I want unique combinations of description (hash), service and latitude (city). Until today I had been using the following query:

    update my_table_data
    set description_hash = md5(description::text)
    where description_hash is null;

    DROP VIEW IF EXISTS temp_view_duplicates;
    CREATE VIEW temp_view_duplicates AS WITH A   
    AS  (
       SELECT Distinct
              description_hash
         ,    service
       FROM  my_table_data
    )
    ,   B  
    AS  (
        SELECT description_hash
        FROM   A
        GROUP BY
               description_hash
        HAVING COUNT(*) > 1
    ), C
    AS (
    SELECT  A.description_hash,
            A.service
    FROM    A
        JOIN B
            ON A.description_hash = B.description_hash
            order by description_hash
    ), D AS
    (
    select distinct latitude, description_hash, service
    from my_table_data
    where description_hash in (SELECT description_hash FROM C)
    order by description_hash
    ), E AS
    (SELECT description_hash, latitude
    FROM   D
    GROUP BY
           description_hash, latitude
    HAVING COUNT(*) > 1)
      SELECT min(ctid) as min_ctid, description_hash, latitude
    FROM   my_table_data
        WHERE description_hash in (SELECT description_hash FROM E)
        group by description_hash, latitude
        order by description_hash;
                                    
    DELETE FROM my_table_data a USING (
      SELECT min_ctid, description_hash, latitude
        FROM  temp_view_duplicates
      ) b
      WHERE a.description_hash = b.description_hash AND a.latitude = b.latitude
      AND a.ctid <> b.min_ctid;
Now I want to change the query so that the deletion takes the service order (priority, the sid column) into account.

Two example results with their priority lists:

Priority:

  • webpage
  • page
  • site
  • website

Priority:

  • site
  • page
  • website
  • webpage

Overview:

  • I am using PostgreSQL
  • I use an MD5 hash because the descriptions are long and comparing them directly takes too much time
  • I want to run this query against roughly 1M rows every day

Does anyone have an idea? I have been thinking about this all day and I am stuck. I thought about custom-ordering the rows based on the hash, sorted by description.

Edit:

What the priority is about:

The current query deletes duplicate records randomly, i.e. based on ctid, so I have no control over which records get deleted.

My problem is that I want to control this, and to be able to define it through a priority list, from the most important service down to the least important one.

The deletion logic should look like this: when you hit a duplicate, check which services the rows come from, and keep the row whose service sits highest on the priority list.
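Expressed in SQL terms, one way to encode such a priority list is to rank each duplicate group by the position of its service in an array. This is only a hedged sketch: the array literal below is the second example priority list from above, not something taken from the original query, and it assumes PostgreSQL 9.5+ for array_position.

    ```sql
    -- Keep, per (description_hash, latitude, longitude) group, the row whose
    -- service appears earliest in the priority array; delete the rest.
    -- The priority list here ('site' first) is only an illustrative assumption.
    -- array_position returns NULL for services not in the list, and NULLs
    -- sort last by default, so unknown services lose to listed ones.
    with ranked as (
      select
        id,
        row_number() over (
          partition by description_hash, latitude, longitude
          order by array_position(array['site', 'page', 'website', 'webpage'], service)
        ) as rn
      from my_table_data
    )
    delete from my_table_data m
    using ranked r
    where r.id = m.id
      and r.rn > 1;
    ```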

SQL for testing:

    create table my_table_data(
    id int,
        url text,
        description text,
        description_hash text,
        city text,
        latitude double precision,
        longitude double precision,
        service text,
        sid int
    );
    
    insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) 
    values(1, 'www.website.com/sdadsd12d1', 'Some description here version 1',  '94b35433ecd64545db9c9129b877ea49', 'Paris',    48.85670,   2.35146,    'website',  1);
    
    insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) 
    values(2, 'www.page.com/gfdg3df2f2',    'Some description here version 1',  '94b35433ecd64545db9c9129b877ea49', 'Paris',    48.85670,   2.35146,    'page', 2);
    
    insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) 
    values(3, 'www.site.com/sdjbhsjhd17',   'Some description here version 1',  '94b35433ecd64545db9c9129b877ea49', 'Paris',    48.85670,   2.35146,    'site', 3);
    
    insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) 
    values(4, 'www.site.com/sdsdadqwd12',   'Some description here version 1',  '94b35433ecd64545db9c9129b877ea49', 'Berlin',   52.51704,   13.38886,   'site', 3);
    
    insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) 
    values(5, 'www.page.com/dgdg2wg3',  'Some description here version 2',  '764ed2b4f0d28e45332816c7beedb706', 'Berlin',   52.51704,   13.38886,   'page', 2);
    
    insert into my_table_data(id, url, description, description_hash, city, latitude, longitude, service, sid) 
    values(6, 'www.webpage.com/8f8fj2h',    'Some description here version 2',  '764ed2b4f0d28e45332816c7beedb706', 'Berlin',   52.51704,   13.38886,   'webpage',  4);
    

I'm not sure how to answer this, but let me try.

My best guess is that, by your definition, a "duplicate" is the same description (or description hash) at the same location (latitude/longitude, or city).

If that is the case, then to delete those records, prioritizing some of them by a definition of your own, you should be able to rank them with the row_number analytic function.

For example, this query prefers the lowest sid: any row with row number 1 is kept, everything else is deleted. If you need a criterion other than sid, or secondary/tertiary criteria, just add them to the order by.

In that case, the records can be deleted in a single step:

    with dupes as (
      select
        id, row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
      from my_table_data
    )
    delete from my_table_data m
    where exists (
      select null
      from dupes d
      where
        d.id = m.id and
        d.rn > 1
    );
    
Both the analytic function and the semi-join are very efficient; 1M records should be quite fast.
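For a daily run over 1M rows, most of the cost is the sort feeding the window function, so an index matching the partition/order keys may help the planner. This is a hypothetical index, not something from the question, and whether it is used depends on the planner and table statistics:

    ```sql
    -- Hypothetical supporting index: matches the partition by / order by keys
    -- of the row_number() call, so the planner can read rows pre-sorted
    -- instead of performing a full sort.
    create index if not exists idx_my_table_data_dedup
      on my_table_data (description_hash, latitude, longitude, sid);
    ```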


Hopefully this gives you the building blocks you need to get the job done.

Comments:

  • Sample data in readable text format would help, and so would an explanation of the priority order you want. There is no column called priority in the sample data, so your description is unclear.
  • The priority lists are in the example-result .jpg. I added one after the other.
  • Don't post images. The same guidance that applies to code/errors also applies to sample data and results. Create formatted text and copy/paste the results into the question between lines containing only ```. Adding a "priority list" to an image does not work: you never explain what it is or what the items mean. I'm sure you understand exactly what it is, but only you do. A list of 4 items, however you rearrange it, is still just a list of items.
  • Sorry guys, my mistake. I added a description of the priority and of what happens, removed the images, and added SQL for testing. I hope it is unambiguous now.
  • Thanks for the tip @Hambone! I thought of similar logic but didn't know row_number() over (partition by ...). I used your query plus a custom sort function, which orders the data exactly the way I specify rather than just ASC/DESC. It actually works, and it is much faster than my previous query. Custom sort function ->
    +----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
    | id |           url            |           description           |         description_hash         |  city  | latitude | longitude | service | sid |
    +----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
    |  3 | www.site.com/sdjbhsjhd17 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Paris  | 48.85670 | 2.35146   | site    |   3 |
    |  4 | www.site.com/sdsdadqwd12 | Some description here version 1 | 94b35433ecd64545db9c9129b877ea49 | Berlin | 52.51704 | 13.38886  | site    |   3 |
    |  5 | www.page.com/dgdg2wg3    | Some description here version 2 | 764ed2b4f0d28e45332816c7beedb706 | Berlin | 52.51704 | 13.38886  | page    |   2 |
    +----+--------------------------+---------------------------------+----------------------------------+--------+----------+-----------+---------+-----+
    
    
    select
      id, url, description, description_hash, city, latitude, longitude, service, sid,
      row_number() over (partition by description_hash, latitude, longitude order by sid) as rn
    from my_table_data
    
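The "custom sort function" mentioned in the comments is never shown. One hedged way to reproduce the kept rows in the result table above (site before page before website before webpage) is to replace sid in the order by with an explicit CASE expression; this is a reconstruction under that assumption, not the OP's actual function:

    ```sql
    -- Hypothetical reconstruction of the custom ordering: rank services by an
    -- explicit priority list instead of sid. Unknown services sort last (5).
    with dupes as (
      select
        id,
        row_number() over (
          partition by description_hash, latitude, longitude
          order by case service
                     when 'site'    then 1
                     when 'page'    then 2
                     when 'website' then 3
                     when 'webpage' then 4
                     else 5
                   end
        ) as rn
      from my_table_data
    )
    delete from my_table_data m
    where exists (
      select null
      from dupes d
      where d.id = m.id
        and d.rn > 1
    );
    ```

Against the sample data this keeps ids 3, 4 and 5, matching the result table above.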