Cleaning up SQL data before adding a unique constraint

I want to clean up some data in a table before putting a unique constraint on two columns.

CREATE TABLE test (
 a integer NOT NULL,
 b integer NOT NULL,
 c integer NOT NULL,
 CONSTRAINT a_pk PRIMARY KEY (a)
);

INSERT INTO test (a,b,c) VALUES
 (1,2,3)
,(2,2,3)
,(3,4,3)
,(4,4,4)
,(5,4,5)
,(6,4,4)
,(7,4,4);

-- SELECT a FROM test WHERE ????

The output should be 2, 6, 7.

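I am looking for every row that repeats a (b, c) combination after the first row with that combination.

Example:

Rows 1 and 2 have (2, 3) as (b, c). Row 1 is fine because it is the first one; row 2 is not.

Rows 4, 6 and 7 have (4, 4) as (b, c). Row 4 is fine because it is the first one; rows 6 and 7 are not.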

I would then:

DELETE FROM test WHERE a IN (/* those IDs */);

... and add the unique constraint.
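As a rough sketch of those last steps, assuming the duplicates have already been deleted: the check query below should return no rows, after which the constraint can be added. The constraint name test_b_c_uni is made up for illustration.

```sql
-- Sanity check: should return zero rows once the duplicates are gone.
SELECT b, c, count(*) AS copies
FROM   test
GROUP  BY b, c
HAVING count(*) > 1;

-- Hypothetical constraint name; use whatever fits your naming scheme.
ALTER TABLE test ADD CONSTRAINT test_b_c_uni UNIQUE (b, c);
```

If the check still returns rows, the ALTER TABLE statement fails with a unique violation and the table is left unchanged.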

I was thinking about an intersect of test with itself, but I am not sure where to go from there.

select o.a from test o
where exists ( select 'x' 
                 from test i
                where i.c = o.c
                  and i.b = o.b
                  and i.a < o.a
            );

Thanks to a colleague.

Using ... should be faster than:


I ran a couple of tests. The EXISTS variant turned out to be substantially faster, as I expected and contrary to the claim above.

Test bed with 10,000 rows on PostgreSQL 9.1.2, with decent settings:

CREATE TEMP TABLE test (
  a serial
 ,b int NOT NULL
 ,c int NOT NULL
);

INSERT INTO test (b,c)
SELECT (random()* 100)::int AS b, (random()* 100)::int AS c
FROM   generate_series(1, 10000);

ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);

Between the first and second round of tests, I ran:

ANALYZE test;

When I finally applied the DELETE, 3368 duplicates were removed. Performance may vary if you have substantially more or fewer duplicates.

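I ran each query a couple of times with EXPLAIN ANALYZE and took the best result. Generally, the best run hardly differs from the first or the worst. A bare SELECT (without the DELETE) shows similar results.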
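The post does not say how the table was reset between runs. One way to repeat such a measurement, shown here only as a sketch, relies on the fact that EXPLAIN ANALYZE really executes the statement: wrapping it in a transaction and rolling back leaves the data untouched for the next run.

```sql
BEGIN;

-- EXPLAIN ANALYZE executes the DELETE for real, so the rows are removed
-- inside this transaction while the timing is measured.
EXPLAIN ANALYZE
DELETE FROM test t
WHERE  EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );

-- Undo the DELETE so the next run sees the same rows again.
ROLLBACK;
```

Any of the DELETE variants can be dropped into the same wrapper in place of the one shown.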

1. CTE with rank()
Total runtime: 150.411 ms
Total runtime: 149.853 ms (after ANALYZE)

WITH x AS (
    SELECT a
          ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rk > 1;

2. CTE with row_number()
Total runtime: 148.240 ms
Total runtime: 147.711 ms (after ANALYZE)

WITH x AS (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rn > 1;

3. row_number() in a subquery
Total runtime: 134.753 ms
Total runtime: 134.298 ms (after ANALYZE)

DELETE FROM test
USING (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )  x
WHERE  x.a = test.a
AND    rn > 1;

4. EXISTS semi-join
Total runtime: 143.777 ms
Total runtime: 69.072 ms (after ANALYZE)

DELETE FROM test t
WHERE EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );
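Only after I forced the planner to avoid merge joins did it use a hash semi-join, taking half the time again:

SET enable_mergejoin = off;

Total runtime: 850.615 ms

Update
The query planner has been improved since then. A retest with PostgreSQL 9.1.7 went straight to a hash semi-join.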
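A plain SET changes enable_mergejoin for the rest of the session. For a one-off experiment it may be safer to confine the override to a single transaction with SET LOCAL, as in this sketch, which also rolls the DELETE back:

```sql
BEGIN;

-- SET LOCAL lasts only until the end of the current transaction.
SET LOCAL enable_mergejoin = off;

EXPLAIN ANALYZE
DELETE FROM test t
WHERE  EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );

-- Both the DELETE and the planner setting are undone here.
ROLLBACK;
```

If the session-wide form was used instead, RESET enable_mergejoin; restores the default.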

I would be tempted to try:

delete from
  test
where
  a not in (
    select   min(a)
    from     test
    group by b,c);

This runs in 20 to 60 ms on my machine, for what it's worth. ANALYZE does not affect the plan.

 Delete on test  (cost=237.50..412.50 rows=5000 width=6)
   ->  Seq Scan on test  (cost=237.50..412.50 rows=5000 width=6)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  HashAggregate  (cost=225.00..235.00 rows=1000 width=12)
                 ->  Seq Scan on test  (cost=0.00..150.00 rows=10000 width=12)

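I posted the results of a battery of tests that point in the opposite direction.

I underestimated the power of Postgres. EXISTS is indeed a bit faster than rank(). Your colleague gave you good advice.

My test results seem consistent. This is strange. On my Core i3-2120 workstation, with Postgres 9.1.3 installed from the Fedora 16 64-bit packages, the 10,000-row test runs in 5-7 ms according to EXPLAIN ANALYZE, and EXISTS is indeed about 1 ms faster than my rank().

@Tometzky: The hardware of my test server is 6 years old. It looks like it can no longer compete with current hardware.

@ErwinBrandstetter For a level playing field, could you compare the performance of the method I posted on your hardware?

@Davidadridge: I would like to refer you to a performance test I just completed for this kind of case. That is why I updated this question in the first place.

That looks like EXPLAIN output. Try using the same method to get actual times.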