Cleaning up SQL data before adding a unique constraint
Before adding a unique constraint on two columns, I want to clean up some data in my table.
CREATE TABLE test (
a integer NOT NULL,
b integer NOT NULL,
c integer NOT NULL,
CONSTRAINT a_pk PRIMARY KEY (a)
);
INSERT INTO test (a,b,c) VALUES
(1,2,3)
,(2,2,3)
,(3,4,3)
,(4,4,4)
,(5,4,5)
,(6,4,4)
,(7,4,4);
-- SELECT a FROM test WHERE ????
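The output should be 2, 6, 7.
I am looking for every row that repeats a (b, c) combination after the first row with that combination.
Example:
Rows 1 and 2 have (2, 3) as (b, c).
Row 1 is fine because it is the first; row 2 is not.
Rows 4, 6 and 7 have (4, 4) as (b, c).
Row 4 is fine because it is the first; rows 6 and 7 are not.
I would then run: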
DELETE FROM test WHERE a = those IDs;
... and add the unique constraint.
I was thinking about an intersect of test with itself, but I am not sure where to go from there.
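For reference, the last step described above (adding the unique constraint on the two columns) might look like the sketch below. The constraint name test_b_c_key is an assumption made up for illustration; the question does not name it. PostgreSQL rejects this statement as long as duplicate (b, c) pairs remain, which is why the cleanup has to come first.

```sql
-- Assumption: test_b_c_key is a hypothetical constraint name.
-- Fails with a unique violation while duplicate (b, c) pairs still exist.
ALTER TABLE test
    ADD CONSTRAINT test_b_c_key UNIQUE (b, c);
```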
select o.a from test o
where exists ( select 'x'
from test i
where i.c = o.c
and i.b = o.b
and i.a < o.a
);
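One way to turn the SELECT above into the cleanup the question asks for is to feed it to a DELETE and then check that no duplicate pairs remain. This is only a sketch: wrapping the steps in a transaction is an assumption, not something the thread prescribes.

```sql
BEGIN;

-- Delete every row for which an earlier row (smaller a) has the same (b, c).
DELETE FROM test
WHERE a IN (
    SELECT o.a
    FROM test o
    WHERE EXISTS (
        SELECT 1
        FROM test i
        WHERE i.b = o.b
          AND i.c = o.c
          AND i.a < o.a
    )
);

-- Sanity check: should return no rows once the duplicates are gone.
SELECT b, c, count(*) AS n
FROM test
GROUP BY b, c
HAVING count(*) > 1;

COMMIT;
```

With the sample data from the question, the DELETE removes the rows with a = 2, 6 and 7.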
Courtesy of a colleague.
Using [...] should be faster than:
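I ran a couple of tests. The EXISTS variant proved to be substantially faster, as I expected and contrary to the claim above. Test bed: 10,000 rows on PostgreSQL 9.1.2 with decent settings: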
CREATE TEMP TABLE test (
a serial
,b int NOT NULL
,c int NOT NULL
);
INSERT INTO test (b,c)
SELECT (random()* 100)::int AS b, (random()* 100)::int AS c
FROM generate_series(1, 10000);
ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);
Between the first and the second round of tests, I ran:
ANALYZE test;
When I finally applied the DELETE, 3368 duplicates were removed. Performance may vary if you have substantially more or fewer duplicates.
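To see how many rows a cleanup would remove before running it, the number of surplus rows can be computed as the total row count minus the number of distinct (b, c) pairs. A minimal sketch against the test table above; the column alias duplicate_rows is made up here:

```sql
-- Rows that would be deleted = all rows - distinct (b, c) combinations.
SELECT (SELECT count(*) FROM test)
     - (SELECT count(*) FROM (SELECT DISTINCT b, c FROM test) AS d)
       AS duplicate_rows;
```

Because the test data is generated with random(), the result differs from run to run.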
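I ran each query a couple of times with EXPLAIN ANALYZE and took the best result. Generally, the best run hardly differs from the first or the worst.
A bare SELECT (without the DELETE) shows similar results.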
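Note that EXPLAIN ANALYZE actually executes the statement, so timing a DELETE repeatedly requires undoing it between runs. The answer does not say how this was done; one common approach, shown here as an assumption, is to wrap each run in a transaction that is rolled back:

```sql
BEGIN;

-- EXPLAIN ANALYZE really performs the DELETE and reports actual timings.
EXPLAIN ANALYZE
DELETE FROM test t
WHERE EXISTS (
    SELECT 1
    FROM test t1
    WHERE t1.a < t.a
      AND (t1.b, t1.c) = (t.b, t.c)
);

-- Undo the DELETE so the next run sees the same data.
ROLLBACK;
```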
1. CTE with rank()
Total runtime: 150.411 ms
Total runtime: 149.853 ms (after ANALYZE)
WITH x AS (
SELECT a
,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
FROM test
)
DELETE FROM test
USING x
WHERE x.a = test.a
AND rk > 1;
2. CTE with row_number()
Total runtime: 148.240 ms
Total runtime: 147.711 ms (after ANALYZE)
WITH x AS (
SELECT a
,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
FROM test
)
DELETE FROM test
USING x
WHERE x.a = test.a
AND rn > 1;
3. row_number() in a subquery
Total runtime: 134.753 ms
Total runtime: 134.298 ms (after ANALYZE)
DELETE FROM test
USING (
SELECT a
,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
FROM test
) x
WHERE x.a = test.a
AND rn > 1;
4. EXISTS semi-join
Total runtime: 143.777 ms
Total runtime: 69.072 ms (after ANALYZE)
DELETE FROM test t
WHERE EXISTS (
SELECT 1
FROM test t1
WHERE t1.a < t.a
AND (t1.b, t1.c) = (t.b, t.c)
);
Only after I forced the planner to avoid merge joins did it use a hash semi-join, again taking half the time:
SET enable_mergejoin = off;
Total runtime: 850.615 ms
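A plain SET changes the setting for the rest of the session. To limit such an experiment to a single test run, SET LOCAL can be used inside a transaction, since it reverts when the transaction ends. This is a sketch of that variation, not what the answer itself ran:

```sql
BEGIN;

-- SET LOCAL only lasts until the end of the current transaction.
SET LOCAL enable_mergejoin = off;

EXPLAIN ANALYZE
DELETE FROM test t
WHERE EXISTS (
    SELECT 1
    FROM test t1
    WHERE t1.a < t.a
      AND (t1.b, t1.c) = (t.b, t.c)
);

-- Restores both the deleted rows and the planner setting.
ROLLBACK;
```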
Update
The query planner has been improved since then. A retest with PostgreSQL 9.1.7 went straight to a hash semi-join.
I would try this:
delete from
test
where
a not in (
select min(a)
from test
group by b,c)
This runs in 20 to 60 ms on my machine, for what it's worth, and ANALYZE does not affect the plan:
Delete on test  (cost=237.50..412.50 rows=5000 width=6)
  ->  Seq Scan on test  (cost=237.50..412.50 rows=5000 width=6)
        Filter: (NOT (hashed SubPlan 1))
        SubPlan 1
          ->  HashAggregate  (cost=225.00..235.00 rows=1000 width=12)
                ->  Seq Scan on test  (cost=0.00..150.00 rows=10000 width=12)
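Comments:
I posted the results of a battery of tests that point in the opposite direction.
I underestimated the power of Postgres. "exists" is indeed somewhat faster than "rank". Your colleague gave you good advice.
My test results seem consistent. That's strange. On my Core i3-2120 workstation, with Postgres 9.1.3 installed from Fedora 16 64-bit packages, the 10000-row test runs in 5 to 7 ms, and according to EXPLAIN ANALYZE, exists is indeed about 1 ms faster than my rank.
@Tometzky: The hardware of my test server is 6 years old. It looks like it can no longer compete with current hardware.
@ErwinBrandstetter In the interest of fair play, could you compare the performance of the method I posted on your hardware?
@Davidadridge: For your reference, I have just finished a performance test for this kind of case. That is why I updated this question in the first place.
This looks like plain EXPLAIN output. Try getting actual times the same way.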