SQL: delete records that have no match in another table
There are two tables linked by an id:
item_tbl (id)
link_tbl (item_id)
Some records in link_tbl have no matching row in item_tbl. A SELECT that counts them would be:
SELECT COUNT(*)
FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id
WHERE itm.id IS NULL
I would like to delete those orphaned records (the ones with no match in the other table) from link_tbl, but the only way I could think of was:
DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
There are 262086253 records in link_tbl
3033811 in item_tbl
16844347 orphaned records in link_tbl
The server has 4GB RAM and an 8-core CPU
EXPLAIN DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
returns:
Delete on link lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
-> Seq Scan on link lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..79298.10 rows=3063207 width=4)
-> Seq Scan on item itm (cost=0.00..52016.07 rows=3063207 width=4)
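The questions are:
Is there a better way to delete the orphaned records from link_tbl?
How accurate is the EXPLAIN above, or how long could deleting those records take?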
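- Edit: fixed according to Erwin Brandstetter's comment.
- Edit: the PostgreSQL version is 9.1
- Edit: some parts of postgresql.conf
- shared_buffers = 368MB
- temp_buffers = 32MB
- work_mem = 32MB
- maintenance_work_mem = 64MB
- max_stack_depth = 6MB
- fsync = off
- synchronous_commit = off
- full_page_writes = off
- wal_buffers = 16MB
- wal_writer_delay = 5000ms
- commit_delay = 10
- commit_siblings = 10
- effective_cache_size = 1600MB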
Resolution:
Thank you all for your advice, it was very helpful. I ended up using the DELETE suggested by Erwin Brandstetter, but I tweaked it a little:
DELETE FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 10000
AND lnk.item_id NOT IN (SELECT itm.id FROM item itm
WHERE itm.id BETWEEN 0 AND 10000)
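I compared the results of NOT IN and NOT EXISTS. The output is below; I used COUNT instead of DELETE, which I think should behave the same (for a relative comparison, I mean):
Maybe this: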
DELETE FROM link_tbl lnk
WHERE NOT EXISTS
( SELECT 1 FROM item_tbl item WHERE item.id = lnk.item_id );
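To make the NOT EXISTS form concrete, here is a minimal sketch that runs it on a tiny in-memory database. It uses Python's sqlite3 module purely as a stand-in so the example is self-contained; the helper names (make_demo_db, count_orphans, delete_orphans) and the sample rows are assumptions for illustration, and nothing here says anything about how PostgreSQL would plan the statement on the real tables.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for item_tbl / link_tbl (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item_tbl (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE link_tbl (item_id INTEGER)")
    conn.executemany("INSERT INTO item_tbl (id) VALUES (?)", [(1,), (2,), (3,)])
    # item_id 7 and 9 have no matching row in item_tbl: two orphans.
    conn.executemany(
        "INSERT INTO link_tbl (item_id) VALUES (?)",
        [(1,), (2,), (2,), (7,), (9,)],
    )
    conn.commit()
    return conn


def count_orphans(conn):
    """Count orphans with the LEFT JOIN ... IS NULL query from the question."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id = itm.id
        WHERE itm.id IS NULL
        """
    ).fetchone()
    return row[0]


def delete_orphans(conn):
    """Delete link_tbl rows with no matching item_tbl row; return rows deleted."""
    cur = conn.execute(
        """
        DELETE FROM link_tbl
        WHERE NOT EXISTS (
            SELECT 1 FROM item_tbl WHERE item_tbl.id = link_tbl.item_id
        )
        """
    )
    conn.commit()
    return cur.rowcount
```

Calling count_orphans before and after delete_orphans is a cheap way to check that the DELETE removed exactly the rows the counting query reported.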
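When dealing with a large number of records, it can be much more efficient to create a temp table, run INSERT INTO ... SELECT * FROM ..., then drop the original table, rename the temp table, and add the indexes back.
First: your text says:
I would like to delete those orphaned records from item_tbl
But your code says: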
DELETE FROM link_tbl lnk ...
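Then l.item_id BETWEEN 100001 AND 200000, and so on.
You cannot automate this with a function. That would wrap everything into a single transaction and defeat the purpose. So you have to script it from any client.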
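As a rough illustration of scripting the chunks from a client, the sketch below deletes one item_id range at a time and commits after each range, so every chunk is its own transaction. It uses Python's sqlite3 module only as a self-contained stand-in; the function name delete_orphans_in_batches and its parameters are hypothetical, and against PostgreSQL you would use a real driver (for example psycopg2, which uses %s placeholders instead of ?).

```python
import sqlite3


def delete_orphans_in_batches(conn, batch_size, max_item_id):
    """Delete orphaned link_tbl rows one item_id range at a time.

    Each range is committed on its own, so an interruption costs at most
    the batch in flight. Returns the total number of rows deleted.
    """
    total = 0
    low = 0
    while low <= max_item_id:
        high = low + batch_size - 1
        cur = conn.execute(
            """
            DELETE FROM link_tbl
            WHERE item_id BETWEEN ? AND ?
              AND NOT EXISTS (
                  SELECT 1 FROM item_tbl WHERE item_tbl.id = link_tbl.item_id
              )
            """,
            (low, high),
        )
        conn.commit()  # one transaction per batch
        total += cur.rowcount
        low = high + 1
    return total
```

The ranges here do not overlap (0..batch_size-1, then the next block, and so on). The batch size is a tuning knob: smaller batches mean shorter transactions but more round trips.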
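Or you could use dblink.
This additional module lets you run separate transactions in any database, including the one it is running in. It can do so over a persistent connection, which should remove most of the connection overhead.
For instructions on how to install it:
A DO statement will do the job (PostgreSQL 9.0 or later). It runs 100 DELETE commands, each covering 50000 item_ids: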
DO
$$
DECLARE
   _sql text;
BEGIN
   PERFORM dblink_connect('port=5432 dbname=mydb');  -- your connection parameters
   FOR i IN 0 .. 100
   LOOP
      -- build one DELETE per item_id range; each runs as its own transaction via dblink
      _sql := format('
         DELETE FROM link_tbl l
         WHERE  l.item_id BETWEEN %s AND %s
         AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)'
       , (50000 * i)::text
       , (50000 * (i+1))::text);
      PERFORM dblink_exec(_sql);
   END LOOP;
   PERFORM dblink_disconnect();
END
$$;
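If the script is interrupted: dblink_connect writes what it executes to the DB log, so you can see what has already been done.
I benchmarked four typical queries with different settings for {work_mem, effective_cache_size, random_page_cost}, the settings that have the largest influence on the chosen plan. I first did a "run-in" with the default settings to warm the cache.
Note: the test set is small enough for all the needed pages to be present in the cache.
The test set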
SET search_path=tmp;
/************************/
DROP SCHEMA IF EXISTS tmp CASCADE;  -- IF EXISTS so the first run does not fail
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE one
    ( id SERIAL NOT NULL PRIMARY KEY
    , payload varchar
    );
CREATE TABLE two
    ( id SERIAL NOT NULL PRIMARY KEY
    , one_id INTEGER REFERENCES one
    , payload varchar
    );
INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
-- link roughly 10% of the rows in two to a row in one; the rest keep one_id NULL
UPDATE two t
SET one_id = o.id
FROM one o
WHERE o.id = t.id
AND random() < 0.1;
-- double the size of two three times (30000 -> 240000 rows)
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
VACUUM ANALYZE one;
VACUUM ANALYZE two;
/***************/
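Results (summary)
As you can see, the NOT IN() variant is very sensitive to a shortage of work_mem. Agreed, a setting of 64 (kB) is very low, but it "more or less" corresponds to large data sets, which won't fit in hash tables either.
Extra: during the warm-up phase, the NOT EXISTS() query suffered from extreme FK-trigger contention. This appears to be the result of a conflict with the vacuum daemon, which is still active after the table setup: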
PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
NOT EXISTS()
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Delete on one o (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
-> Hash Anti Join (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
Hash Cond: (o.id = t.one_id)
-> Seq Scan on one o (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
-> Hash (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1015kB
-> Seq Scan on two t (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
Total runtime: 467824.652 ms
(9 rows)
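I think you'll want NOT EXISTS:
I've never seen a cost that high before, even when explaining obviously ridiculous statements. So I guess the answer to #2 is "longer than anyone would be willing to wait". Besides looking at the EXPLAIN for what lucas suggested, I'd also look at adding an index on just item_tbl.id, and at borrowing more RAM from a friend.
My guess is that either there are no suitable indexes/keys, or the expected hit ratio is too high (a low-entropy index). It could also be that work_mem is set too high and random_page_cost is at its default (:= equal to the sequential page cost).
BTW, simple math: deleting/touching 16M out of 252M means removing about 6% of the rows. If the distribution (do you have valid statistics?) is not too skewed, this effectively means you need to touch every page (plus the indexes), so a seq scan may be a good choice.
Please fix your question. The text says the opposite of what the code and the numbers say. Where are the orphans? You should also provide your PostgreSQL version, information about existing indexes, and whether there can be NULL values anywhere.
Though I favor NOT EXISTS too, I fear it will produce exactly the same plan.
@wildplasser I can't speak for PostgreSQL, but in DB2 I've seen big performance gains from using EXISTS... Also, here is an answer about SQL Server to the same effect:
@wildplasser, though having looked at your profile, you seem more qualified than me when it comes to PostgreSQL :)
I know. (Remember: I am a fan of NOT EXISTS too!) NOT EXISTS is a primitive; earlier plan generators had trouble with the duplicate (and NULL) removal in IN (), which often forced an extra sort pass over the subquery's result. Once the subquery can be detected, analyzed and merged, hashing takes care of all that. BTW: I am not more qualified. I tend to act on intuition. As if I were qualified...
@wildplasser: I ran some tests.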
\echo NOT EXISTS()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE NOT EXISTS ( SELECT * FROM two t
WHERE t.one_id = o.id
);
\echo NOT IN()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t)
;
\echo USING (subquery self LEFT JOIN two where NULL)
EXPLAIN ANALYZE
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
LEFT JOIN two t ON t.one_id = o2.id
WHERE t.one_id IS NULL
) sq
WHERE sq.id = o.id
;
\echo USING (subquery self WHERE NOT EXISTS(two)))
EXPLAIN ANALYZE
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
WHERE NOT EXISTS ( SELECT *
FROM two t WHERE t.one_id = o2.id
)
) sq
WHERE sq.id = o.id
;
                              NOT EXISTS()    NOT IN()   USING(LEFT JOIN NULL)   USING(NOT EXISTS)
1) rpc=4.0.csz=1M wmm=64            80.358   14389.026                  77.620              72.917
2) rpc=4.0.csz=1M wmm=64000         60.527      69.104                  51.851              51.004
3) rpc=1.5.csz=1M wmm=64            69.804   10758.480                  80.402              77.356
4) rpc=1.5.csz=1M wmm=64000         50.872      69.366                  50.763              53.339
5) rpc=4.0.csz=1G wmm=64            84.117    7625.792                  69.790              69.627
6) rpc=4.0.csz=1G wmm=64000         49.964      67.018                  49.968              49.380
7) rpc=1.5.csz=1G wmm=64            68.567    3650.008                  70.283              69.933
8) rpc=1.5.csz=1G wmm=64000         49.800      67.298                  50.116              50.345
legend:
rpc := "random_page_cost"
csz := "effective_cache_size"
wmm := "work_mem"
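One caveat when comparing these variants, which ties in with the earlier comment asking whether there can be NULL values anywhere: NOT IN and NOT EXISTS are only interchangeable when the subquery cannot return NULL. In the test set, two.one_id is nullable, and under standard SQL three-valued logic x NOT IN (..., NULL) is never true. The sketch below shows the difference on a three-row example; it uses Python's sqlite3 module as a stand-in, and the helper names and sample rows are assumptions for illustration only.

```python
import sqlite3


def make_null_demo_db():
    """Tables shaped like the test set's one/two, with a NULL in two.one_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE one (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE two (id INTEGER PRIMARY KEY, one_id INTEGER)")
    conn.executemany("INSERT INTO one (id) VALUES (?)", [(1,), (2,), (3,)])
    # One row references one.id = 1, one row has no reference at all (NULL).
    conn.executemany("INSERT INTO two (one_id) VALUES (?)", [(1,), (None,)])
    conn.commit()
    return conn


def count_unreferenced(conn):
    """Return (count via NOT IN, count via NOT EXISTS) of rows in one not referenced by two."""
    not_in = conn.execute(
        "SELECT COUNT(*) FROM one WHERE id NOT IN (SELECT one_id FROM two)"
    ).fetchone()[0]
    not_exists = conn.execute(
        """
        SELECT COUNT(*) FROM one o
        WHERE NOT EXISTS (SELECT 1 FROM two t WHERE t.one_id = o.id)
        """
    ).fetchone()[0]
    return not_in, not_exists
```

With the NULL present, the NOT IN form matches no rows while NOT EXISTS finds the two unreferenced ones. Adding WHERE one_id IS NOT NULL to the NOT IN subquery makes the two forms agree.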