
SQL: delete records that have no match in another table

Tags: sql, postgresql, exists, bigdata, sql-delete

There are two tables linked by an id:

item_tbl (id)
link_tbl (item_id)
There are some records in link_tbl that have no matching row in item_tbl. A SELECT counting them would be:

SELECT COUNT(*)
FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id
WHERE itm.id IS NULL
I would like to delete those orphaned records (the ones with no match in the other table) from link_tbl, but the only way I can think of is:

DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)

262086253 records in link_tbl
3033811 records in item_tbl
16844347 orphaned records in link_tbl

The server has 4 GB of RAM and an 8-core CPU.

EXPLAIN DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
returns:

Delete on link lnk  (cost=0.00..11395249378057.98 rows=131045918 width=6)
->  Seq Scan on link lnk  (cost=0.00..11395249378057.98 rows=131045918 width=6)
     Filter: (NOT (SubPlan 1))
     SubPlan 1
       ->  Materialize  (cost=0.00..79298.10 rows=3063207 width=4)
             ->  Seq Scan on item itm  (cost=0.00..52016.07 rows=3063207 width=4)
The questions are:

  • Is there a better way to delete the orphaned records from link_tbl?
  • How accurate is the EXPLAIN above, i.e. how long will it take to delete those records?

    • Edit: fixed according to Erwin Brandstetter's comment.
    • Edit: the PostgreSQL version is 9.1.
    • Edit: some parts of postgresql.conf:
    • shared_buffers = 368MB
    • temp_buffers = 32MB
    • work_mem = 32MB
    • maintenance_work_mem = 64MB
    • max_stack_depth = 6MB
    • fsync = off
    • synchronous_commit = off
    • full_page_writes = off
    • wal_buffers = 16MB
    • wal_writer_delay = 5000ms
    • commit_delay = 10
    • commit_siblings = 10
    • effective_cache_size = 1600MB
  • Resolution:

    Thank you all for the suggestions, they were very helpful. I ended up using the DELETE suggested by Erwin Brandstetter, but I tweaked it a little:

    DELETE FROM link_tbl lnk
    WHERE lnk.item_id BETWEEN 0 AND 10000
      AND lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm
                              WHERE itm.id BETWEEN 0 AND 10000)
    
    I compared the results of NOT IN and NOT EXISTS (both sketched below); the output was as follows. Even though I used COUNT instead of DELETE, I think the outcome should be the same, I mean in terms of a relative comparison:
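    For reference, a sketch of the two COUNT statements for that comparison, restricted to the same 0-10000 range (the exact statements and the measured numbers are not shown in the post):

    SELECT COUNT(*)                                  -- NOT IN variant
    FROM   link_tbl lnk
    WHERE  lnk.item_id BETWEEN 0 AND 10000
    AND    lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm
                               WHERE  itm.id BETWEEN 0 AND 10000);

    SELECT COUNT(*)                                  -- NOT EXISTS variant
    FROM   link_tbl lnk
    WHERE  lnk.item_id BETWEEN 0 AND 10000
    AND    NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id = lnk.item_id);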

    Maybe this:

    DELETE FROM link_tbl lnk
    WHERE NOT EXISTS
      ( SELECT 1 FROM item_tbl item WHERE item.id = lnk.item_id );
    

    When dealing with that many records, it may be far more efficient to create a temporary table, do an INSERT ... SELECT * FROM ... of only the rows you want to keep, then drop the original table, rename the new one and re-add the indexes...
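    A minimal sketch of that rebuild approach, using an ordinary table rather than a true TEMPORARY table (which would disappear at session end), and assuming no foreign keys, triggers or views depend on link_tbl; the table and index names here are illustrative, not from the post:

    BEGIN;
    CREATE TABLE link_tbl_new (LIKE link_tbl INCLUDING DEFAULTS);

    -- copy only the rows that still have a matching item
    INSERT INTO link_tbl_new
    SELECT lnk.*
    FROM   link_tbl lnk
    JOIN   item_tbl itm ON itm.id = lnk.item_id;

    DROP TABLE link_tbl;
    ALTER TABLE link_tbl_new RENAME TO link_tbl;

    -- re-create whatever indexes the original table had, e.g.:
    CREATE INDEX link_tbl_item_id_idx ON link_tbl (item_id);
    COMMIT;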

    First of all, the text says:

    "I would like to delete those orphaned records from item_tbl."

    But your code says:

    DELETE FROM link_tbl lnk ...

    Assuming the code (deleting from link_tbl) is what you actually want, delete in chunks over item_id: add a range to the WHERE clause of the DELETE, e.g. l.item_id BETWEEN 0 AND 100000 for the first run, then l.item_id BETWEEN 100001 AND 200000, and so on.

    You cannot automate this with a function: that would wrap everything into a single transaction and defeat its purpose. So you have to script it from any client.
    Or you can use dblink:

    This additional module allows you to run separate transactions in any database, including the one it is running in. That can be done over a persistent connection, which removes most of the connection overhead. For instructions on how to install it:
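    On PostgreSQL 9.1 (the OP's version), installing it usually comes down to a single statement, assuming the contrib packages are available on the server:

    CREATE EXTENSION dblink;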

    A DO statement will do the job (PostgreSQL 9.0 or later). It runs 100 DELETE commands, each covering a chunk of 50,000 item_ids at a time:

    DO
    $$
    DECLARE
       _sql text;
    BEGIN
    
    PERFORM dblink_connect('port=5432 dbname=mydb');  -- your connection parameters
    
    FOR i IN 0 .. 100
    LOOP
       _sql := format('
       DELETE FROM link_tbl l
       WHERE  l.item_id BETWEEN %s AND %s
       AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)'
       , (50000 * i)::text
       , (50000 * (i+1))::text);
    
       PERFORM  dblink_exec(_sql);
    END LOOP;
    
    PERFORM dblink_disconnect();
    
    END
    $$
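    For illustration, the first statement generated and executed by the loop above (i = 0) is:

    DELETE FROM link_tbl l
    WHERE  l.item_id BETWEEN 0 AND 50000
    AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)

    Each of these statements runs in its own transaction over the dblink connection, so every chunk is committed independently.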
    

    If the script gets interrupted: the statements executed via dblink_exec end up in the database log, so you can see which chunks are already done.
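    Apart from the log, a quick way to check how much work remains is to re-run the orphan count from the question, here in its NOT EXISTS form (a sketch):

    SELECT COUNT(*)
    FROM   link_tbl lnk
    WHERE  NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id = lnk.item_id);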

    I benchmarked four typical queries, with different settings for {work_mem, effective_cache_size, random_page_cost}, since these settings have the largest influence on the selected plan. I first did a "run-in" with my default settings to warm the caches. Note: the test set is small enough for all the needed pages to fit in cache.

    The test set:

    SET search_path=tmp;
    
    /************************/
    DROP SCHEMA tmp CASCADE;
    CREATE SCHEMA tmp ;
    SET search_path=tmp;
    
    CREATE TABLE one
            ( id SERIAL NOT NULL PRIMARY KEY
            , payload varchar
            );
    
    CREATE TABLE two
            ( id SERIAL NOT NULL PRIMARY KEY
            , one_id INTEGER REFERENCES one
            , payload varchar
            );
    
    INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
    INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
    
    
    UPDATE two t
    SET one_id = o.id
    FROM one o
    WHERE o.id = t.id
    AND random() < 0.1;
    
    INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
    INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
    INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
    
    VACUUM ANALYZE one;
    VACUUM ANALYZE two;
    /***************/
    
    \echo NOT EXISTS()
    EXPLAIN ANALYZE
    DELETE FROM one o
    WHERE NOT EXISTS ( SELECT * FROM two t
            WHERE t.one_id = o.id
            );
    
    \echo NOT IN()
    EXPLAIN ANALYZE 
    DELETE FROM one o
    WHERE o.id NOT IN ( SELECT one_id FROM two t)
            ;
    
    \echo USING (subquery self LEFT JOIN two where NULL)
    EXPLAIN ANALYZE
    DELETE FROM one o
    USING (
            SELECT o2.id
            FROM one o2
            LEFT JOIN two t ON t.one_id = o2.id
            WHERE t.one_id IS NULL
            ) sq
    WHERE sq.id = o.id
            ;
    
    \echo USING (subquery self WHERE NOT EXISTS(two)))
    EXPLAIN ANALYZE
    DELETE FROM one o
    USING (
            SELECT o2.id
            FROM one o2
            WHERE NOT EXISTS ( SELECT *
                    FROM two t WHERE t.one_id = o2.id
                    )
            ) sq
    WHERE sq.id = o.id
            ;
    
    The results (summary; times in ms):

                            NOT EXISTS()    NOT IN()        USING(LEFT JOIN NULL)   USING(NOT EXISTS)
    1) rpc=4.0.csz=1M wmm=64        80.358  14389.026       77.620                  72.917
    2) rpc=4.0.csz=1M wmm=64000     60.527  69.104          51.851                  51.004
    3) rpc=1.5.csz=1M wmm=64        69.804  10758.480       80.402                  77.356
    4) rpc=1.5.csz=1M wmm=64000     50.872  69.366          50.763                  53.339
    5) rpc=4.0.csz=1G wmm=64        84.117  7625.792        69.790                  69.627
    6) rpc=4.0.csz=1G wmm=64000     49.964  67.018          49.968                  49.380
    7) rpc=1.5.csz=1G wmm=64        68.567  3650.008        70.283                  69.933
    8) rpc=1.5.csz=1G wmm=64000     49.800  67.298          50.116                  50.345
    
    legend: 
    rpc := "random_page_cost"
    csz := "effective_cache_size"
    wmm := "work_mem"

    As you can see, the NOT IN() variant is very sensitive to a shortage of work_mem. Agreed, a setting of 64 (KB) is very low, but this more or less corresponds to "large" data sets, which won't fit into hash tables either.
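    For reference, the settings from the legend above can be changed per session before re-running the EXPLAIN ANALYZE statements; the values are the ones used in the table, though the author's exact procedure is not shown:

    SET random_page_cost     = 4.0;      -- the runs also use 1.5
    SET effective_cache_size = '1MB';    -- the runs also use '1GB'
    SET work_mem             = '64kB';   -- the runs also use 64000 (kB)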

    Extra: during the warm-up phase, the NOT EXISTS() query suffered from extreme FK-trigger contention. This appears to be the result of a conflict with the vacuum daemon, which was still active after the table setup:

    PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
    NOT EXISTS()
                                                               QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------
     Delete on one o  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
       ->  Hash Anti Join  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
             Hash Cond: (o.id = t.one_id)
             ->  Seq Scan on one o  (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
             ->  Hash  (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
                   Buckets: 32768  Batches: 1  Memory Usage: 1015kB
                   ->  Seq Scan on two t  (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
     Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
     Total runtime: 467824.652 ms
    (9 rows)
    

    • I think you'll want NOT EXISTS.
    • I have never seen a cost that high before, not even when explaining obviously absurd statements. So I guess the answer to #2 is "longer than anyone is willing to wait". Besides looking at the EXPLAIN, as lucas suggested, I would check whether an index on just item_tbl.id helps, and consider borrowing some more RAM from a friend.
    • My guess is that either there is no suitable index/key, or the expected hit rate is too high (a low-entropy index). It could also be that work_mem is set too high while random_page_cost is still at its default. BTW, simple math: deleting/touching 16M out of 252M rows means removing about 6% of the rows. If the distribution is not too skewed (do you have valid statistics?), that effectively means you will have to touch every page (plus the indexes), and a seq scan may well be a reasonable choice.
    • Please fix your question: the text says the opposite of what the code and the numbers say. Where are the orphans? You should also state your PostgreSQL version, the existing indexes, and whether NULL values are possible anywhere. Although I am also in favor of NOT EXISTS, I am afraid it will yield exactly the same plan.
    • @wildplasser I can't speak for PostgreSQL, but in DB2 I have dramatically improved performance by using EXISTS... likewise, there is an answer about SQL Server to the same effect. Also, @wildplasser, having looked at your profile, you seem far more qualified than me on the PostgreSQL side :)
    • I know. (Remember: I am a fan of NOT EXISTS too!) NOT EXISTS is a primitive; older plan generators had problems with the duplicate (and NULL) removal that IN () requires, which often forced an extra sort pass over the result of the subquery. Once subqueries could be detected, analyzed and merged, a hash could handle it all. BTW, I am not more qualified; I tend to operate on instinct, as if I were qualified... @wildplasser: I did some tests.
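    One detail behind the NOT IN vs. NOT EXISTS discussion above: the two are not equivalent when the subquery can return NULLs. A tiny generic illustration (not from the thread):

    -- As soon as the subquery produces a NULL, "x NOT IN (...)" evaluates to NULL
    -- instead of TRUE, so no rows qualify; NOT EXISTS is unaffected.
    SELECT 1 WHERE 1 NOT IN (SELECT unnest(ARRAY[2, NULL]::int[]));     -- returns no row
    SELECT 1 WHERE NOT EXISTS (
        SELECT 1 FROM unnest(ARRAY[2, NULL]::int[]) AS u WHERE u = 1);  -- returns one row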
    