Oracle: find duplicate values in a table (and get their primary keys)
Tags: oracle, plsql, oracle11g, subquery

I'm having trouble selecting some values, which I've simplified to the following example. Basically, I have a table like this:
CREATE TABLE sample_table
(
  pk_id NUMBER,
  business_id NUMBER
);
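Now, some of the business ids in this table are duplicated, and I need to know the primary keys of those records.
Let's assume I further build and populate the table as follows: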
ALTER TABLE sample_table ADD (
  CONSTRAINT sample_table_PK
  PRIMARY KEY
  (pk_id));
create sequence sample_sequence;
-- fill pk_id from the sequence on every insert
create trigger sample_trigger before insert on sample_table for each row
begin
  :new.pk_id := sample_sequence.nextval;
end;
/
insert into sample_table (business_id) values (1000);
insert into sample_table (business_id) values (1001);
insert into sample_table (business_id) values (1002);
insert into sample_table (business_id) values (1003);
insert into sample_table (business_id) values (1003);
insert into sample_table (business_id) values (1004);
Now, it is easy to determine which business ids are duplicated:
SELECT business_id, COUNT (business_id)
FROM sample_table
GROUP BY business_id
HAVING COUNT (business_id) > 1;
But I don't want the business ids, I want the pk ids.
I can use the query above as a subquery to get them:
select * from sample_table where business_id in (
SELECT business_id
FROM sample_table
GROUP BY business_id
HAVING COUNT (business_id) > 1);
Or use COUNT(*) OVER (PARTITION BY ...) together with subquery factoring:
with q as
(SELECT pk_id, business_id, COUNT ( * ) OVER (PARTITION BY business_id) totalcount
FROM sample_table)
select * from q
where q.totalcount > 1;
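However, both approaches make my query quite slow. For this example either one works fine, but on production data of about 500,000 rows the performance is not that good, so I'd like to know whether there is a better way to do this.

With only the table and the PK index in place, the first query: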
SELECT * from sample_table where business_id in (
SELECT business_id
FROM sample_table
GROUP BY business_id
HAVING COUNT (business_id) > 1);
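will need a full table scan to evaluate the subquery, and then the main query needs another full scan, because the PK index is of no use for looking up the list of business ids that was found. You will see a plan like this: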
-----------------------------------------------...
| Id | Operation | Name | ...
-----------------------------------------------...
| 0 | SELECT STATEMENT | | ...
|* 1 | HASH JOIN RIGHT SEMI | | ...
| 2 | VIEW | VW_NSO_1 | ...
|* 3 | FILTER | | ...
| 4 | HASH GROUP BY | | ...
| 5 | TABLE ACCESS FULL| SAMPLE_TABLE | ...
| 6 | TABLE ACCESS FULL | SAMPLE_TABLE | ...
-----------------------------------------------...
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("BUSINESS_ID"="BUSINESS_ID")
3 - filter(COUNT(*)>1)
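To reproduce a plan like the one above on your own data, one option is EXPLAIN PLAN together with DBMS_XPLAN.DISPLAY. This is only a sketch: the operations and view names you actually get depend on the Oracle version, the optimizer statistics, and the data volume.

```sql
-- Ask the optimizer for the plan of the IN-subquery version
-- without actually running the query.
EXPLAIN PLAN FOR
SELECT *
  FROM sample_table
 WHERE business_id IN (SELECT business_id
                         FROM sample_table
                        GROUP BY business_id
                       HAVING COUNT(business_id) > 1);

-- Show the most recently explained plan, including the
-- predicate information section.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```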
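Throw a unique index on business_id and pk_id, in that order, and you should be able to drop the second table scan and use the index to look up only the business_ids that are duplicated. The first table scan is unavoidable, because every row has to be checked for possible duplicates. With the composite index, Oracle can look up the business id and get the pk id at the same time, without having to go back to the table.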
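A minimal sketch of such an index follows. The name business_id_idx is taken from the plan shown below; the statistics call is an optional extra step that is not part of the original answer, added on the assumption that fresh statistics help the optimizer consider the new index.

```sql
-- Composite index: business_id first so it can drive the lookup,
-- pk_id second so the primary key is read straight from the index.
-- UNIQUE is valid here because pk_id alone is already unique.
CREATE UNIQUE INDEX business_id_idx
    ON sample_table (business_id, pk_id);

-- Optional (assumption): refresh statistics for the table and its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'SAMPLE_TABLE',
    cascade => TRUE);
END;
/
```

Whether the optimizer actually switches to an index range scan still depends on how many business ids are duplicated.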
-------------------------------------------------...
| Id | Operation | Name |...
-------------------------------------------------...
| 0 | SELECT STATEMENT | |...
| 1 | NESTED LOOPS | |...
| 2 | VIEW | VW_NSO_1 |...
|* 3 | FILTER | |...
| 4 | HASH GROUP BY | |...
| 5 | TABLE ACCESS FULL| SAMPLE_TABLE |...
|* 6 | INDEX RANGE SCAN | BUSINESS_ID_IDX |...
-------------------------------------------------...
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(COUNT(*)>1)
6 - access("BUSINESS_ID"="BUSINESS_ID")
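If duplicates are the exception, this should work well. In the worst case, where every business id is duplicated, the index lookups could become expensive.
You could try something fancier, like this: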
SELECT business_id, LISTAGG(pk_id, ',') WITHIN GROUP (ORDER BY pk_id) AS pk_ids
FROM sample_table
GROUP BY business_id
HAVING COUNT(*) > 1;
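Now you only need a single full table scan, but all the pk_ids for a business_id end up concatenated in one row.

There are several ways to do this. I prefer a join, since it can speed up the query: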
SELECT
DISTINCT a.pk_id
FROM
sample_table a
JOIN sample_table b ON ( a.pk_id <> b.pk_id AND a.business_id = b.business_id )
Also, an index on business_id helps.
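As a rough sketch of that suggestion: the index name sample_table_biz_idx is hypothetical, and the ORDER BY is added here only to make the output easier to read.

```sql
-- Hypothetical index name; supports the equality join on business_id.
CREATE INDEX sample_table_biz_idx
    ON sample_table (business_id);

-- Self-join: a row is a duplicate if another row with a different
-- pk_id shares its business_id. DISTINCT collapses the repeats that
-- appear when a business_id occurs three or more times.
SELECT DISTINCT a.pk_id
  FROM sample_table a
  JOIN sample_table b
    ON a.business_id = b.business_id
   AND a.pk_id <> b.pk_id
 ORDER BY a.pk_id;
```

On the sample data from the question, assuming the sequence started at 1 and nothing else consumed values from it, this should return the pk_ids 4 and 5, the two rows with business_id 1003.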