The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from the former table into the new one while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and avoids both an inner query to search for duplicates and a temporary table that would have to be kept in memory (so it also suits large data sources).

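This is how it can be done. Suppose we have a table employee with the following columns:

employee (id, first_name, last_name, start_date, ssn)

To delete the rows with a duplicate ssn column, keeping only the first entry found, the following process can be followed:

-- create a new tmp_employee table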
CREATE TABLE tmp_employee LIKE employee;

-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);

-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
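
Technical explanation
  • Statement 1 creates a new tmp_employee table with exactly the same structure as the employee table
  • Statement 2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates
  • Statement 3 scans the original employee table by id, inserting new employee entries into the new tmp_employee table while ignoring duplicated entries
  • Statement 4 renames the tables, so that the new employee table holds all the entries without duplicates and a backup copy of the former data is kept in the backup_employee table
⇒ Using this approach, 1.6M records were reduced to 6k in less than 200 seconds.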
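
Before and after running the procedure, it can help to check which ssn values are duplicated. The following is only a minimal sketch, assuming the employee table described above; the alias n_rows is an illustrative name, not part of the original answer.

```sql
-- List every ssn that appears more than once, with its number of occurrences.
-- Run it on the original table to see the duplicates, and again on the new
-- employee table after the RENAME to confirm that none remain.
SELECT ssn, COUNT(*) AS n_rows
FROM employee
GROUP BY ssn
HAVING COUNT(*) > 1
ORDER BY n_rows DESC;
```

Note that a UNIQUE index in MySQL allows several NULL values, so rows with a NULL ssn are not treated as duplicates by INSERT IGNORE and may still show up as one group in this check.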

Following this process, you can quickly and easily remove all your duplicates and create a UNIQUE constraint by running:

CREATE TABLE tmp_jobs LIKE jobs;

ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);

INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;

RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;
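
Of course, this process can be further modified to fit different needs when deleting duplicates. Some examples follow.

✔ Variation for keeping the last entry instead of the first one
Sometimes we need to keep the last duplicated entry instead of the first one.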

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
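  • In statement 3, the ORDER BY id DESC clause makes the last ids take priority over the rest

✔ Variation for performing some task on the duplicates, for example keeping a count of the duplicates found
Sometimes we need to perform some further processing on the duplicated entries found (such as keeping a count of the duplicates).

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

-- the column list is explicit because tmp_employee now has one more column than employee
INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • In statement 3, a new column n_duplicates is created
  • In statement 4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query performs an extra update when a duplicate is found (in this case, increasing a counter). The same kind of query can be used to perform other types of updates on the duplicates found.

✔ Variation for regenerating the auto-incremental field id
Sometimes we use an auto-incremental field and, to keep the index as compact as possible, we can take advantage of the deletion of duplicates to regenerate the auto-incremental field in the new temporary table.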

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
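  • In statement 3, instead of selecting all the fields of the table, the id field is skipped so that the DB engine generates a new one automatically

✔ Further variations
Many further modifications are also doable depending on the desired behavior. As an example, the following queries use a second temporary table to 1) keep the last entry instead of the first one; 2) increase a counter on the duplicates found; and 3) regenerate the auto-incremental field id while keeping the entry order as it was in the former data.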

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

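-- explicit column list: tmp_employee has the extra n_duplicates column
INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;

CREATE TABLE tmp_employee2 LIKE tmp_employee;

-- id is left out so that it is regenerated in the original entry order
INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;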

DROP TABLE tmp_employee;

RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;

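A solution that is easy to understand and works without a primary key:

1) Add a new boolean column

alter table mytable add tokeep boolean;
2) Add a constraint on the duplicated columns AND the new column

alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);
3) Set the boolean column to true. Because of the new constraint, this will succeed on only one of the duplicated rows

update ignore mytable set tokeep = true;
4) Delete the rows that have not been marked as tokeep

delete from mytable where tokeep is null;
5) Drop the added column

alter table mytable drop tokeep;

I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.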
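
To make the five steps concrete, here is a small worked sketch. The sample rows and the two INT columns are assumptions made up for illustration; only the statements of steps 1 to 5 come from the answer itself.

```sql
-- Hypothetical sample table (no primary key) with one duplicated pair of rows.
CREATE TABLE mytable (mycol1 INT, mycol2 INT);
INSERT INTO mytable VALUES (1, 1), (1, 1), (2, 2);

-- Steps 1 to 5 as described above.
ALTER TABLE mytable ADD tokeep BOOLEAN;
ALTER TABLE mytable ADD CONSTRAINT preventdupe UNIQUE (mycol1, mycol2, tokeep);
UPDATE IGNORE mytable SET tokeep = TRUE;
DELETE FROM mytable WHERE tokeep IS NULL;
ALTER TABLE mytable DROP tokeep;

-- One (1, 1) row and one (2, 2) row should remain.
SELECT mycol1, mycol2 FROM mytable;

-- Once tokeep is dropped, preventdupe covers (mycol1, mycol2) only, so this
-- insert is expected to be rejected with a duplicate-key error.
INSERT INTO mytable VALUES (1, 1);
```

The approach relies on a UNIQUE index accepting any number of NULL values: before step 3 every row has tokeep set to NULL, so the constraint can be created even though duplicates still exist.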

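Delete duplicate rows using the DELETE JOIN statement
MySQL provides the DELETE JOIN statement, which you can use to remove duplicate rows quickly.

The following statement deletes duplicate rows and keeps the highest id: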

DELETE t1 FROM contacts t1
    INNER JOIN
contacts t2 WHERE
t1.id < t2.id AND t1.email = t2.email;

If you have a large table with a huge number of records, the above solution may not work or may take too long. In that case there is a different solution:

-- Create temporary table

CREATE TABLE temp_table LIKE table1;

-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(title, company,site_id);

-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;

-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;
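
I found a simple way (it keeps the latest row):

DELETE t1 FROM tablename t1 INNER JOIN tablename t2
WHERE t1.id < t2.id AND t1.column1 = t2.column1 AND t1.column2 = t2.column2;
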
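As of version 8.0 (2018), MySQL finally supports window functions.

Window functions are both handy and efficient. Here is a solution that demonstrates how to use them to solve this task.

In a subquery, we can use ROW_NUMBER() to assign a position to each record in the table within column1/column2 groups, ordered by id. If there are no duplicates, the record gets row number 1. If duplicates exist, they are numbered by ascending id (starting at 1).

Once the records are properly numbered in the subquery, the outer query just deletes all records whose row number is not 1.

Query: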

DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
        SELECT 
            id, 
            ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id) rn
        FROM tablename
    ) t
    WHERE rn > 1
)
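
Since the DELETE is destructive, it may be worth previewing the affected rows first. The query below is a sketch that reuses the same numbering and only lists the rows; it assumes the table is called tablename with columns id, column1 and column2, as in the statement above, and needs MySQL 8.0 or later.

```sql
-- Preview: rows numbered 2 or higher within each (column1, column2) group
-- are the ones the DELETE would remove.
SELECT id, column1, column2
FROM (
    SELECT
        id,
        column1,
        column2,
        ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
    FROM tablename
) AS t
WHERE rn > 1
ORDER BY column1, column2, id;
```

Because the numbering is ordered by ascending id, the row kept in each group is the one with the lowest id; ordering by id DESC inside OVER() would keep the highest one instead.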

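To copy the records so that a set of columns that should be unique, e.g. COL1, COL2, COL3, is not duplicated (suppose the unique key on these 3 columns was missed in the table structure and several duplicate entries were created in the table):

DROP TABLE IF EXISTS TABLE_NAME_copy;
CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
-- SELECT * with this GROUP BY is rejected when the ONLY_FULL_GROUP_BY SQL mode is enabled
INSERT INTO TABLE_NAME_copy
SELECT * FROM TABLE_NAME
GROUP BY COLUMN1, COLUMN2, COLUMN3;
DROP TABLE TABLE_NAME;
ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;

Hope this helps developers remove duplicate records from a table.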

delete from job s 
where rowid < any 
(select rowid from job k 
where s.site_id = k.site_id and 
s.title = k.title and 
s.company = k.company);

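This removes the duplicate rows that have the same title, company and site_id values. One row per group is kept (with the comparison above, the one with the highest rowid) and all the other duplicates are deleted. Note that rowid is a pseudocolumn found in databases such as Oracle rather than in MySQL, so the statement is not MySQL syntax as written.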

-- collect one id per group of duplicated rows
CREATE TEMPORARY TABLE IF NOT EXISTS _temp_duplicates AS (
    SELECT MIN(dub.id) AS id
    FROM table_with_duplications dub
    GROUP BY dub.field_must_be_uniq_1, dub.field_must_be_uniq_2
    HAVING COUNT(*) > 1
);

-- removes one row per group; drop _temp_duplicates and repeat both steps while groups with more than two copies remain
DELETE FROM table_with_duplications WHERE id IN (SELECT id FROM _temp_duplicates);

CREATE TABLE tableToclean_temp LIKE tableToclean;
ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
DROP TABLE tableToclean;
RENAME TABLE tableToclean_temp TO tableToclean;

An equivalent form, which also keeps the row with the highest rowid in each group (rowid is a pseudocolumn of databases such as Oracle, not of MySQL):

delete from job s 
where rowid not in 
(select max(rowid) from job k 
where s.site_id = k.site_id and
s.title = k.title and 
s.company = k.company);

-- Here is what I used, and it works:
create table temp_table like my_table;
-- t_id is my unique column
-- min(id) keeps one id per t_id and also works with ONLY_FULL_GROUP_BY enabled
insert into temp_table (id) select min(id) from my_table group by t_id;
delete from my_table where id not in (select id from temp_table);
drop table temp_table;

DELETE t1 FROM table_name t1
JOIN table_name t2
WHERE
    t1.id < t2.id AND
    t1.title = t2.title AND t1.company = t2.company AND t1.site_id = t2.site_id;