Removing duplicates from a MySQL database
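I have a database with more than 7,000 records. As it turns out, there are multiple duplicates among those records. I have found several suggestions on how to delete duplicates and keep only one record.
But in my case things are a bit more complicated: a record is not a duplicate merely because it holds the same data as another record. On the contrary, there are several cases in which holding the same data is perfectly fine. Records count as duplicates only if they hold the same data and were all inserted within 30 seconds.
Therefore I need a SQL statement that deletes duplicates (i.e. rows that match in all fields except id and datetime) if they were inserted within 40 seconds of each other (i.e. by evaluating the datetime field).
Since I am not an SQL expert and could not find a suitable solution online, I really hope some of you can help me out and point me in the right direction. Many thanks!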
The table structure is as follows:
CREATE TABLE IF NOT EXISTS `wp_ttr_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`schoolyear` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
`datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`area` varchar(15) CHARACTER SET utf8 NOT NULL,
`content` varchar(10) CHARACTER SET utf8 NOT NULL,
`types` varchar(100) CHARACTER SET utf8 NOT NULL,
`tasksWrong` varchar(300) DEFAULT NULL,
`tasksRight` varchar(300) DEFAULT NULL,
`tasksData` longtext CHARACTER SET utf8,
`parent_id` varchar(20) DEFAULT NULL,
UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=68696 ;
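To clarify once more, a record is a duplicate if:
[1] it holds the same data as another record in all fields except id and datetime
[2] according to the datetime field, it was inserted into the database within 40 seconds of another record with the same values
If both conditions are met, all records except one should be deleted.

This might work, but it probably won't be fast: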
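-- field1, field2, .... stand for every column except id and datetime
DELETE FROM dupes
USING wp_ttr_results AS dupes
INNER JOIN wp_ttr_results AS origs
   ON dupes.field1 = origs.field1
   AND dupes.field2 = origs.field2
   AND ....
   -- use > rather than <>: with <>, two rows sharing the same datetime
   -- would each match the other and both would be deleted
   AND dupes.id > origs.id
   AND dupes.`datetime` BETWEEN origs.`datetime` AND (origs.`datetime` + INTERVAL 40 SECOND)
;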
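
Before running a destructive DELETE like the one above, it can help to list the rows the join would match. The following is only a sketch of that idea, written in Python against an in-memory SQLite database so it runs without a MySQL server. The table `results`, its columns `user_id`, `area` and `ts` (seconds as an integer), and the helper `find_duplicate_ids` are all assumptions made for illustration; they are not the real `wp_ttr_results` schema.

```python
import sqlite3


def find_duplicate_ids(conn, window_seconds=40):
    """Return the ids the self-join treats as duplicates, without deleting anything."""
    rows = conn.execute(
        """
        SELECT DISTINCT dupes.id
        FROM results AS dupes
        INNER JOIN results AS origs
            ON dupes.user_id = origs.user_id
           AND dupes.area = origs.area
           AND dupes.id > origs.id
           AND dupes.ts BETWEEN origs.ts AND origs.ts + ?
        ORDER BY dupes.id
        """,
        (window_seconds,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data: ts is the insert time in seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER PRIMARY KEY, user_id INTEGER, area TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO results (id, user_id, area, ts) VALUES (?, ?, ?, ?)",
    [
        (1, 7, "math", 1000),  # original record
        (2, 7, "math", 1010),  # same data, 10 s later: duplicate
        (3, 7, "math", 2000),  # same data, much later: legitimate
        (4, 8, "math", 1005),  # different user: legitimate
    ],
)

print(find_duplicate_ids(conn))  # [2]
```

In MySQL the same preview could be obtained by turning the DELETE into a SELECT over the same join, which is worth doing on a backup copy of the table first.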
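
As @Juru pointed out in the comments, this one calls for a rather surgical knife. It can, however, be done iteratively with a stored procedure.
First, we use a self-join to identify the first duplicate of each record that is not itself a duplicate: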
SELECT DISTINCT
  MIN(postdups.id) AS id
FROM wp_ttr_results AS base
-- postdups: later rows with the same data, inserted less than 40 seconds after base
INNER JOIN wp_ttr_results AS postdups
  ON base.id<postdups.id
  AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
  AND base.user_id=postdups.user_id
  -- nullable columns use <=> so that NULL matches NULL
  AND base.schoolyear<=>postdups.schoolyear
  AND base.area=postdups.area
  AND base.content=postdups.content
  AND base.types=postdups.types
  AND base.tasksWrong<=>postdups.tasksWrong
  AND base.tasksRight<=>postdups.tasksRight
  AND base.parent_id<=>postdups.parent_id
-- predups: earlier rows that would make base itself a duplicate
LEFT JOIN wp_ttr_results AS predups
  ON base.id>predups.id
  AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
  AND base.user_id=predups.user_id
  AND base.schoolyear<=>predups.schoolyear
  AND base.area=predups.area
  AND base.content=predups.content
  AND base.types=predups.types
  AND base.tasksWrong<=>predups.tasksWrong
  AND base.tasksRight<=>predups.tasksRight
  AND base.parent_id<=>predups.parent_id
-- keep only base rows that are not duplicates themselves
-- note: tasksData (longtext) is not compared; add it if it must match too
WHERE predups.id IS NULL
GROUP BY base.id
;
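Once the rows with these ids are deleted, the first duplicate of each record is gone, and this may change which of the remaining rows count as duplicates.
Finally, we have to loop this process and exit the loop as soon as the SELECT DISTINCT returns nothing.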
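
To see why the loop matters, the passes can be simulated outside the database. The sketch below is plain Python rather than MySQL, with made-up rows of the form `(id, data, ts)`, where `data` stands for all compared columns and `ts` is the insert time in seconds. The function names `first_duplicates` and `clean_up` are hypothetical and only mirror the steps described above.

```python
def first_duplicates(rows, window=40):
    """One pass: for every row that is not itself a duplicate,
    pick the smallest id among its duplicates."""

    def is_dup_of(later, earlier):
        # same data, higher id, inserted less than `window` seconds after `earlier`
        return (
            later[0] > earlier[0]
            and later[1] == earlier[1]
            and later[2] - earlier[2] < window
        )

    to_delete = set()
    for base in rows:
        # Skip rows that duplicate an earlier row (the LEFT JOIN ... IS NULL part).
        if any(is_dup_of(base, pre) for pre in rows):
            continue
        post_ids = [row[0] for row in rows if is_dup_of(row, base)]
        if post_ids:
            to_delete.add(min(post_ids))
    return to_delete


def clean_up(rows, window=40):
    """Repeat the pass until it finds nothing more to delete."""
    rows = sorted(rows)
    while True:
        doomed = first_duplicates(rows, window)
        if not doomed:
            return rows
        rows = [row for row in rows if row[0] not in doomed]


# A chain of identical records inserted at 0 s, 30 s, 60 s and 65 s.
sample = [(1, "a", 0), (2, "a", 30), (3, "a", 60), (4, "a", 65)]
print([row[0] for row in clean_up(sample)])  # [1, 3]
```

In this sample, row 3 is within 40 seconds of row 2 but not of row 1. Once row 2 is gone, row 3 is no longer a duplicate of anything and is kept, whereas a single join over the untouched table would also have matched it.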
Putting it all together in a stored procedure:
DELIMITER ;;
CREATE PROCEDURE cleanUpDuplicates()
BEGIN
  DECLARE numDuplicates INT;
  -- ITERATE is a reserved word in MySQL, so the loop gets a different label
  dedupe_loop: LOOP
    DROP TEMPORARY TABLE IF EXISTS cleanUpDuplicatesTemp;
    -- collect the first duplicate of every row that is not itself a duplicate
    CREATE TEMPORARY TABLE cleanUpDuplicatesTemp
      SELECT DISTINCT
        MIN(postdups.id) AS id
      FROM wp_ttr_results AS base
      INNER JOIN wp_ttr_results AS postdups
        ON base.id<postdups.id
        AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
        AND base.user_id=postdups.user_id
        -- nullable columns use <=> so that NULL matches NULL
        AND base.schoolyear<=>postdups.schoolyear
        AND base.area=postdups.area
        AND base.content=postdups.content
        AND base.types=postdups.types
        AND base.tasksWrong<=>postdups.tasksWrong
        AND base.tasksRight<=>postdups.tasksRight
        AND base.parent_id<=>postdups.parent_id
      LEFT JOIN wp_ttr_results AS predups
        ON base.id>predups.id
        AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
        AND base.user_id=predups.user_id
        AND base.schoolyear<=>predups.schoolyear
        AND base.area=predups.area
        AND base.content=predups.content
        AND base.types=predups.types
        AND base.tasksWrong<=>predups.tasksWrong
        AND base.tasksRight<=>predups.tasksRight
        AND base.parent_id<=>predups.parent_id
      WHERE predups.id IS NULL
      GROUP BY base.id;
    -- stop once a pass finds nothing left to delete
    SELECT COUNT(*) INTO numDuplicates FROM cleanUpDuplicatesTemp;
    IF numDuplicates<=0 THEN
      LEAVE dedupe_loop;
    END IF;
    DELETE FROM wp_ttr_results
      WHERE id IN
      (SELECT id FROM cleanUpDuplicatesTemp);
  END LOOP dedupe_loop;
  DROP TEMPORARY TABLE IF EXISTS cleanUpDuplicatesTemp;
END;;
DELIMITER ;