Postgres: accidental near-duplicates in the users table scattered the foreign keys in another table. How do I repair and merge the foreign keys?

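For my application, I am using Postgres.

The problem: every user should be associated with N cases, which defines a one-to-many relationship. Because of an application logic bug, users are frequently duplicated in the database, so any given person ends up with multiple ids.

Since most users have these near-duplicates, each person is almost always represented by Y ids in the `users` table.

In this context, a near-duplicate means two rows that are mostly similar. Here is an example of a near-duplicate: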

|  id | first_name | last_name | str_adrr      | 
------------------------------------------------
|  1  | Mary       | Doe       | 124 Main Ave  | 
|  2  | Mary       | Doe       | 124 Main St   |

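The goal is to delete all but one of the near-duplicate users, keeping a single user, and to reassign all related cases to that single user, so that users and cases end up in a clean one-to-many relationship.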

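My approach, step 1: I ran fuzzy matching on the users and grouped them, using a cluster id as the identifier. The `cluster_id` column denotes the group itself; all rows that share the same `cluster_id` are considered duplicates of one another.

Here is a sample of the `users` table: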

|  id | first_name | last_name | str_adrr      | group                   | cluster_id
-------------------------------------------------------------------------------------
|  1  | Mary       | Doe       | 124 Main Ave  | Mary Doe 124 Main Ave   | 1
|  2  | Mary       | Doe       | 124 Main St   | Mary Doe 124 Main Ave   | 1
|  7  | Mary       | Doe       | 124 Main Ave  | Mary Doe 124 Main Ave   | 1
|  4  | Mary       | Does      | 124 Main Ave  | Mary Doe 124 Main Ave   | 1
|  5  | James      | Smith     | 14 Street NW  |James Smith 14 Street NW | 2
|  6  | James      | Smith     | 14 Street NW  |James Smith 14 Street NW | 2
| 10  | James      | Smth      | 14 Street NW  |James Smith 14 Street NW | 2
| 11  | Paula      | James     | 21 River SW   | Paula James21 River SW  | 3
| 45  | Paula      | James     | 21 River SW   | Paula James21 River SW  | 3

There is another table named `cases`. Here is a sample of the relevant columns in that table:

|  id | user_id
---------------
|  1  | 1  # corresponds to mary
|  2  | 2  # corresponds to mary
|  3  | 4  # corresponds to mary
|  4  | 7  # corresponds to mary
|  5  | 10 # corresponds to james
|  6  | 11 # corresponds to paula
|  7  | 45 # corresponds to paula
|  8  | 1  # corresponds to mary
|  9  | 10 # corresponds to james
|  10 | 10 # corresponds to james
|  11 | 6  # corresponds to james

`user_id` in the `cases` table corresponds to `id` in the `users` table.

A single user id can have many cases (up to several thousand).

Step 2: I joined the `users` and `cases` tables.

Here is a sample of the resulting table, `user_cases`:

|cluster_id| user_id| case_id
----------------------------------
|  1       | 1      | 1
|  1       | 1      | 8
|  1       | 2      | 2
|  1       | 4      | 3
|  1       | 7      | 4
|  2       | 10     | 5
|  2       | 10     | 9
|  2       | 10     | 10
|  2       | 6      | 11
|  3       | 11     | 6
|  3       | 45     | 7
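
The join in step 2 could be written roughly as follows. This is a minimal sketch, assuming `users` has the columns `id` and `cluster_id` and `cases` has the columns `id` and `user_id`, as in the samples above. The result table is named `user_cases` here to match the name used in step 4 and in the answer.

```sql
-- Sketch only: one row per case, tagged with the cluster of the user who owns it.
-- Assumes users(id, cluster_id) and cases(id, user_id) as in the sample tables.
CREATE TABLE user_cases AS
SELECT u.cluster_id,
       u.id AS user_id,
       c.id AS case_id
FROM users AS u
JOIN cases AS c
  ON c.user_id = u.id;
```

Because this is an inner join, users that have no cases at all do not appear in `user_cases`.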

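Step 3: I need to determine which `user_id` within a given `cluster_id` group is associated with the largest number of cases in the `user_cases` table.

I was able to do this and ended up with a `max_cluster_user` table, shaped like this: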

|cluster_id| user_id| case_id_count
-------------------------------------
| 1        | 1      | 2 
| 2        | 10     | 3 
| 3        | 11     | 1

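To read it: the first row says that for the `cluster_id` with value `1`, the `user_id` with the most cases is `1`, and the number of cases is given by `case_id_count`, which is `2`.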
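
The question does not show how `max_cluster_user` was built. One possible way, sketched here under the assumption that `user_cases` has the columns `cluster_id`, `user_id` and `case_id`, is to count cases per user and keep the top row of each cluster with PostgreSQL's `DISTINCT ON`. The tie-break on the lowest `user_id` is an assumption added for determinism, not something stated in the question.

```sql
-- Sketch only: pick, per cluster, the user_id that owns the most cases.
-- DISTINCT ON keeps the first row of each cluster_id according to ORDER BY.
CREATE TABLE max_cluster_user AS
SELECT DISTINCT ON (cluster_id)
       cluster_id,
       user_id,
       count(*) AS case_id_count
FROM user_cases
GROUP BY cluster_id, user_id
ORDER BY cluster_id,
         count(*) DESC,  -- most cases first
         user_id;        -- assumed tie-break: lowest id wins
```

With the sample data, clusters 1 and 2 resolve to users 1 and 10. In cluster 3 both users have one case each, so the tie-break decides which one is kept.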

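Step 4, where I need help: I then need to update the `user_cases` table (or create a new table with the same shape) so that every row in a `cluster_id` group has the same `user_id`. The result should look like this: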

|cluster_id| user_id| case_id
----------------------------------
|  1       | 1      | 1
|  1       | 1      | 8
|  1       | 1      | 2
|  1       | 1      | 3
|  1       | 1      | 4
|  2       | 10     | 5
|  2       | 10     | 9
|  2       | 10     | 10
|  2       | 10     | 11
|  3       | 11     | 6
|  3       | 11     | 7
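
A quick way to check whether the table has reached this shape is to look for clusters that still contain more than one distinct `user_id`. This is a minimal sketch, assuming the `user_cases` columns shown above; once the merge is complete it should return no rows.

```sql
-- Sketch only: list clusters whose cases still point at several different users.
SELECT cluster_id,
       count(DISTINCT user_id) AS distinct_users
FROM user_cases
GROUP BY cluster_id
HAVING count(DISTINCT user_id) > 1;
```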

I don't know how to accomplish this. The constraint is that it must be done in PostgreSQL-compatible SQL.

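Procedural sketch of step 4: I did write it out as code to think it through procedurally, in case that helps. I know this is not a viable solution, though, because with more than 500k records this kind of logic would take days to run.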

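# max_cluster_user refers to the table of the same name
for cluster in max_cluster_user:
    # get the users in this particular cluster
    cluster_users = [user for user in users if user['cluster_id'] == cluster['cluster_id']]
    # users refers to the table of the same name
    for user in cluster_users:
        # get the cases associated with the given id
        user_cases = [case for case in cases if case['user_id'] == user['id']]
        for user_case in user_cases:
            # update the case's user_id
            user_case['user_id'] = cluster['user_id']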

Thanks in advance.

I think you just need an `update` with a join for step 4:

update user_cases uc
    set user_id = mcu.user_id
    from max_cluster_user mcu
    where mcu.cluster_id = uc.cluster_id and
          uc.user_id <> mcu.user_id;
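
The statement above fixes the derived `user_cases` table. To reach the goal stated in the question (one surviving user per cluster, with all cases pointing at it), the same idea can be applied to the real `cases` table and followed by a delete. This is a rough sketch, assuming `users(id, cluster_id)`, `cases(id, user_id)` and `max_cluster_user(cluster_id, user_id)` as described in the question, and assuming no other table references `users.id`.

```sql
BEGIN;

-- Repoint every case at the surviving user of its owner's cluster.
UPDATE cases AS c
SET user_id = mcu.user_id
FROM users AS u
JOIN max_cluster_user AS mcu
  ON mcu.cluster_id = u.cluster_id
WHERE c.user_id = u.id
  AND c.user_id <> mcu.user_id;

-- Remove the near-duplicates, which no longer own any case.
DELETE FROM users AS u
USING max_cluster_user AS mcu
WHERE u.cluster_id = mcu.cluster_id
  AND u.id <> mcu.user_id;

COMMIT;
```

Clusters in which no user has any case never appear in `max_cluster_user`, so their duplicates are left untouched by this sketch and would need a separate rule for choosing the survivor.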

I will accept your answer. I just tested it and it works well! Thanks.