Find all duplicate documents in a MongoDB collection by a key field

Suppose I have a collection containing a set of documents, like this:

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

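I want to find all duplicate entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears three times.

Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to find out how many times each value of a given field occurs across the documents: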

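// Emit a count of 1 for every document that has a name
var map = function () {
    if (this.name) {
        emit(this.name, 1);
    }
};

// Add up the counts emitted for each name
var reduce = function (key, values) {
    return Array.sum(values);
};

// Write the results to a collection so they can be queried afterwards.
// With { out: { inline: 1 } } no collection is created and res.result is undefined.
// "name_counts" is an arbitrary output collection name chosen for this example.
var res = db.collection.mapReduce(map, reduce, { out: "name_counts" });
db[res.result].find({ value: { $gt: 1 } }).sort({ value: -1 });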
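
To see what the map and reduce functions compute without a server, the same two steps can be imitated in plain JavaScript. This is only an illustrative sketch: countByName is a hypothetical helper, the docs array mirrors the sample documents from the question, and nothing here calls MongoDB.

```javascript
// Sample documents from the question (the _id fields are omitted for brevity)
const docs = [
  { id: 1, name: "foo" },
  { id: 2, name: "bar" },
  { id: 3, name: "baz" },
  { id: 4, name: "foo" },
  { id: 5, name: "bar" },
  { id: 6, name: "bar" },
];

// Hypothetical helper that imitates the map and reduce steps in memory
function countByName(documents) {
  const emitted = new Map(); // name -> list of emitted values
  for (const doc of documents) {
    if (doc.name) {
      // map step: emit(name, 1)
      if (!emitted.has(doc.name)) emitted.set(doc.name, []);
      emitted.get(doc.name).push(1);
    }
  }
  const counts = [];
  for (const [name, values] of emitted) {
    // reduce step: sum the emitted values for this name
    counts.push({ _id: name, value: values.reduce((a, b) => a + b, 0) });
  }
  return counts;
}

// Same filter and sort as the find() call: value > 1, highest count first
const mrDuplicates = countByName(docs)
  .filter((row) => row.value > 1)
  .sort((a, b) => b.value - a.value);

console.log(mrDuplicates);
// [ { _id: 'bar', value: 3 }, { _id: 'foo', value: 2 } ]
```

The output rows use the _id/value shape that mapReduce writes to its output collection, which is why the query above filters on value.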

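For a generic Mongo solution, see the canonical solution that uses group. Note that aggregation is faster and more powerful, because it can return the _id values of the duplicate records.

The accepted answer (using mapReduce) is not that efficient. Instead, we can use group. The equivalent SQL query would be:

SELECT name, COUNT(name) FROM prb GROUP BY name

Note that we still need to filter out the elements with a count of 0 from the resulting array. Again, see the canonical solution that uses group.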
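For large collections, the accepted answer is very slow, and it does not return the _id values of the duplicate records. Aggregation is faster and can return them.

In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores each _id value of the grouped records in uniqueIds. The $sum operator adds up the values of the field passed to it, in this case the constant 1, thereby counting the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to keep only the documents with a count of at least 2, i.e. the duplicates.

Then we sort so that the most frequent duplicates come first, and limit the results to the top 10.

This query outputs at most $limit records with duplicate names, along with their _id values. For example:

{
  "_id" : {
    "name" : "Toothpick"
  },
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv"
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}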
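
The stages described above can be sketched as the following pipeline. It is reconstructed from the stage-by-stage description and is not necessarily the exact query the answer used. The findDuplicatesByName helper and the items array are hypothetical and exist only so the snippet runs under Node.js without a server; against a real deployment the pipeline array would be passed to db.collection.aggregate(pipeline).

```javascript
// Pipeline matching the description: group by name, collect the _id values,
// count the group members, keep counts >= 2, sort descending, keep the top 10.
const pipeline = [
  { $group: {
      _id: { name: "$name" },
      uniqueIds: { $addToSet: "$_id" },
      count: { $sum: 1 },
  } },
  { $match: { count: { $gte: 2 } } },
  { $sort: { count: -1 } },
  { $limit: 10 },
];

// Hypothetical in-memory stand-in for the four stages (not a MongoDB API)
function findDuplicatesByName(docs, limit = 10) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.name)) {
      groups.set(doc.name, { _id: { name: doc.name }, uniqueIds: [], count: 0 });
    }
    const group = groups.get(doc.name);
    // $addToSet: store each _id once (MongoDB does not guarantee the order)
    if (!group.uniqueIds.includes(doc._id)) group.uniqueIds.push(doc._id);
    group.count += 1; // $sum: 1
  }
  return [...groups.values()]
    .filter((group) => group.count >= 2)   // $match
    .sort((a, b) => b.count - a.count)     // $sort
    .slice(0, limit);                      // $limit
}

// Made-up sample data using the _id values from the example output
const items = [
  { _id: "xzuzJd2qatfJCSvkN", name: "Toothpick" },
  { _id: "3vwny3YEj2qBsmmhA", name: "Broom" },
  { _id: "9bpewBsKbrGBQexv4", name: "Toothpick" },
  { _id: "gJeWGcuX6Wk69oFYD", name: "Broom" },
  { _id: "fi3Gscg9M64BQdArv", name: "Toothpick" },
  { _id: "aaaaaaaaaaaaaaaaa", name: "Mop" },
];

const duplicates = findDuplicatesByName(items);
console.log(JSON.stringify(duplicates, null, 2));
```

Using $gte: 2 rather than $gt: 1 is purely a matter of wording; both keep exactly the groups that contain duplicates.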
The aggregation pipeline can be used to easily identify documents with duplicate key values:

// Desired unique index:
// db.collection.createIndex({ firstField: 1, secondField: 1 }, { unique: true })

db.collection.aggregate([
  { $group: {
    _id: { firstField: "$firstField", secondField: "$secondField" },
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 }
  }},
  { $match: {
    count: { $gt: 1 }
  }}
])

~Ref: useful information on the official mongo lab blog

The highest accepted answer here has this:

uniqueIds: { $addToSet: "$_id" },

This also returns a new field called uniqueIds containing a list of IDs. But what if you only want the field and its count? Then it would be this:

db.collection.aggregate([
  {$group: { _id: {name: "$name"},
             count: {$sum: 1} } },
  {$match: { count: {"$gt": 1} } }
]);
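
When the uniqueIds field is kept, each result group already lists every _id that shares the key, which is enough to decide which documents to drop before creating a unique index. Below is a minimal sketch, assuming the policy is to keep the first _id in each group and remove the rest. idsToRemove is a hypothetical helper, and the groups array imitates aggregation output rather than coming from a real query. Since $addToSet does not guarantee element order, "first" is arbitrary unless the ids are sorted beforehand.

```javascript
// Imitation of aggregation output that includes uniqueIds
const groups = [
  {
    _id: { name: "Toothpick" },
    uniqueIds: ["xzuzJd2qatfJCSvkN", "9bpewBsKbrGBQexv4", "fi3Gscg9M64BQdArv"],
    count: 3,
  },
  {
    _id: { name: "Broom" },
    uniqueIds: ["3vwny3YEj2qBsmmhA", "gJeWGcuX6Wk69oFYD"],
    count: 2,
  },
];

// Hypothetical helper: keep the first id of every group, return all others
function idsToRemove(duplicateGroups) {
  return duplicateGroups.flatMap((group) => group.uniqueIds.slice(1));
}

const removable = idsToRemove(groups);
console.log(removable);

// In the shell, the list could then be passed to a delete such as:
//   db.collection.deleteMany({ _id: { $in: removable } })
```

Reviewing the list before deleting anything is advisable, because the surviving document in each group is chosen by position only.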
To explain this: if you come from SQL databases such as MySQL and PostgreSQL, you are used to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) that work together with the GROUP BY statement, allowing you, for example, to find the total number of times a column value appears in a table:

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

As you can see, the output shows how many times each my_type value appears. To find duplicates in MongoDB, we tackle the problem in a similar way. MongoDB has aggregation operations, which group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. The concept is similar to aggregate functions in SQL.

Assuming a collection named contacts, the initial setup looks like this:

db.contacts.aggregate([ ... ]);

The aggregate function takes an array of aggregation operators. In our case we need the $group operator, because our goal is to group the data by a field and count how many times each value of that field occurs:

db.contacts.aggregate([
  {$group: {
    _id: {name: "$name"}
    }
  }
]);

This approach has one quirk: the $group operator requires an _id field. In this case we group on the $name field. The key inside _id can have any name, but we use name because it is intuitive here.

Running the aggregation with only the $group operator gives a list of all the name values (whether they appear once or more than once in the collection):

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

Notice how the aggregation works: it reads the documents in the collection and returns one new document per distinct name value.

But what we want to know is how many times each field value occurs. The $group operator accepts a count field that uses the $sum operator to add the expression 1 to the group's total for every document in the group. So $group and $sum together return the number of documents for each value of the given field (e.g. name):

db.contacts.aggregate([
  {$group: {
    _id: {name: "$name"},
    count: {$sum: 1}
    }
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }

Since the goal is to eliminate duplicates, one more step is needed. To get only the groups whose count is greater than one, we can filter the results with the $match operator. Inside $match, we look at the count field and use the $gt ("greater than") operator with the number 1 to find counts greater than 1:

db.contacts.aggregate([
  {$group: { _id: {name: "$name"},
             count: {$sum: 1} } },
  {$match: { count: {"$gt": 1} } }
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }

As a side note, if you use MongoDB through an ORM such as Mongoid for Ruby, you may get this error:

The 'cursor' option is required, except for aggregate with the explain argument

This most likely means that your ORM is out of date and is performing operations that MongoDB no longer supports. So either update your ORM or find a fix. For Mongoid, this was my fix:

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end
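
The patch above has to cope with two reply shapes: when a cursor option is sent, the documents arrive under cursor.firstBatch, while the older reply format put them under result. The same selection logic is sketched below in JavaScript, to stay in the language of the shell examples. extractResult and both sample replies are hypothetical illustrations; only the field names cursor, firstBatch and result come from the Ruby code.

```javascript
// The raw command form that carries the cursor option looks like this in the shell:
//   db.runCommand({ aggregate: "contacts", pipeline: [ ... ], cursor: {} })

// Hypothetical helper mirroring extract_result from the Ruby patch
function extractResult(response) {
  return Object.prototype.hasOwnProperty.call(response, "cursor")
    ? response.cursor.firstBatch // reply to a command sent with a cursor option
    : response.result;           // older reply format without a cursor
}

// Made-up replies in the two shapes
const cursorReply = {
  cursor: { id: 0, firstBatch: [{ _id: { name: "Joan" }, count: 3 }] },
  ok: 1,
};
const legacyReply = {
  result: [{ _id: { name: "Rod" }, count: 3 }],
  ok: 1,
};

console.log(extractResult(cursorReply));
console.log(extractResult(legacyReply));
```

As in the Ruby version, only the first batch is read, so a result set larger than one batch would need follow-up getMore requests that this sketch does not issue.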