MongoDB: count and remove duplicate values
I have a large MongoDB collection with many duplicate inserts like these:
{ "_id" : 1, "val" : "222222", "val2" : "37"}
{ "_id" : 2, "val" : "222222", "val2" : "37" }
{ "_id" : 3, "val" : "222222", "val2" : "37" }
{ "_id" : 4, "val" : "333333", "val2" : "66" }
{ "_id" : 5, "val" : "111111", "val2" : "22" }
{ "_id" : 6, "val" : "111111", "val2" : "22" }
{ "_id" : 7, "val" : "111111", "val2" : "22" }
{ "_id" : 8, "val" : "111111", "val2" : "22" }
I want to count all the duplicates of each insert and keep only one unique entry per value in the DB, together with its count, like this:
{ "_id" : 1, "val" : "222222", "val2" : "37", "count" : "3"}
{ "_id" : 2, "val" : "333333", "val2" : "66", "count" : "1"}
{ "_id" : 3, "val" : "111111", "val2" : "22", "count" : "4" }
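I have looked at MapReduce and the aggregation framework, but they never give me the complete documents back; they only run a single calculation over the whole collection.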
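If you are on MongoDB 2.6, which added the $out stage, it is best to write the new data to a new collection. Here is an example using the aggregation framework: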
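db.duplicate.aggregate([
  {$group: {_id: "$val", count: {$sum: 1}}},     // count the documents per val
  {$project: {_id: 0, val: "$_id", count: 1}},   // expose the group key as val
  {$out: "deduplicate"}                          // write the result to a new collection
])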
This groups by val and counts the documents in each group. I hope it fits your case.
It may be easier to use incremental map-reduce:
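var mapper = function () {
  // key on both fields, so only documents matching on val and val2 count as duplicates
  emit({val1: this.val, val2: this.val2}, {count: 1});
};
var reducer = function (key, values) {
  // sum the partial counts emitted (or previously reduced) for this key
  var counter = 0;
  for (var i = 0; i < values.length; i++) {
    counter += values[i].count;
  }
  return {count: counter};
};
// assumed invocation: the source collection name "duplicate" is taken from the other answer;
// out: {reduce: ...} merges new results into the existing output collection
db.duplicate.mapReduce(mapper, reducer, {out: {reduce: "reducedcollection"}});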
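This produces a new collection named reducedcollection. Your values become the _id, and the count is stored under value.count. Note that both values together are used as the key in the new collection. To look up a specific instance, you can do: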
db.reducedcollection.findOne({'_id.val1': '333333', '_id.val2': '66'})
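The nice part is that you can now drop the old collection; when new data comes in, run the map-reduce on top of the reduced collection and the counts are incremented.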
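To make the incremental idea concrete, here is a minimal plain-JavaScript sketch that runs without a MongoDB server. It is only an illustration: the `reduced` map stands in for the output collection, `incrementalMapReduce` is a hypothetical helper (not a MongoDB API), and `emit` is passed to the mapper explicitly instead of being a shell global.

```javascript
// Plain-JavaScript sketch; no MongoDB server is involved.
// `reduced` plays the role of the output collection: JSON key -> {count}.
const reduced = new Map();

// Same logic as the mapper in the answer, with `emit` passed in explicitly.
function mapper(doc, emit) {
  emit({ val1: doc.val, val2: doc.val2 }, { count: 1 });
}

// Same logic as the reducer in the answer: sum the partial counts for one key.
function reducer(key, values) {
  let counter = 0;
  for (const v of values) counter += v.count;
  return { count: counter };
}

// Hypothetical helper: map a batch, reduce it, then fold the result into
// whatever is already stored, mimicking the `reduce` output mode.
function incrementalMapReduce(batch) {
  const emitted = new Map();
  for (const doc of batch) {
    mapper(doc, (key, value) => {
      const k = JSON.stringify(key);
      if (!emitted.has(k)) emitted.set(k, []);
      emitted.get(k).push(value);
    });
  }
  for (const [k, values] of emitted) {
    let value = reducer(JSON.parse(k), values);
    if (reduced.has(k)) {
      // Re-reduce the stored count together with the new partial count.
      value = reducer(JSON.parse(k), [reduced.get(k), value]);
    }
    reduced.set(k, value);
  }
}

// First batch: the eight documents from the question.
incrementalMapReduce([
  { _id: 1, val: "222222", val2: "37" },
  { _id: 2, val: "222222", val2: "37" },
  { _id: 3, val: "222222", val2: "37" },
  { _id: 4, val: "333333", val2: "66" },
  { _id: 5, val: "111111", val2: "22" },
  { _id: 6, val: "111111", val2: "22" },
  { _id: 7, val: "111111", val2: "22" },
  { _id: 8, val: "111111", val2: "22" },
]);
const key = JSON.stringify({ val1: "111111", val2: "22" });
const countAfterFirstRun = reduced.get(key).count;

// Later batch: one more duplicate arrives; only the new document is processed.
incrementalMapReduce([{ _id: 9, val: "111111", val2: "22" }]);
const countAfterSecondRun = reduced.get(key).count;

console.log(countAfterFirstRun, countAfterSecondRun);
```

The reducer has to accept its own output as input for this to work, which is why it sums `count` fields rather than counting array elements.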
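Could be handy?
If you showed us what you have tried, it would be easier to reproduce.
OK, but how can I pass my other values straight into the new documents?
Which other values? Give me an example, it will be simpler.
Say I have a value "val2" next to "val"; how can I pass it into the new documents?
Is val2 duplicated as well (like val)? The question is not very clear; it is easier to work with a real set of data.
Yes, it is the same as val. I updated my post to make things clearer.
That's nice, but how can I pass my other values straight into the new documents? It only keeps val and count.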
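On the follow-up about carrying val2 into the new documents: one possible approach, sketched here and not taken from the thread, is to group on both fields so that each becomes part of the group key and can be projected back out. The collection names `duplicate` and `deduplicate` are reused from the aggregation answer, and the sketch assumes val2 is identical across the duplicates of a given val, as the asker confirmed.

```javascript
// Sketch only: group on val and val2 together so both survive the $group stage.
var pipeline = [
  // Documents count as duplicates only when val and val2 both match.
  { $group: { _id: { val: "$val", val2: "$val2" }, count: { $sum: 1 } } },
  // Flatten the compound key back into top-level fields.
  { $project: { _id: 0, val: "$_id.val", val2: "$_id.val2", count: 1 } },
  // Write the result to a new collection (requires MongoDB 2.6 or later).
  { $out: "deduplicate" }
];

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.duplicate.aggregate(pipeline);
}
```

Because `_id` is excluded in the $project stage, MongoDB generates a fresh `_id` for each output document, and `count` is stored as a number, not as a string like in the question's sample output.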