Using MongoDB aggregation to merge a collection into fixed-size batches


I have a collection like this:

{
    "_id" : id1,
    "field1" : 11,
    "field2": 101,
    "localityID" : 27
}
{
    "_id" : id2,
    "field1" : 22,
    "field2": 202,
    "localityID" : 27
}
{
    "_id" : id3,
    "field1" : 33,
    "field2": 303,
    "localityID" : 27
}
{
    "_id" : id4,
    "field1" : 44,
    "field2": 404,
    "localityID" : 27
}
{
    "_id" : id5,
    "field1" : 55,
    "field2": 505,
    "localityID" : 27
}
{
    "_id" : id6,
    "field1" : 66,
    "field2": 606,
    "localityID" : 61
}
{
    "_id" : id7,
    "field1" : 77,
    "field2": 707,
    "localityID" : 61
}

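Use case: I want to retrieve and process records that share the same localityID in batches of size 3. For tracking purposes, I also want to keep track of which records were processed in a particular batch.

To that end, I want to use MongoDB's aggregation framework to combine records with the same localityID into groups of a fixed size (3, as mentioned above).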

I would like to transform the above collection into the following:

{
  "_id" : "id111",
  "batchId" : "batch1",
  "localityID": 27,
  "batches": [
     {
         "field1" : 11,
         "field2": 101
     },
     {
         "field1" : 22,
         "field2": 202
     },
     {
         "field1" : 33,
         "field2": 303
     }
  ]
}
{
  "_id" : "id222",
  "batchId" : "batch2",
  "localityID": 27,
  "batches": [
     {
         "field1" : 44,
         "field2": 404
     },
     {
         "field1" : 55,
         "field2": 505
     }
  ]
}
{
  "_id" : "id333",
  "batchId" : "batch1",
  "localityID": 61,
  "batches": [
     {
         "field1" : 66,
         "field2": 606
     },
     {
         "field1" : 77,
         "field2": 707
     }
  ]
}

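I tried a few combinations of aggregation functions, such as the one below, but could not get the expected result. (It merges all records with the same localityID, but only into a single document rather than into batches.)

The above aggregation produces the following result: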

{
  "_id" : "id111",
  "batchId" : "batch1",
  "localityID": 27,
  "batches": [
     {
         "field1" : 11,
         "field2": 101
     },
     {
         "field1" : 22,
         "field2": 202
     },
     {
         "field1" : 33,
         "field2": 303
     },
     {
         "field1" : 44,
         "field2": 404
     },
     {
         "field1" : 55,
         "field2": 505
     }
  ]
}
{
  "_id" : "id333",
  "batchId" : "batch1",
  "localityID": 61,
  "batches": [
     {
         "field1" : 66,
         "field2": 606
     },
     {
         "field1" : 77,
         "field2": 707
     }
  ]
}

Is this possible with Mongo's aggregation framework, or would I be better off using something else?

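The idea is as follows. You can use $range to generate an array of indexes, with the step parameter set to some bucketSize. Then you only need $slice to take, at each index, a sub-array of at most bucketSize elements. Try the following: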

let bucketSize = 3;
db.old_collection.aggregate([
  {
    $group: {
      _id: "$localityID",
      id: { $first: "$_id" },
      localityID: { $first: "$localityID" },
      batches: {
        $push: {
          field1: "$field1",
          field2: "$field2"
        }
      }
    }
  },
  {
    $project: {
      _id: 0,
      localityID: "$localityID",
      batches: {
        $map: {
          input: { $range: [0, { $size: "$batches" }, bucketSize] },
          as: "index",
          in: { $slice: ["$batches", "$$index", bucketSize] }
        }
      }
    }
  },
  {
    $unwind: {
      path: "$batches",
      includeArrayIndex: "batchId"
    }
  },
  {
    $addFields: {
      batchId: {
        $concat: [
          "batch",
          { $toString: { $add: ["$batchId", 1] } }
        ]
      }
    }
  },
  // $sort is optional. You can remove it if you don't need it.
  {
    $sort: {
      localityID: 1,
      batchId: 1
    }
  },
  { $out: "new_collection" }
]);
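
To make the $range / $map / $slice step easier to follow, here is a minimal plain-JavaScript sketch of the same chunking logic. It runs outside MongoDB on an in-memory array; the helper name chunkByRange is hypothetical and only mirrors what that $project stage does to each batches array.

```javascript
// Plain-JavaScript analogue of the $project stage (illustrative only).
// chunkByRange is a hypothetical helper, not a MongoDB API.
function chunkByRange(items, bucketSize) {
  const chunks = [];
  // Like $range: [0, items.length, bucketSize] -> 0, 3, 6, ...
  for (let index = 0; index < items.length; index += bucketSize) {
    // Like $slice: [items, index, bucketSize]
    chunks.push(items.slice(index, index + bucketSize));
  }
  return chunks;
}

// The five field1 values of localityID 27, split into buckets of 3.
const exampleBatches = chunkByRange([11, 22, 33, 44, 55], 3);
console.log(JSON.stringify(exampleBatches)); // [[11,22,33],[44,55]]
```

The last chunk is simply shorter when the group size is not a multiple of bucketSize, which matches the two-element batch2 for localityID 27.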

Output

[
  {
    "_id": ObjectId("…"),
    "localityID": 27,
    "batches": [
      {
        "field1": 11,
        "field2": 101
      },
      {
        "field1": 22,
        "field2": 202
      },
      {
        "field1": 33,
        "field2": 303
      }
    ],
    "batchId": "batch1"
  },
  {
    "_id": ObjectId("…"),
    "localityID": 27,
    "batches": [
      {
        "field1": 44,
        "field2": 404
      },
      {
        "field1": 55,
        "field2": 505
      }
    ],
    "batchId": "batch2"
  },
  {
    "_id": ObjectId("…"),
    "localityID": 61,
    "batches": [
      {
        "field1": 66,
        "field2": 606
      },
      {
        "field1": 77,
        "field2": 707
      }
    ],
    "batchId": "batch1"
  }
]

As already stated, I don't get the logic behind the batchId field. Apart from that, a simple solution could be:

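db.collection.aggregate([
   // Collect field1/field2 of every document that shares a localityID.
   { $group: { _id: "$localityID", batches: { $push: { field1: "$field1", field2: "$field2" } } } },
   {
      $project: {
         localityID: "$_id",
         // Keep the first three elements (position 0, length 3).
         batches: { $slice: ["$batches", 0, 3] }
      }
   }
])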
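
One detail worth checking in a pipeline like this is the position argument of the three-argument $slice form, which is zero-based. It is also worth noting that a single $slice yields only one batch per localityID, not all of them. The plain-JavaScript sketch below illustrates the position semantics on an in-memory array; sliceFrom is a hypothetical helper and covers only non-negative positions.

```javascript
// Illustrative analogue of { $slice: [array, position, n] } for position >= 0.
// sliceFrom is a hypothetical helper, not part of MongoDB or its drivers.
function sliceFrom(array, position, n) {
  return array.slice(position, position + n);
}

const field1Values = [11, 22, 33, 44, 55];
const fromOne = sliceFrom(field1Values, 1, 3);  // 22, 33, 44 (skips the first element)
const fromZero = sliceFrom(field1Values, 0, 3); // 11, 22, 33 (the first three)
console.log(fromOne, fromZero);
```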

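Your aggregation pipeline does not have any batchId field, so the result you provided is definitely not from this aggregation pipeline. I don't get the logic of the batchId field.

Yes @wernfrieddomsheit, the batchId field is not in the input. The batchId value for each localityID can be a simple sequence number, starting from 0 and running up to the total number of documents created for that localityID.

When implementing the above solution for a large collection (over 60 million records), I get the following error: $push used too much memory and cannot spill to disk. Is there a way to modify the above solution to resolve this error? I tried enabling allowDiskUse, but that did not solve the problem. Full error message:

Full response is {"operationTime": {"$timestamp": {"t": 1617712444, "i": 1}}, "ok": 0.0, "errmsg": "$push used too much memory and cannot spill to disk. Memory limit: 104857600 bytes", "code": 146, "codeName": "ExceededMemoryLimit", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1617712522, "i": 1}}, "keyId": 69039205590851}}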

I'm not very good at writing memory-efficient queries. If I find a solution, I will update you. Sorry!
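
One possible direction for the memory error, offered only as an untested sketch: the failure comes from $push accumulating every document of a localityID into a single array, so an alternative is to number the documents first and group by (localityID, batch number), which keeps each pushed array at no more than bucketSize elements. The version below assumes MongoDB 5.0 or later (for $setWindowFields and $documentNumber) and assumes that ordering by _id within a localityID is acceptable. buildBatchPipeline is a hypothetical helper name, and this has not been checked against a 60-million-record collection.

```javascript
// Sketch: number documents per localityID, then group by (localityID, batch number).
// Assumes MongoDB 5.0+; buildBatchPipeline is a hypothetical helper, not a MongoDB API.
function buildBatchPipeline(bucketSize) {
  return [
    {
      $setWindowFields: {
        partitionBy: "$localityID",
        sortBy: { _id: 1 },
        // rowNumber is 1, 2, 3, ... within each localityID.
        output: { rowNumber: { $documentNumber: {} } }
      }
    },
    {
      $group: {
        _id: {
          localityID: "$localityID",
          // With bucketSize 3: rows 1..3 -> 0, rows 4..6 -> 1, ...
          batchNo: {
            $toLong: { $floor: { $divide: [{ $subtract: ["$rowNumber", 1] }, bucketSize] } }
          }
        },
        // Each array receives at most bucketSize elements.
        batches: { $push: { field1: "$field1", field2: "$field2" } }
      }
    },
    {
      $project: {
        _id: 0,
        localityID: "$_id.localityID",
        batchId: { $concat: ["batch", { $toString: { $add: ["$_id.batchNo", 1] } }] },
        batches: 1
      }
    }
  ];
}

// Possible usage in the shell (collection names taken from the answer above):
// db.old_collection.aggregate(
//   [...buildBatchPipeline(3), { $out: "new_collection" }],
//   { allowDiskUse: true }
// );
```

The sort inside $setWindowFields and the $group stage still need memory, so allowDiskUse would likely remain necessary; the difference is that no single accumulator has to hold an entire localityID.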
