MongoDB aggregate $lookup: Total size of documents in matching pipeline exceeds maximum document size


I have a pretty simple $lookup aggregation query like the following:

{'$lookup':
 {'from': 'edge',
  'localField': 'gid',
  'foreignField': 'to',
  'as': 'from'}}

When I run this on a match with enough documents, I get the following error:

Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server

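All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending in a cursor does nothing. Adding a $limit to the aggregation also fails.

How can this be?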

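Then I looked at the error again. Where did that $match and $and and $eq come from? Is the aggregation pipeline, behind the scenes, farming out the $lookup call to another aggregation, one it runs on its own and that I cannot limit or use cursors with?

What is going on here?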

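As stated earlier in the comments, the error occurs because $lookup by default builds a target "array" inside the parent document from the results of the foreign collection, and the total size of the documents selected for that array pushes the parent document past the 16MB BSON limit.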

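The counter to this is to add a $unwind that immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup so that, instead of producing an array in the parent, the result is a "copy" of each parent for every matched document.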
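To picture the difference, here is a small in-memory sketch. It uses plain JavaScript arrays rather than MongoDB itself, and the helper names `lookupAsArray` and `lookupUnwound` are hypothetical; it only illustrates the shape of the output in each case.

```javascript
// In-memory illustration only: plain arrays stand in for collections.
// `lookupAsArray` and `lookupUnwound` are hypothetical helper names.

// Default $lookup shape: every match is embedded in ONE parent document.
function lookupAsArray(parents, foreign, localField, foreignField, as) {
  return parents.map(p => ({
    ...p,
    [as]: foreign.filter(f => f[foreignField] === p[localField])
  }));
}

// $lookup followed directly by $unwind: one output document per match,
// so no single document has to hold all of the matches at once.
function lookupUnwound(parents, foreign, localField, foreignField, as) {
  return parents.flatMap(p =>
    foreign
      .filter(f => f[foreignField] === p[localField])
      .map(f => ({ ...p, [as]: f }))
  );
}

const parents = [{ _id: 1, gid: 'a' }];
const edges = [
  { _id: 10, to: 'a' },
  { _id: 11, to: 'a' },
  { _id: 12, to: 'b' }
];

const embedded = lookupAsArray(parents, edges, 'gid', 'to', 'from');
const unwound = lookupUnwound(parents, edges, 'gid', 'to', 'from');

console.log(embedded.length, embedded[0].from.length); // 1 2
console.log(unwound.length);                           // 2
```

In the first shape the size of the single parent grows with every match; in the second, each output document carries only one joined document.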

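This is pretty much like regular usage of $unwind, except that instead of being processed as a "separate" pipeline stage, the unwinding action is added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which creates a matching argument that is also added to the $lookup. You can actually see this in the explain output for the pipeline.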
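A section of the core documentation on aggregation pipeline optimization actually covers the topic (briefly):
$lookup + $unwind Coalescence
New in version 3.2.
When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
This is best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. It is kept as brief as possible while both breaking and working around the BSON limit: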
const MongoClient = require('mongodb').MongoClient;

const uri = 'mongodb://localhost/test';

function data(data) {
  console.log(JSON.stringify(data, undefined, 2))
}

(async function() {

  let db;

  try {
    db = await MongoClient.connect(uri);

    console.log('Cleaning....');
    // Clean data
    await Promise.all(
      ["source","edge"].map(c => db.collection(c).deleteMany({}) )
    );

    console.log('Inserting...')

    await db.collection('edge').insertMany(
      Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
    );
    await db.collection('source').insertOne({ _id: 1 })

    console.log('Fattening up....');
    await db.collection('edge').updateMany(
      {},
      { $set: { data: "x".repeat(100000) } }
    );

    // The full pipeline. Failing test uses only the $lookup stage
    let pipeline = [
      { $lookup: {
        from: 'edge',
        localField: '_id',
        foreignField: 'gid',
        as: 'results'
      }},
      { $unwind: '$results' },
      { $match: { 'results._id': { $gte: 1, $lte: 5 } } },
      { $project: { 'results.data': 0 } },
      { $group: { _id: '$_id', results: { $push: '$results' } } }
    ];

    // List and iterate each test case
    let tests = [
      'Failing.. Size exceeded...',
      'Working.. Applied $unwind...',
      'Explain output...'
    ];

    for (let [idx, test] of Object.entries(tests)) {
      console.log(test);

      try {
        let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
            options = (( +idx === tests.length-1 ) ? { explain: true } : {});

        await new Promise((end,error) => {
          let cursor = db.collection('source').aggregate(currpipe,options);
          for ( let [key, value] of Object.entries({ error, end, data }) )
            cursor.on(key,value);
        });
      } catch(e) {
        console.error(e);
      }

    }

  } catch(e) {
    console.error(e);
  } finally {
    if (db) db.close();  // db is undefined if the connection itself failed
  }

})();

After inserting some initial data, the listing attempts to run an aggregate consisting merely of $lookup, which fails with the following error:
{ MongoError: Total size of documents in edge matching pipeline { $match: { $and: [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size
This is basically telling you that the BSON limit was exceeded on retrieval.
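The arithmetic behind that failure can be checked with a rough estimate. This sketch counts only the `data` payload written by the listing and ignores field names, `_id`, `gid` and BSON overhead, so the real size is somewhat larger.

```javascript
// Rough estimate only: counts just the `data` payload of each edge document.
const BSON_MAX_BYTES = 16 * 1024 * 1024;         // 16MB document size limit
const edgeCount = 1000;                          // documents inserted into 'edge'
const payloadBytes = 'x'.repeat(100000).length;  // one byte per ASCII character

// All 1000 edges match the single source document, so they would all
// have to fit inside its `results` array.
const arrayBytes = edgeCount * payloadBytes;

console.log(arrayBytes);                  // 100000000
console.log(arrayBytes > BSON_MAX_BYTES); // true
```

At roughly 100MB, the array is about six times larger than a single document is allowed to be.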
By contrast, the next attempt adds the $unwind and $match pipeline stages.
The explain output:
  {
    "$lookup": {
      "from": "edge",
      "as": "results",
      "localField": "_id",
      "foreignField": "gid",
      "unwinding": {                        // $unwind now is unwinding
        "preserveNullAndEmptyArrays": false
      },
      "matching": {                         // $match now is matching
        "$and": [                           // and actually executed against 
          {                                 // the foreign collection
            "_id": {
              "$gte": 1
            }
          },
          {
            "_id": {
              "$lte": 5
            }
          }
        ]
      }
    }
  },
  // $unwind and $match stages removed
  {
    "$project": {
      "results": {
        "data": false
      }
    }
  },
  {
    "$group": {
      "_id": "$_id",
      "results": {
        "$push": "$results"
      }
    }
  }

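That result of course succeeds: because the results are no longer placed into the parent document, the BSON limit cannot be exceeded.
This really happens just as a result of adding $unwind, but the $match is added, for example, to show that it too is folded into the $lookup stage. The overall effect is to "limit" the returned results in an effective way, since it is all done within that $lookup operation and no results other than the matching ones are actually returned.
By constructing the pipeline this way you can query "referenced data" that would exceed the BSON limit and then, if you want, $group the results back into an array format once they have been effectively filtered by the "hidden query" that $lookup actually performs.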
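Applied to the collection from the question, the same pattern might look like the sketch below. The function name `boundedLookup` and the `edgeFilter` argument are hypothetical, and the pipeline is only built as plain objects here; whether the filter is selective enough to keep the regrouped array under 16MB depends on your data.

```javascript
// Sketch only: `boundedLookup` and `edgeFilter` are illustrative names.
// `edgeFilter` is a condition on the joined documents, written against
// the 'from' field, e.g. { 'from.weight': { $gt: 5 } } (hypothetical field).
function boundedLookup(edgeFilter) {
  return [
    { $lookup: { from: 'edge', localField: 'gid', foreignField: 'to', as: 'from' } },
    { $unwind: '$from' },    // must directly follow $lookup to be coalesced
    { $match: edgeFilter },  // conditions on 'from.*' can be folded into $lookup
    { $group: {              // rebuild the array from the filtered matches
      _id: '$_id',
      gid: { $first: '$gid' },
      from: { $push: '$from' }
    }}
  ];
}

const pipeline = boundedLookup({ 'from.weight': { $gt: 5 } });
console.log(JSON.stringify(pipeline, undefined, 2));
```

The array would then be passed to `aggregate()` on the parent collection. Note that $group keeps only the fields it names, so any other parent fields you need have to be carried through with $first as `gid` is here.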

MongoDB 3.6 and above - additional options for "LEFT JOIN"
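As all the content above notes, the BSON limit is a "hard" limit that you cannot breach, which is generally why the $unwind is necessary as an interim step. There is however the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind, since it cannot preserve the content. Even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the array intact, causing the same BSON limit problem.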
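MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not itself breach the limit, you can put conditions in that pipeline and have the array returned "intact", possibly with no matches, as would be indicative of a "LEFT JOIN".
The new expression would then be: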
{ "$lookup": {
  "from": "edge",
  "let": { "gid": "$gid" },
  "pipeline": [
    { "$match": {
      "_id": { "$gte": 1, "$lte": 5 },
      "$expr": { "$eq": [ "$$gid", "$to" ] }
    }}          
  ],
  "as": "from"
}}

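In fact this is basically what MongoDB is doing "under the covers" with the previous syntax, since 3.6 uses $expr "internally" in order to construct the statement. The difference of course is that there is no "unwinding" option present in how the $lookup actually gets executed.
If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" would produce, and this is the normal behavior of $lookup without any other options.
However, the output array of $lookup MUST NOT cause the document in which it is created to exceed the BSON limit. So it really is up to you to ensure that any content "matching" the conditions stays under this limit, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".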
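One way to help keep the array small is to trim it inside the sub-pipeline itself. The following is a minimal sketch, assuming the bulky field is called `data` (as in the demonstration listing) and that a cap of `maxEdges` joined documents per parent is acceptable; both are illustrative assumptions, as is the helper name `leftJoinStage`.

```javascript
// Sketch only: `leftJoinStage`, `maxEdges` and the excluded `data` field
// are illustrative assumptions, not part of the original question.
function leftJoinStage(maxEdges) {
  return {
    $lookup: {
      from: 'edge',
      let: { gid: '$gid' },
      pipeline: [
        { $match: { $expr: { $eq: ['$$gid', '$to'] } } },
        { $limit: maxEdges },       // cap how many documents enter the array
        { $project: { data: 0 } }   // drop the large field from each of them
      ],
      as: 'from'
    }
  };
}

const stage = leftJoinStage(100);
console.log(JSON.stringify(stage, undefined, 2));
```

Parents with no matching edges still come back with an empty `from` array, so the "LEFT JOIN" behavior is kept. The cap silently discards matches beyond `maxEdges`, so it only suits cases where a partial array is acceptable.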
I had the same issue with the following Node.js query, because the "redemptions" collection holds more than 400,000 documents. I am using MongoDB server 4.2 and Node.js driver 3.5.3.
db.collection('businesses').aggregate([
    {
        $lookup: { from: 'redemptions', localField: "_id", foreignField: "business._id", as: "redemptions" }
    },
    {
        $project: {
            _id: 1,
            name: 1,
            email: 1,
            "totalredemptions" : {$size:"$redemptions"}
        }
    }
])

I modified the query as follows, which made it run very fast:
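db.collection('businesses').aggregate([
    query,  // stage defined elsewhere in the original code (not shown)
    {
        $lookup: {
            from: 'redemptions',
            let: { "businessId": "$_id" },
            pipeline: [
                // Keep only the redemptions that belong to the current business
                { $match: { $expr: { $eq: ["$business._id", "$$businessId"] } } },
                // Reduce each matched redemption to a tiny { totalCount: 1 } document
                { $group: { _id: "$_id", totalCount: { $sum: 1 } } },
                { $project: { "_id": 0, "totalCount": 1 } }
            ],
            as: "redemptions"
        }
    },
    {
        $project: {
            _id: 1,
            name: 1,
            email: 1,
            "totalredemptions" : {$size:"$redemptions"}
        }
    }
])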
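Since only the count is needed, the joined array can be shrunk further. The sketch below is an alternative, not the code from the answer above: it swaps the $group for a $count stage inside the sub-pipeline, so each business receives at most one small document, and then reads the number out with $arrayElemAt and $ifNull.

```javascript
// Alternative sketch (not the original author's code): count inside the
// sub-pipeline so the 'redemptions' array holds at most one document.
const countPipeline = [
  { $lookup: {
    from: 'redemptions',
    let: { businessId: '$_id' },
    pipeline: [
      { $match: { $expr: { $eq: ['$business._id', '$$businessId'] } } },
      { $count: 'totalCount' }  // emits one { totalCount: N }, or nothing when there are no matches
    ],
    as: 'redemptions'
  }},
  { $project: {
    name: 1,
    email: 1,
    // Take the single count if present, otherwise fall back to 0
    totalredemptions: {
      $ifNull: [{ $arrayElemAt: ['$redemptions.totalCount', 0] }, 0]
    }
  }}
];

console.log(JSON.stringify(countPipeline, undefined, 2));
```

The array is passed to `db.collection('businesses').aggregate(...)` in the same way as the queries above.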

Add a $unwind directly after the $lookup. This actually changes the behavior of $lookup to some extent. Also note which MongoDB version is ac