How to limit matches to unique values in a MongoDB aggregation


Sample data set:

{
    "source": "http://adress.com/",
    "date": ISODate("2016-08-31T08:41:00.000Z"),
    "author": "Some Guy",
    "thread": NumberInt(115265),
    "commentID": NumberInt(2693454),
    "title": ["A", "title", "for", "a", "comment"],
    "comment": ["This", "is", "a", "comment", "with", "a", "duplicate"]
}
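The data set I am working with consists of user comments, each with a unique commentID. The comment itself is an array of words. I have managed to unwind the array, match the buzzword, and get back all the hits.

My problem now is eliminating duplicates, i.e. cases where the buzzword occurs more than once in a comment. I assume I have to use a $group, but I cannot find a way to do it.

The current pipeline is: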

[
    {"$unwind": "$comment"},
    {"$match": {"comment": buzzword } }
]
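This works well. But if I search for the buzzword "a", the example above is returned twice, because the word "a" occurs twice in the comment.

What I need is the JSON for a pipeline stage that drops every duplicate after the first.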

One possible solution is to use $group, like this:

...
{ $unwind: "$comment"},
{ $match: {"comment": buzzword } },
{
    $group: {
        _id : "$_id",
        source: { $first: "$source" },
        date: { $first: "$date" },
        author: { $first: "$author" },
        thread: { $first: "$thread" },
        commentID: { $first: "$commentID" },
        title: { $first: "$title" }
    } 
}
...
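
To see why the $group stage collapses the duplicate hits, here is a rough imitation of the three stages in plain JavaScript. It is only a sketch of the idea, not MongoDB itself; the _id value and the variable names are assumptions made for illustration, and only a few fields of the sample document are used.

```javascript
// Plain JavaScript imitation of $unwind, $match and $group (not MongoDB).
const docs = [{
    _id: 1, // placeholder _id, assumed for this sketch
    author: "Some Guy",
    commentID: 2693454,
    comment: ["This", "is", "a", "comment", "with", "a", "duplicate"]
}];
const buzzword = "a";

// $unwind: one copy of the document per element of the comment array.
const unwound = docs.flatMap(d => d.comment.map(w => ({ ...d, comment: w })));

// $match: keep only the copies whose word equals the buzzword.
const matched = unwound.filter(d => d.comment === buzzword);

// $group by _id with $first: keep the first hit per original document.
const grouped = new Map();
for (const d of matched) {
    if (!grouped.has(d._id)) {
        grouped.set(d._id, { _id: d._id, author: d.author, commentID: d.commentID });
    }
}
const result = [...grouped.values()];

console.log(matched.length, result.length);
```

The match step yields two hits for the sample comment, while the grouped result holds a single entry. The comment array is not carried through, because after the unwind step the field holds one word rather than the array.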

Another way is to use $project before unwinding the array, to eliminate the duplicate words, like this:

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$comment"},
{$match: {"comment": buzzword } }
...
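
As a small illustration of what $setUnion does when it is given a single array, the following plain JavaScript sketch uses a Set to the same effect. This is an analogy rather than MongoDB code, and the variable names are assumptions. One difference worth keeping in mind: MongoDB does not guarantee the order of the elements that $setUnion returns, whereas a JavaScript Set keeps insertion order.

```javascript
// Plain JavaScript analogy for { $setUnion: ["$comment"] } (not MongoDB).
const comment = ["This", "is", "a", "comment", "with", "a", "duplicate"];

// Keep one copy of each distinct word, as $setUnion does for one array.
const distinctWords = [...new Set(comment)];

// After de-duplication, unwinding and matching can hit "a" at most once.
const hits = distinctWords.filter(word => word === "a");

console.log(distinctWords.length, hits.length);
```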

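Update in response to the comments:

To keep the comment array, you can project the array to another field and then unwind that field, like this: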

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: 1,
        commentWord: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$commentWord"},
{$match: {"commentWord": buzzword } }
...


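Hope this helps.

You can run a single pipeline without using $unwind by taking advantage of the array operators $arrayElemAt and $filter. The former gives you the first element of a given array, and that array is the result of filtering the elements with the latter.

Follow this example to get the desired result: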

db.collection.aggregate([
    { "$match": { "comment": buzzword } },
    {
        "$project": {
            "source": 1,
            "date": 1,
            "author": 1,
            "thread": 1,
            "commentID": 1,
            "title": 1,
            "comment": 1,
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    {
                        "$filter": {
                            "input": "$comment",
                            "as": "word",
                            "cond": {
                                "$eq": ["$$word", buzzword]
                            }
                        }
                    }, 0
                ]
            }
        }
    }
])

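Explanation

In the above pipeline, the trick is to first filter the comment array by selecting only the elements that satisfy a given condition. To demonstrate the concept, run the following pipeline: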

db.collection.aggregate([
    {
        "$project": {
            "filtered_comment": {
                "$filter": {
                    "input": ["This", "is", "a", "comment", "with", "a", "duplicate"], /* hardcoded input array for demo */
                    "as": "word", /* The variable name for the element in the input array. 
                                     The as expression accesses each element in the input array by this variable.*/
                    "cond": { /* this condition determines whether to include the element in the resulting array. */
                        "$eq": ["$$word", "a"] /* condition where the variable equals the buzzword "a" */
                    }
                }
            }
        }
    }
])
Output

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "filtered_comment" : [ 
        "a", 
        "a"
    ]
}
Since the input parameter accepts an expression that resolves to an array, you can use the array field instead.
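
A minimal sketch of that substitution follows, assuming the array lives in the comment field as in the sample document. The helper name buildFilterPipeline is hypothetical and only serves to make the buzzword a parameter.

```javascript
// Hypothetical helper (not from the original answer): returns a one-stage
// pipeline that filters each document's own comment array.
function buildFilterPipeline(buzzword) {
    return [
        {
            $project: {
                filtered_comment: {
                    $filter: {
                        input: "$comment", // field path instead of a literal array
                        as: "word",
                        cond: { $eq: ["$$word", buzzword] }
                    }
                }
            }
        }
    ];
}

console.log(JSON.stringify(buildFilterPipeline("a"), null, 2));
```

In the mongo shell this could be passed along as db.collection.aggregate(buildFilterPipeline("a")).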


Taking the above result further, we can show how the $arrayElemAt operator works:

db.collection.aggregate([
    {
        "$project": {
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    ["a", "a"], /* array produced by the above $filter expression */
                    0 /* the index position of the element we want to return, here being the first */
                ]   
            }
        }
    }
])
Output

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "distinct_matched_comment": "a"
}
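Since the <array> expression in the operator

{ "$arrayElemAt": [ <array>, <idx> ] }

can be any valid expression that resolves to an array, the $filter expression can be used in its place, which is what the final pipeline does.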
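
For readers who think in application code, the combination of the two operators behaves roughly like the following plain JavaScript. This is an analogy under stated assumptions, not MongoDB code, and the function name firstMatch is made up for the example.

```javascript
// Plain JavaScript analogy for $arrayElemAt applied to a $filter result.
const comment = ["This", "is", "a", "comment", "with", "a", "duplicate"];

function firstMatch(words, buzzword) {
    // $filter: keep only the elements equal to the buzzword.
    const filtered = words.filter(word => word === buzzword);
    // $arrayElemAt with index 0: take the first element of that array.
    return filtered[0];
}

console.log(firstMatch(comment, "a"));
```

When nothing matches, the JavaScript version returns undefined. In MongoDB, $arrayElemAt returns no result when the index is out of bounds, so the projected field is simply absent.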

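Thank you, this really helped. Is there a way to somehow keep the comment array? Otherwise I will have to look it up later. ^^

Your updated solution keeps the comments but somehow skips removing the duplicates. Well, worst case I can remove the duplicates in RStudio. Anyway, thanks for your solution: it is simple and elegant, and I could understand it despite my lack of knowledge.

What exactly do you mean by skipping the removal of duplicates? If you use "a" as the buzzword, the result will not give you the first document twice, but the comment array is the original comment array (with duplicates). Isn't that what you would expect for the original array?

Well, I ran the pipeline on my test buzzword: 6551 hits, of which 4590 are unique. So with the array kept intact, it did not select for uniqueness.

That should not be possible, but without an example it is hard to troubleshoot.

The solution somehow returned the whole data set for me. Now it works fine. Thank you very much. Now I have to dig into how it works :D

Nice answer, with less overhead than mine. +1

@DAXaholic Cheers

@AndersBernard I added some explanation to my answer.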