如何在MongoDB中将查找限制为聚合中的唯一值_Mongodb_Mongodb Query_Aggregation Framework

如何在MongoDB中将查找限制为聚合中的唯一值

mongodb

如何在MongoDB中将查找限制为聚合中的唯一值,mongodb,mongodb-query,aggregation-framework,Mongodb,Mongodb Query,Aggregation Framework,示例数据集： { "source": "http://adress.com/", "date": ISODate("2016-08-31T08:41:00.000Z"), "author": "Some Guy", "thread": NumberInt(115265), "commentID": NumberInt(2693454), "title": ["A", "title", "for", "a", "comment"], "com

示例数据集：

{
    "source": "http://adress.com/",
    "date": ISODate("2016-08-31T08:41:00.000Z"),
    "author": "Some Guy",
    "thread": NumberInt(115265),
    "commentID": NumberInt(2693454),
    "title": ["A", "title", "for", "a", "comment"],
    "comment": ["This", "is", "a", "comment", "with", "a", "duplicate"]
}

我使用的数据集基本上是来自用户的注释，具有唯一的

commentID

。注释本身是一个单词数组。我已经成功地解开了阵列，匹配了流行语并找回了所有的发现

我现在的问题是消除重复，即流行语在评论中多次出现。我想我必须使用一个小组，但找不到一种方法

目前的管道是：

[
    {"$unwind": "$comment"},
    {"$match": {"comment": buzzword } }
]

这很管用。但是如果我在搜索流行词“a”，在上面的例子中，它会找到两次注释，因为单词“a”会出现两次

我需要的是一个JSON，以便管道将所有重复项都放在第一个之后。

一个可能的解决方案是使用

$group

这样做

...
{ $unwind: "$comment"},
{ $match: {"comment": buzzword } },
{
    $group: {
        _id : "$_id",
        source: { $first: "$source" },
        date: { $first: "$date" },
        author: { $first: "$author" },
        thread: { $first: "$thread" },
        commentID: { $first: "$commentID" },
        title: { $first: "$title" }
    } 
}
...

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$comment"},
{$match: {"comment": buzzword } }
...

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: 1,
        commentWord: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$commentWord"},
{$match: {"commentWord": buzzword } }
...

另一种方法是在展开数组之前使用

$project

，以消除重复的单词，如

...
{ $unwind: "$comment"},
{ $match: {"comment": buzzword } },
{
    $group: {
        _id : "$_id",
        source: { $first: "$source" },
        date: { $first: "$date" },
        author: { $first: "$author" },
        thread: { $first: "$thread" },
        commentID: { $first: "$commentID" },
        title: { $first: "$title" }
    } 
}
...

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$comment"},
{$match: {"comment": buzzword } }
...

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: 1,
        commentWord: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$commentWord"},
{$match: {"commentWord": buzzword } }
...

因评论而更新：

要保留

注释

数组，可以将该数组投影到另一个字段，然后像这样展开该字段

...
{ $unwind: "$comment"},
{ $match: {"comment": buzzword } },
{
    $group: {
        _id : "$_id",
        source: { $first: "$source" },
        date: { $first: "$date" },
        author: { $first: "$author" },
        thread: { $first: "$thread" },
        commentID: { $first: "$commentID" },
        title: { $first: "$title" }
    } 
}
...

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$comment"},
{$match: {"comment": buzzword } }
...

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: 1,
        commentWord: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$commentWord"},
{$match: {"commentWord": buzzword } }
...

希望这有助于您可以在不使用的情况下运行单个管道，从而利用数组运算符和。前者将为您提供给定数组中的第一个元素，此数组将是使用后者过滤元素的结果，

按照此示例获得所需的结果：

db.collection.aggregate([
    { "$match": { "comment": buzzword } },
    {
        "$project": {
            "source": 1,
            "date": 1,
            "author": 1,
            "thread": 1,
            "commentID": 1,
            "title": 1,
            "comment": 1,
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    {
                        "$filter": {
                            "input": "$comment",
                            "as": "word",
                            "cond": {
                                "$eq": ["$$word", buzzword]
                            }
                        }
                    }, 0
                ]
            }
        }
    }
])

解释

在上面的管道中，技巧是首先通过只选择满足给定条件的元素来过滤注释数组。例如，要演示此概念，请运行以下管道：

db.collection.aggregate([
    {
        "$project": {
            "filtered_comment": {
                "$filter": {
                    "input": ["This", "is", "a", "comment", "with", "a", "duplicate"], /* hardcoded input array for demo */
                    "as": "word", /* The variable name for the element in the input array. 
                                     The as expression accesses each element in the input array by this variable.*/
                    "cond": { /* this condition determines whether to include the element in the resulting array. */
                        "$eq": ["$$word", "a"] /* condition where the variable equals the buzzword "a" */
                    }
                }
            }
        }
    }
])

输出

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "filtered_comment" : [ 
        "a", 
        "a"
    ]
}

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "distinct_matched_comment": "a"
}

由于的

输入

参数接受解析为数组的表达式，因此可以改用数组字段

进一步考虑上述结果，我们可以展示操作符的工作原理：

db.collection.aggregate([
    {
        "$project": {
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    ["a", "a"], /* array produced by the above $filter expression */
                    0 /* the index position of the element we want to return, here being the first */
                ]   
            }
        }
    }
])

输出

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "filtered_comment" : [ 
        "a", 
        "a"
    ]
}

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "distinct_matched_comment": "a"
}

由于运算符中的表达式

{ "$arrayElemAt": [ <array>, <idx> ] }

谢谢你，这确实有帮助。有没有办法以某种方式保留注释数组？否则，我将不得不稍后查找。^^您更新的解决方案保留了注释，但不知何故跳过了删除重复项。好吧，最坏的情况是我可以在RStudio中删除重复项。不管怎样，谢谢你的解决方案，因为它简单而优雅，尽管我缺乏知识，但我还是能理解它。跳过删除重复项到底是什么意思？如果使用“a”作为流行语，结果不会给您两次第一个文档，但是注释数组是原始注释数组（带有重复项）-不应该是原始数组的后一个吗？好吧，我让管道在我的测试流行语上运行：6551次点击，其中4590次是唯一的。因此，在保持阵列完整的情况下，它没有选择唯一性。这不可能，但如果没有示例，很难对其进行故障排除。该解决方案以某种方式为我返回所有数据集。现在运行良好。非常感谢。现在，我必须深入研究它是如何工作的DNice回答，开销比我的少+1@DAXaholic干杯@AndersBernard我在答案中添加了一些解释。