Is it possible to write a query in ArangoDB that aggregates values from related documents?


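Suppose you run a movie subscription service with normal and premium memberships.

Here is a sample of the data generated by user activity and stored as documents in a collection: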

[
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 1
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "starwars",
        "sessionGroupID": 1,
        "elapsed": 200
    },
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 2
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 2,
        "elapsed": 500
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 3
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 3,
        "elapsed": 10
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 4
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 4,
        "elapsed": 100
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 5
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 5,
        "elapsed": 5
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "starwars",
        "sessionGroupID": 5,
        "elapsed": 25
    }
]
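As you can see, there are two kinds of "eventType":

  • "sessionInfo" documents, which hold the information common to an entire user session

  • "mediaPlay" documents, which store the number of seconds a movie was watched

(Every "mediaPlay" event carries a sessionGroupID, so it can be associated with its session.)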


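Question 1: With tens of millions of documents in total, how would you write a query that totals the viewing time of each movie, grouped by user type?

Desired query result: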

premium users - total of "elapsed":
    xmen: 500
    starwars: 200

normal users - total of "elapsed":
    xmen: 115
    starwars: 25

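Question 2: If the data is not structured well for this kind of query, what would the ideal structure be?

  • For example, would it be better to nest the "mediaPlay" events as an array inside each "sessionInfo" document, like this?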

[
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 1,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "starwars",
                "sessionGroupID": 1,
                "elapsed": 200
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 2,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 2,
                "elapsed": 500
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 3,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 3,
                "elapsed": 10
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 4,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 4,
                "elapsed": 100
            }
        ]
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 5,
        "viewLog": [
            {
                "eventType": "mediaPlay",
                "productSKU": "xmen",
                "sessionGroupID": 5,
                "elapsed": 5
            },
            {
                "eventType": "mediaPlay",
                "productSKU": "starwars",
                "sessionGroupID": 5,
                "elapsed": 25
            }
        ]
    }
]

Thanks for any guidance and advice.

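The following query iterates over the collection and collects all session IDs, grouped by user type. It then runs a subquery that iterates over the collection again and, for every document whose eventType is "mediaPlay" and whose sessionGroupID is among the collected sessions, collects the movies and sums their elapsed times.

@@coll is a bind parameter that holds the name of the collection.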

FOR doc IN @@coll
  FILTER doc.eventType == "sessionInfo"
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN @@coll
        FILTER event.sessionGroupID IN sessions
        FILTER event.eventType == "mediaPlay"
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
      )
  }

The result of this query is:

[
  {
    "userTypes": "normal",
    "movies": [
      {
        "movie": "starwars",
        "elapsed": 25
      },
      {
        "movie": "xmen",
        "elapsed": 115
      }
    ]
  },
  {
    "userTypes": "premium",
    "movies": [
      {
        "movie": "starwars",
        "elapsed": 200
      },
      {
        "movie": "xmen",
        "elapsed": 500
      }
    ]
  }
]
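
To cross-check the totals above without a database, the two steps of the query can be mirrored in plain Python. This is only a minimal in-memory sketch, not how ArangoDB executes the query: the `events` list is the sample data from the question, and the function name `totals_by_user_type` is made up for illustration.

```python
from collections import defaultdict

# Sample events copied from the question.
events = [
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 1},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 2},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 3},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 4},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 5},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def totals_by_user_type(docs):
    """Hypothetical helper mirroring the outer COLLECT and the subquery."""
    # Step 1 (outer COLLECT): gather the session IDs of each user type.
    sessions = defaultdict(set)
    for doc in docs:
        if doc["eventType"] == "sessionInfo":
            sessions[doc["userType"]].add(doc["sessionGroupID"])

    # Step 2 (subquery): for each user type, sum "elapsed" per movie
    # over the mediaPlay events that belong to the collected sessions.
    result = {}
    for user_type, session_ids in sessions.items():
        movies = defaultdict(int)
        for event in docs:
            if event["eventType"] == "mediaPlay" and event["sessionGroupID"] in session_ids:
                movies[event["productSKU"]] += event["elapsed"]
        result[user_type] = dict(movies)
    return result


totals = totals_by_user_type(events)
```

Note that step 2 rescans every document once per user type, just as the subquery does. With only two user types that is two extra passes over the data.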
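
Regarding your second question: nested arrays/objects would not optimize this query. Instead, split the data into two collections, one per eventType (for example, collections named after the eventType: sessionInfo and mediaPlay). This reduces the number of FILTER statements needed and, more importantly, lets you query sessionInfo and mediaPlay documents separately, which should improve performance considerably.

The query would then look like this: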

FOR doc IN sessionInfo
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN mediaPlay
        FILTER event.sessionGroupID IN sessions
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
      )
  }
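
The same join can also be pictured as a lookup keyed by sessionGroupID, which is roughly the access pattern the split collections make possible. The sketch below is plain Python under stated assumptions: the lists `session_info` and `media_play` stand in for the two collections, and `totals_single_pass` is a hypothetical name. It says nothing about how ArangoDB plans or executes the query.

```python
from collections import defaultdict

# Stand-ins for the two collections, filled with the sample data.
session_info = [
    {"userType": "premium", "sessionGroupID": 1},
    {"userType": "premium", "sessionGroupID": 2},
    {"userType": "normal", "sessionGroupID": 3},
    {"userType": "normal", "sessionGroupID": 4},
    {"userType": "normal", "sessionGroupID": 5},
]
media_play = [
    {"productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def totals_single_pass(sessions, plays):
    """Hypothetical helper: one pass over the plays with a session lookup."""
    # Build a lookup from session ID to user type (one pass over sessions).
    user_type_by_session = {s["sessionGroupID"]: s["userType"] for s in sessions}

    # One pass over the plays, accumulating elapsed per (user type, movie).
    totals = defaultdict(lambda: defaultdict(int))
    for event in plays:
        user_type = user_type_by_session.get(event["sessionGroupID"])
        if user_type is None:
            continue  # play without a matching session: skipped in this sketch
        totals[user_type][event["productSKU"]] += event["elapsed"]
    return {user_type: dict(movies) for user_type, movies in totals.items()}


split_totals = totals_single_pass(session_info, media_play)
```

No eventType check is needed here because each list holds a single kind of document, which is the same reason the second AQL query drops its eventType filters.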

Thanks a lot, I'll dig into this. Works great! Thanks a lot, that explains a lot!