Apache Spark: aggregating values in a multi-dimensional array column


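The dataset contains a multi-dimensional array column whose elements have a parent-child relationship, and I need to aggregate over it.

Data (example):

Input schema: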

 |-- consumerId: string
 |-- platform: string
 |-- impressions: array
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)                 // the same "id" field serves as "containerId" or "itemId"
 |    |    |-- impressionType: string (nullable = true)     // deciding factor: "container" or "item"
 |    |    |-- impressionId: long (nullable = true)         // position; an item is matched to its container by "impressions.impressionParentId = impressions.impressionId"
 |    |    |-- impressionParentId: long (nullable = true)   // relation between a container and its items, as above
 |    |    |-- impressionTimes: array (nullable = true)     // need the max time
 |    |    |    |-- element: long (containsNull = true)

Output schema:

 |-- consumerId: string
 |-- platform: string
 |-- impressions: array
 |   |-- containerId: integer          // if impressions.impressionType = 'container', its "id" becomes containerId
 |   |-- itemImpressions: array        // all impressions (as itemId) where "impressionParentId = container.impressionId"
 |   |   |-- itemId: long              // "impressions.id" where "impressions.impressionParentId = impressions.impressionId"
 |   |   |-- lastImpressionTime: long  // max(impressionTimes)
 |   |   |-- impressionCount: int

A few points:

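  • Each row contains a multi-dimensional array field, impressions.

  • impressions holds containers and their related items; the relation can be established through
    impressions.impressionParentId = impressions.impressionId
    [I think a self-join is needed to build the relation]

  • If all records are exploded, the container-to-item relation can no longer be established, because impressionId is just the position within each row, so the same values recur across rows
  • If impressions.impressionType = 'container', its "id" appears as containerId in the output schema
  • Three levels of aggregation:
    • first by consumerId and platform
    • second by container
    • third by item id, with the total count and max(impressionTimes)

  • The example above comes from an already filtered dataset. I tried exploding and then aggregating, but that does not work.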
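
The points above can be sketched in plain Python to make the intended logic concrete. This is not Spark code and not a confirmed solution, only an illustration of resolving each item's parent inside its own row before aggregating across rows. The function name `aggregate`, the sample rows, the choice to count one impression per timestamp, and the choice to drop items whose parent is missing are all assumptions; the ids are also left as strings rather than cast to the integer/long types shown in the output schema.

```python
from collections import defaultdict


def aggregate(rows):
    """Aggregate rows shaped like the input schema into the output schema."""
    # (consumerId, platform) -> containerId -> itemId -> [count, last_time]
    stats = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [0, None])))

    for row in rows:
        impressions = row["impressions"] or []
        # Resolve parents inside the row: impressionId is only a position
        # within this row's array, so it must not be matched across rows.
        containers = {
            imp["impressionId"]: imp["id"]
            for imp in impressions
            if imp["impressionType"] == "container"
        }
        per_container = stats[(row["consumerId"], row["platform"])]
        for imp in impressions:
            if imp["impressionType"] != "item":
                continue
            container_id = containers.get(imp["impressionParentId"])
            if container_id is None:
                continue  # assumption: items without a container are dropped
            times = imp["impressionTimes"] or []
            entry = per_container[container_id][imp["id"]]
            entry[0] += len(times)  # assumption: one impression per timestamp
            if times:
                latest = max(times)
                entry[1] = latest if entry[1] is None else max(entry[1], latest)

    return [
        {
            "consumerId": consumer_id,
            "platform": platform,
            "impressions": [
                {
                    "containerId": container_id,
                    "itemImpressions": [
                        {
                            "itemId": item_id,
                            "lastImpressionTime": last,
                            "impressionCount": count,
                        }
                        for item_id, (count, last) in items.items()
                    ],
                }
                for container_id, items in per_container.items()
            ],
        }
        for (consumer_id, platform), per_container in stats.items()
    ]


# Hypothetical sample: impressionId 0 refers to a different container in each row.
rows = [
    {"consumerId": "c1", "platform": "web", "impressions": [
        {"id": "100", "impressionType": "container", "impressionId": 0,
         "impressionParentId": None, "impressionTimes": [10]},
        {"id": "200", "impressionType": "item", "impressionId": 1,
         "impressionParentId": 0, "impressionTimes": [11, 15]},
    ]},
    {"consumerId": "c1", "platform": "web", "impressions": [
        {"id": "300", "impressionType": "container", "impressionId": 0,
         "impressionParentId": None, "impressionTimes": [18]},
        {"id": "200", "impressionType": "item", "impressionId": 1,
         "impressionParentId": 0, "impressionTimes": [20]},
    ]},
]
result = aggregate(rows)
```

In Spark the same idea would mean building the container/item pairs per row before any explode, for example inside a UDF or with Spark SQL's higher-order array functions, and only then grouping by consumerId, platform, container and item. Containers that have no items do not appear in the output of this sketch.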

    Please edit your post to include the code; that will help identify the problem.
    Edited and updated the example.