Apache Spark: aggregating values in a multi-dimensional array column
The dataset contains a multi-dimensional array column whose elements have a parent-child relationship, and its values need to be aggregated.
Tags: apache-spark, apache-spark-sql
Data (sample) -
Input schema -
|-- consumerId: string
|-- platform: string
|-- impressions: array
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true) // the same "id" is treated as "containerId" or "itemId"
| | |-- impressionType: string (nullable = true) // deciding factor: "container" or "item"
| | |-- impressionId: long (nullable = true) // position; an item is identified by "impressions.impressionParentId = impressions.impressionId"
| | |-- impressionParentId: long (nullable = true) // relation between a container and its items, as above
| | |-- impressionTimes: array (nullable = true) // the max time is needed
| | | |-- element: long (containsNull = true)
Output schema -
|-- consumerId: string
|-- platform: string
|-- impressions: array
| |-- containerId: integer // if impressions.impressionType = 'container', its "id" becomes the containerId
| |-- itemImpressions: array // all impression ids (as itemId) where "impressionParentId = container.impressionId"
| | |-- itemId: long // "impressions.id" where "impressions.impressionParentId = impressions.impressionId"
| | |-- lastImpressionTime: long // max(impressionTimes)
| | |-- impressionCount: int
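A few points -
- An item is identified by impressions.impressionParentId = impressions.impressionId [I think a self-join is needed to establish the relationship]
- If impressions.impressionType = 'container', then its "id" becomes the containerId
Display in the output schema -
- First, consumerId and platform
- Second, the containerId
- Third, the itemId, the total count, and max(impressionTimes)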
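The sample above comes from a filtered dataset. I tried explode and aggregate, but it did not work.
Please edit the post and include your code; that will help identify the problem.
Edited and updated the sample.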
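Since no code was posted, here is a minimal plain-Python sketch of the aggregation the question seems to describe, applied to the `impressions` array of a single row. It is not Spark code and not the asker's method. It only pins down the intended semantics so that a Spark version (explode plus self-join, or a UDF) can be checked against it. The function name `aggregate_impressions` and the sample values are hypothetical. It also assumes that `impressionCount` is the total number of entries in `impressionTimes` per item, and that `id` values are passed through without casting to the integer/long types of the output schema.

```python
from collections import defaultdict


def aggregate_impressions(impressions):
    """Group item impressions under their parent container for one row."""
    # Map each container's impressionId (its position) to the container's id.
    container_by_position = {
        imp["impressionId"]: imp["id"]
        for imp in impressions
        if imp is not None and imp["impressionType"] == "container"
    }

    # (containerId, itemId) -> [impression count, latest impression time]
    stats = defaultdict(lambda: [0, None])
    for imp in impressions:
        if imp is None or imp["impressionType"] != "item":
            continue
        # The "self-join": impressionParentId of the item = impressionId of the container.
        container_id = container_by_position.get(imp["impressionParentId"])
        if container_id is None:
            continue  # orphan item: no matching container in this row
        times = imp["impressionTimes"] or []
        entry = stats[(container_id, imp["id"])]
        entry[0] += len(times)  # assumption: count = number of recorded times
        if times:
            latest = max(times)
            entry[1] = latest if entry[1] is None else max(entry[1], latest)

    # Reshape into the nested output layout; containers without items keep an empty list.
    items_by_container = {cid: [] for cid in container_by_position.values()}
    for (container_id, item_id), (count, last_time) in stats.items():
        items_by_container[container_id].append(
            {"itemId": item_id, "lastImpressionTime": last_time, "impressionCount": count}
        )
    return [
        {"containerId": cid, "itemImpressions": items}
        for cid, items in items_by_container.items()
    ]


# Hypothetical sample: one container at position 1 with two distinct items.
sample = [
    {"id": "100", "impressionType": "container", "impressionId": 1,
     "impressionParentId": None, "impressionTimes": [10]},
    {"id": "200", "impressionType": "item", "impressionId": 2,
     "impressionParentId": 1, "impressionTimes": [11, 15]},
    {"id": "200", "impressionType": "item", "impressionId": 3,
     "impressionParentId": 1, "impressionTimes": [20]},
    {"id": "300", "impressionType": "item", "impressionId": 4,
     "impressionParentId": 1, "impressionTimes": [12]},
]
result = aggregate_impressions(sample)
print(result)
```

In PySpark, a function like this could be wrapped with `pyspark.sql.functions.udf` and an explicit return type. The same grouping could also be expressed by exploding `impressions`, joining items to containers on `impressionParentId = impressionId` within each `consumerId` and `platform`, and then grouping. Which of these fits the real data is an open question until the original attempt is posted.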