
Nested grouping and reduction with Spark-Python (JSON)


I'm trying to organize my data with nested grouping: group by user, then within each user group by CampaignID, then within each campaign group by MetricID.

I have the following DataFrame structure:

+----------+--------+------+
|CampaignID|MetricID|UserID|
+----------+--------+------+
|         3|       1|     1|
|         4|       3|     3|
|         4|       2|     3|
|         3|       2|     2|
|         2|       3|     3|
+----------+--------+------+
I wrote the code below:

rdd = newDf.rdd
# Group rows by UserID, then materialize each group's iterator as a list of Rows
new = rdd.groupBy(lambda x: x["UserID"]).map(lambda x: (x[0], list(x[1])))
new.take(5)
Output:

[('1',
  [Row(CampaignID='3', MetricID='1', UserID='1'),
   Row(CampaignID='2', MetricID='1', UserID='1'),
   Row(CampaignID='1', MetricID='3', UserID='1')])]
Note that I have about 10k records. At this point the data is grouped by UserID; I'm trying to work out how to group it further by CampaignID and then by MetricID, and then count the records that share the same MetricID, producing output shaped like this:

[{
  "UserID" : "1",
  "data" : [{
    "CampaignID" : "1",
    "data" : [{
      "MetricID" : "1",
      "Count" : "5"
    }]
  }]
}]

There are multiple CampaignIDs and multiple MetricIDs. My idea is to group first and then reduce, checking for records with the same MetricID in order to count them. Any ideas or code samples would be helpful.
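To make the intended reduction concrete, here is a plain-Python sketch (no Spark) of the same nested group-and-count on the sample rows above; the tuple order and dict reshaping are illustrative assumptions, not part of the original question:

```python
from collections import defaultdict
import json

# Sample rows mirroring the DataFrame above: (CampaignID, MetricID, UserID)
rows = [
    ("3", "1", "1"),
    ("4", "3", "3"),
    ("4", "2", "3"),
    ("3", "2", "2"),
    ("2", "3", "3"),
]

# Nested counting: UserID -> CampaignID -> MetricID -> count
counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
for campaign_id, metric_id, user_id in rows:
    counts[user_id][campaign_id][metric_id] += 1

# Reshape the nested dicts into the target JSON structure
result = [
    {
        "UserID": user_id,
        "data": [
            {
                "CampaignID": campaign_id,
                "data": [
                    {"MetricID": metric_id, "Count": str(n)}
                    for metric_id, n in metrics.items()
                ],
            }
            for campaign_id, metrics in campaigns.items()
        ],
    }
    for user_id, campaigns in counts.items()
]

print(json.dumps(result))
```

This is only meant to pin down the semantics; on 10k+ records the Spark answer below does the same thing distributed.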

You can perform several groupBy + collect_list steps to build up the nested structure, and finally use to_json to convert the result to a JSON string:

import pyspark.sql.functions as F

result = df.groupBy(
    'UserID', 'CampaignID', 'MetricID'
).count().groupBy(
    'UserID', 'CampaignID'
).agg(
    # nest the (MetricID, count) pairs under each campaign
    F.collect_list(F.struct('MetricID', 'count')).alias('data')
).groupBy(
    'UserID'
).agg(
    # nest the campaigns under each user
    F.collect_list(F.struct('CampaignID', 'data')).alias('data')
).agg(
    # collapse all users into a single JSON string
    F.to_json(F.collect_list(F.struct('UserID', 'data'))).alias('result')
)

result.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"UserID":1,"data":[{"CampaignID":3,"data":[{"MetricID":1,"count":1}]}]},{"UserID":3,"data":[{"CampaignID":2,"data":[{"MetricID":3,"count":1}]},{"CampaignID":4,"data":[{"MetricID":3,"count":1},{"MetricID":2,"count":1}]}]},{"UserID":2,"data":[{"CampaignID":3,"data":[{"MetricID":2,"count":1}]}]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
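As a quick sanity check (plain Python, no Spark needed), the JSON string shown above can be parsed back and inspected; the string below is copied verbatim from the result of result.show():

```python
import json

# The JSON string produced by result.show() above, copied verbatim
result_json = (
    '[{"UserID":1,"data":[{"CampaignID":3,"data":[{"MetricID":1,"count":1}]}]},'
    '{"UserID":3,"data":[{"CampaignID":2,"data":[{"MetricID":3,"count":1}]},'
    '{"CampaignID":4,"data":[{"MetricID":3,"count":1},{"MetricID":2,"count":1}]}]},'
    '{"UserID":2,"data":[{"CampaignID":3,"data":[{"MetricID":2,"count":1}]}]}]'
)

parsed = json.loads(result_json)
by_user = {u["UserID"]: u["data"] for u in parsed}
print(sorted(by_user))  # → [1, 2, 3]
```

Note that the final .agg without a groupBy collapses everything into a single row; if you would rather keep one JSON string per user, drop that last aggregation and apply to_json to each user's struct instead.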