Apache Spark: aggregating and extracting features at the same time with PySpark

apache-spark pyspark

I have this dataset:

+---------+------+------------------+--------------------+-------------+
|    LCLid|season|       sum(KWH/hh)|         avg(KWH/hh)|Acorn_grouped|
+---------+------+------------------+--------------------+-------------+
|MAC000023|autumn|4067.4269999000007| 0.31550007755972703|            4|
|MAC000128|spring| 961.2639999999982| 0.10876487893188484|            2|
|MAC000012|summer| 121.7360000000022|0.027548314098212765|            0|
|MAC000053|autumn| 2289.498000000006| 0.17883908764255632|            2|
|MAC000121|spring| 1893.635999900008| 0.21543071671217384|            1|
+---------+------+------------------+--------------------+-------------+

For each consumer we have the total and average consumption per season, and each consumer's Acorn group is fixed.

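I want to aggregate by id and, in the same step, extract these new features and round the values to integers, so that I end up with data like this: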

+---------+-------------+-------------------+------------------+------------------+------------------
|    LCLid|Acorn_grouped|autumn_avg(KWH/hh) |autumn_sum(KWH/hh)|autumn_max(KWH/hh)|spring_avg(KWH/hh)
+---------+-------------+-------------------+------------------+------------------+------------------
|MAC000023|            4|                   |                  |                  |
|MAC000128|            2|                   |                  |                  |
|MAC000012|            0|                   |                  |                  |
|MAC000053|            2|                   |                  |                  |
|MAC000121|            1|                   |                  |                  |

You can do the following:

import pyspark.sql.functions as F

result = df.groupBy('LCLid', 'Acorn_grouped') \
           .pivot('season') \
           .agg(
               F.round(F.first('sum(KWH/hh)')).alias('sum(KWH/hh)'),
               F.round(F.first('avg(KWH/hh)')).alias('avg(KWH/hh)')
           ).fillna(0)   # replace nulls with zero;
                         # skip this if you want to keep nulls

Thank you very much, it works. Do you know how to apply the round function in this process?

@eyabaklouti I have added the round function.