Apache Spark: aggregating and extracting features at the same time with PySpark

I have this dataset:
+---------+------+------------------+--------------------+-------------+
| LCLid|season| sum(KWH/hh)| avg(KWH/hh)|Acorn_grouped|
+---------+------+------------------+--------------------+-------------+
|MAC000023|autumn|4067.4269999000007| 0.31550007755972703| 4|
|MAC000128|spring| 961.2639999999982| 0.10876487893188484| 2|
|MAC000012|summer| 121.7360000000022|0.027548314098212765| 0|
|MAC000053|autumn| 2289.498000000006| 0.17883908764255632| 2|
|MAC000121|spring| 1893.635999900008| 0.21543071671217384| 1|
For each consumer we have the total and average consumption per season, and each consumer's Acorn_grouped value is fixed.
I want to aggregate by id while extracting these new features, rounding the values to integers, so that I end up with this data:
+---------+-------------+------------------+------------------+------------------+------------------+...
|    LCLid|Acorn_grouped|autumn_avg(KWH/hh)|autumn_sum(KWH/hh)|autumn_max(KWH/hh)|spring_avg(KWH/hh)|...
+---------+-------------+------------------+------------------+------------------+------------------+...
|MAC000023|            4|                  |                  |                  |                  |...
|MAC000128|            2|                  |                  |                  |                  |...
|MAC000012|            0|                  |                  |                  |                  |...
|MAC000053|            2|                  |                  |                  |                  |...
|MAC000121|            1|                  |                  |                  |                  |...
+---------+-------------+------------------+------------------+------------------+------------------+...
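For readers without a Spark session, the reshaping being asked for can be sketched with pandas (a stand-in for illustration only, not the PySpark answer): group by the id columns, pivot on season, round, and flatten the resulting column names. The sample rows are taken from the table above.

```python
import pandas as pd

# A few rows from the dataset above (one season each, so some
# pivoted cells will be missing and need fillna)
df = pd.DataFrame({
    "LCLid": ["MAC000023", "MAC000128", "MAC000012"],
    "season": ["autumn", "spring", "summer"],
    "sum(KWH/hh)": [4067.4269999000007, 961.2639999999982, 121.7360000000022],
    "avg(KWH/hh)": [0.31550007755972703, 0.10876487893188484, 0.027548314098212765],
    "Acorn_grouped": [4, 2, 0],
})

# One row per (LCLid, Acorn_grouped), one column per (metric, season) pair
wide = df.pivot_table(index=["LCLid", "Acorn_grouped"],
                      columns="season",
                      values=["sum(KWH/hh)", "avg(KWH/hh)"],
                      aggfunc="first").round()

# Flatten the MultiIndex columns to names like "autumn_sum(KWH/hh)"
wide.columns = [f"{season}_{metric}" for metric, season in wide.columns]
wide = wide.fillna(0).reset_index()
print(wide)
```

Seasons a consumer has no row for come out as NaN after the pivot, which is exactly what the `fillna(0)` in the PySpark answer below handles.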
You can do the following:

import pyspark.sql.functions as F

result = df.groupBy('LCLid', 'Acorn_grouped') \
    .pivot('season') \
    .agg(
        F.round(F.first('sum(KWH/hh)')).alias('sum(KWH/hh)'),
        F.round(F.first('avg(KWH/hh)')).alias('avg(KWH/hh)')
    ).fillna(0)  # replace nulls with zero -
                 # you can skip this if you want to keep nulls
Comment: Thank you very much, it works. Do you know how to apply the round function in this process?
Reply: @eyabaklouti I added the round function.
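As the comment thread notes, F.round also accepts an optional scale argument, e.g. F.round(F.first('avg(KWH/hh)'), 2) to keep two decimals instead of rounding to an integer. One caveat worth knowing: Spark SQL's round uses HALF_UP rounding (ties go away from zero), which differs from Python's built-in round (ties go to the nearest even value). A minimal sketch with the standard decimal module (the helper name is mine, not a Spark API) shows the distinction:

```python
from decimal import Decimal, ROUND_HALF_UP

def spark_style_round(x: float, scale: int = 0) -> float:
    """Mimic Spark SQL's round(), which rounds ties away from zero (HALF_UP)."""
    q = Decimal(1).scaleb(-scale)  # scale=2 -> Decimal('0.01')
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

print(spark_style_round(0.31550007755972703, 2))  # 0.32
print(spark_style_round(2.5))                     # 3.0, while Python's round(2.5) == 2
```

So a value like 2.5 becomes 3 in the PySpark answer above, even though plain Python would round it down to 2.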