Python pandas groupby.apply to pyspark

python pandas apache-spark pyspark

I have the following custom function that performs an aggregation on a pandas DataFrame, and I want to do the same thing in pyspark:

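import pandas as pd

def custom_aggregation(x, queries):
    names = {}
    for k, v in queries.items():
        # amounts to add: credits and debits matching the "plus" conditions
        plus = x.query(v["plus_credit"])['OBNETCRE'].sum() + x.query(v["plus_debit"])['OBNETDEB'].sum()
        # amounts to subtract: credits and debits matching the "minus" conditions
        minus = x.query(v["minus_credit"])['OBNETCRE'].sum() + x.query(v["minus_debit"])['OBNETDEB'].sum()
        names[k] = plus - minus
    return pd.Series(names, index=list(names.keys()))

df = df.groupby(['LBUDG']).apply(custom_aggregation, queries).sum()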

where queries is a dictionary of query strings like:

{'first_queries': {
    'plus_credit': "…rg2 in ('237', '238')",
    'plus_debit': "…rg2 in ('237', '238')",
    'minus_credit': "…rg2 in ('237', '238')",
    'minus_debit': "… in ('20', '21', '23')"
    }
}
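To make the pandas behaviour concrete, here is a minimal self-contained sketch on made-up data. The column name rg1, the rule name net, the group labels A and B, and all amounts are assumptions chosen for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Made-up sample data; `rg1` is a hypothetical account-class column.
df = pd.DataFrame({
    "LBUDG":    ["A", "A", "A", "B", "B"],
    "OBNETCRE": [100.0, 50.0, 0.0, 10.0, 40.0],
    "OBNETDEB": [0.0, 20.0, 30.0, 5.0, 0.0],
    "rg1":      ["10", "13", "10", "10", "13"],
})

# Hypothetical rule: each value is a condition string understood by DataFrame.query.
queries = {
    "net": {
        "plus_credit": "rg1 in ('10', '11')",
        "plus_debit": "rg1 in ('10', '11')",
        "minus_credit": "rg1 in ('12', '13')",
        "minus_debit": "rg1 in ('12', '13')",
    }
}


def custom_aggregation(x, queries):
    """Return one value per rule for a single group `x`."""
    names = {}
    for k, v in queries.items():
        plus = x.query(v["plus_credit"])["OBNETCRE"].sum() + x.query(v["plus_debit"])["OBNETDEB"].sum()
        minus = x.query(v["minus_credit"])["OBNETCRE"].sum() + x.query(v["minus_debit"])["OBNETDEB"].sum()
        names[k] = plus - minus
    return pd.Series(names, index=list(names.keys()))


# Select the value columns explicitly so the grouping column is not passed to the function.
result = df.groupby("LBUDG")[["OBNETCRE", "OBNETDEB", "rg1"]].apply(custom_aggregation, queries)
print(result)
```

The result has one row per LBUDG value and one column per rule, which is the shape wanted from pyspark as well.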

So I replaced the pandas "query" with pyspark "sql":

def custom_aggregation_pyspark(x, queries):
    x.createOrReplaceTempView("df")
    names = {}
    for k, v in queries.items():
        plus = spark.sql("SELECT * FROM df WHERE " + v["plus_credit"]).select('OBNETCRE').groupby('OBNETCRE').sum().collect() + spark.sql("SELECT * FROM df WHERE " + v["plus_debit"]).select('OBNETDEB').groupby('OBNETDEB').sum().collect()
        minus = spark.sql("SELECT * FROM df WHERE " + v["minus_credit"]).select('OBNETCRE').groupby('OBNETCRE').sum().collect() + spark.sql("SELECT * FROM df WHERE " + v["minus_credit"]).select('OBNETDEB').groupby('OBNETDEB').sum().collect()
        names[k] = plus - minus
    return pd.Series(names, index=list(names.keys()))

df.groupby("LBUDG").agg(custom_aggregation_pyspark(df, queries))

I am surely heading in the wrong direction, because the code above does not work. Could you tell me where I should look?

The desired output is a table grouped by LBUDG (a string), with the other columns computed by the custom aggregation function.

Edit: sample DataFrame:

LBUDG        OBNETCRE    OBNETDEB  …0  …1
波扎特酒店   0,00        0,00      1   10
波扎特酒店   67572,00    0,00      1   10
波扎特酒店   0,00        0,00      1   10
波扎特酒店   4908,12     0,00      1   10
波扎特酒店   0,00        0,00      1   10
达福         295240,67   0,00      1   10
波扎特酒店   0,00        0,00      1   11
波扎特酒店   0,00        0,00      1   12
波扎特酒店   0,00        0,00      1   13
波扎特酒店   0,00        0,00      1   13
波扎特酒店   53697,94    0,00      1   13

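You can use expr to evaluate the conditions passed in the queries, and use conditional aggregation to compute the sums. Here is an example equivalent to the one you gave in pandas: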
from pyspark.sql import functions as F


def custom_aggregation_pyspark(df, regles_calcul):
    # One output column per rule: (credits + debits to add) minus (credits + debits to subtract).
    # F.expr parses each condition string; F.when(...).otherwise(0) keeps only matching rows in the sum.
    df1 = df.groupBy("LBUDG") \
        .agg(
        *[
            ((F.sum(F.when(F.expr(v["plus_credit"]), F.col("OBNETCRE")).otherwise(0)) +
              F.sum(F.when(F.expr(v["plus_debit"]), F.col("OBNETDEB")).otherwise(0))) -
             (F.sum(F.when(F.expr(v["minus_credit"]), F.col("OBNETCRE")).otherwise(0)) +
              F.sum(F.when(F.expr(v["minus_debit"]), F.col("OBNETDEB")).otherwise(0)))
             ).alias(k)
            for k, v in regles_calcul.items()
        ]
    )

    return df1


df = custom_aggregation_pyspark(df, queries)
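If you prefer to inspect or log the expression Spark will evaluate, the same conditional aggregation can also be written as one SQL string per rule and handed to expr. This is only a sketch of that variant: rule_to_sql and custom_aggregation_sql are hypothetical helper names, and it assumes the condition strings are valid Spark SQL predicates.

```python
def rule_to_sql(rule, credit_col="OBNETCRE", debit_col="OBNETDEB"):
    """Build the SQL text for one rule: (plus sums) - (minus sums)."""

    def cond_sum(cond, col):
        # Conditional aggregation: rows that do not match contribute 0 to the sum.
        return f"sum(CASE WHEN {cond} THEN {col} ELSE 0 END)"

    plus = f"{cond_sum(rule['plus_credit'], credit_col)} + {cond_sum(rule['plus_debit'], debit_col)}"
    minus = f"{cond_sum(rule['minus_credit'], credit_col)} + {cond_sum(rule['minus_debit'], debit_col)}"
    return f"({plus}) - ({minus})"


def custom_aggregation_sql(df, queries):
    """Group a Spark DataFrame by LBUDG and add one aggregated column per rule."""
    # Imported here so rule_to_sql can be used and tested without Spark installed.
    from pyspark.sql import functions as F

    return df.groupBy("LBUDG").agg(
        *[F.expr(rule_to_sql(v)).alias(k) for k, v in queries.items()]
    )


# Hypothetical rule, only to show the generated SQL text.
example_rule = {
    "plus_credit": "a = 1",
    "plus_debit": "a = 2",
    "minus_credit": "a = 3",
    "minus_debit": "a = 4",
}
sql_text = rule_to_sql(example_rule)
print(sql_text)
```

Calling custom_aggregation_sql(df, queries) should give the same shape of result as the function above: one row per LBUDG and one column per rule.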