Sum each group row-wise and add the total as a new row in a PySpark dataframe

I have a dataframe like this example:

df = spark.createDataFrame(
    [(2, "A" , "A2" , 2500),
    (2, "A" , "A11" , 3500),
    (2, "A" , "A12" , 5500),
    (4, "B" , "B25" , 7600),
    (4, "B", "B26" ,5600),
    (5, "C" , "c25" ,2658),
    (5, "C" , "c27" , 1100),
    (5, "C" , "c28" , 1200)],
    ['parent', 'group' , "brand" , "usage"])


Output:
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
|     2|    A|   A2| 2500|
|     2|    A|  A11| 3500|
|     2|    A|  A12| 5500|
|     4|    B|  B25| 7600|
|     4|    B|  B26| 5600|
|     5|    C|  c25| 2658|
|     5|    C|  c27| 1100|
|     5|    C|  c28| 1200|
+------+-----+-----+-----+

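What I want to do is compute the total usage for each group and add it as a new row, with Total as the value in the brand column. How can I do this in PySpark?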

Expected result:

+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
|     2|    A|   A2| 2500|
|     2|    A|  A11| 3500|
|     2|    A|  A12| 5500|
|     2|    A|Total|11500|
|     4|    B|  B25| 7600|
|     4|    B|  B26| 5600|
|     4|    B|Total|13200|
|     5|    C|  c25| 2658|
|     5|    C|  c27| 1100|
|     5|    C|  c28| 1200|
|     5|    C|Total| 4958|
+------+-----+-----+-----+

Use groupby + sum, then union the result back onto the original dataframe (with pyspark.sql.functions imported as F):

df.union(df.groupby('parent', 'group', F.lit('Total')).agg(F.sum('usage'))).orderBy('parent', 'group').show()

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(2, "A", "A2", 2500),
     (2, "A", "A11", 3500),
     (2, "A", "A12", 5500),
     (4, "B", "B25", 7600),
     (4, "B", "B26", 5600),
     (5, "C", "c25", 2658),
     (5, "C", "c27", 1100),
     (5, "C", "c28", 1200)],
    ['parent', 'group', "brand", "usage"])

df.show()
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
|     2|    A|   A2| 2500|
|     2|    A|  A11| 3500|
|     2|    A|  A12| 5500|
|     4|    B|  B25| 7600|
|     4|    B|  B26| 5600|
|     5|    C|  c25| 2658|
|     5|    C|  c27| 1100|
|     5|    C|  c28| 1200|
+------+-----+-----+-----+

# Group by and sum to get the totals, labelling each total row with brand = 'Total'
totals = df.groupBy(['group','parent']).agg(F.sum('usage').alias('usage')).withColumn('brand', F.lit('Total'))

# Add a temporary sort key so each Total row sorts after the detail rows of its group
totals = totals.withColumn('sort_id', F.lit(2))
df = df.withColumn('sort_id', F.lit(1))

# Union the dataframes by column name, sort, drop the temporary key and show
df.unionByName(totals).sort(['group','sort_id']).drop('sort_id').show()

+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
|     2|    A|  A12| 5500|
|     2|    A|  A11| 3500|
|     2|    A|   A2| 2500|
|     2|    A|Total|11500|
|     4|    B|  B25| 7600|
|     4|    B|  B26| 5600|
|     4|    B|Total|13200|
|     5|    C|  c25| 2658|
|     5|    C|  c28| 1200|
|     5|    C|  c27| 1100|
|     5|    C|Total| 4958|
+------+-----+-----+-----+
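
As a quick sanity check on the totals in the output above, the same group-and-sum logic can be reproduced in plain Python without a Spark session. This is only a sketch for verifying the arithmetic on the sample rows, not a replacement for the PySpark code; the helper name add_group_totals is hypothetical and not part of any library.

```python
from collections import defaultdict

# Same sample rows as the dataframe above: (parent, group, brand, usage)
rows = [
    (2, "A", "A2", 2500),
    (2, "A", "A11", 3500),
    (2, "A", "A12", 5500),
    (4, "B", "B25", 7600),
    (4, "B", "B26", 5600),
    (5, "C", "c25", 2658),
    (5, "C", "c27", 1100),
    (5, "C", "c28", 1200),
]


def add_group_totals(rows):
    """Hypothetical helper: append a 'Total' row after each (parent, group)."""
    # Sum usage per (parent, group) key
    sums = defaultdict(int)
    for parent, group, _brand, usage in rows:
        sums[(parent, group)] += usage

    result = []
    for key in sorted(sums):
        # Detail rows of the group first, in their original order
        result.extend(r for r in rows if (r[0], r[1]) == key)
        # Then the total row for that group
        result.append((key[0], key[1], "Total", sums[key]))
    return result


with_totals = add_group_totals(rows)
for row in with_totals:
    print(row)
```

The Total rows come out as 11500 for group A, 13200 for group B and 4958 for group C, matching the PySpark result.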