Python: generate a matrix of column sums and row sums as new columns in a PySpark DataFrame
I want to pivot a PySpark DataFrame and add both a row-total column and a column-total row, producing a matrix of sums.
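The snippets below assume a running SparkSession bound to the name spark; a minimal setup sketch:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Starting from this DataFrame: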
colors = spark.createDataFrame([("Red","Re",20),("Blue","Bl",30),("Green","Gr",50)]).toDF("Colors","Prefix","Value")
+------+------+-----+
|Colors|Prefix|Value|
+------+------+-----+
|   Red|    Re|   20|
|  Blue|    Bl|   30|
| Green|    Gr|   50|
+------+------+-----+
piv = colors.groupby("Colors").pivot("Prefix").sum("Value").fillna(0)
piv.withColumn("total",sum(piv[col] for col in piv.columns[1:])).show()
+------+---+---+---+-----+
|Colors| Bl| Gr| Re|total|
+------+---+---+---+-----+
| Green|  0| 50|  0|   50|
|  Blue| 30|  0|  0|   30|
|   Red|  0|  0| 20|   20|
+------+---+---+---+-----+
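Note that sum() in the last line is Python's builtin, not pyspark.sql.functions.sum: it folds the pivoted Column objects together with +, producing a single column expression. A minimal equivalent sketch, using the column names from the example above:

from functools import reduce
from operator import add
# builds piv['Bl'] + piv['Gr'] + piv['Re'] dynamically over the pivoted columns
total_expr = reduce(add, (piv[c] for c in piv.columns[1:]))
piv.withColumn("total", total_expr).show()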
The sums of the columns are needed as well, as shown below (and the code should be dynamic, in case there are more columns and rows).
Here is one way: use agg to build a 'Total' row of per-column sums over all the columns and union it onto the pivoted frame:
import pyspark.sql.functions as f
df = colors.groupby("Colors").pivot("Prefix").sum("Value").fillna(0)
cols = df.columns[1:]  # every pivoted value column
# append a 'Total' row of per-column sums, then add a per-row 'Total' column
df.union(df.agg(f.lit('Total').alias('Color'), *[f.sum(f.col(c)).alias(c) for c in cols])) \
    .withColumn("Total", sum(f.col(c) for c in cols)) \
    .show()
+------+---+---+---+-----+
|Colors| Bl| Gr| Re|Total|
+------+---+---+---+-----+
| Green|  0| 50|  0|   50|
|  Blue| 30|  0|  0|   30|
|   Red|  0|  0| 20|   20|
| Total| 30| 50| 20|  100|
+------+---+---+---+-----+
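Two details are worth noting here: agg() accepts the f.lit('Total') literal alongside the aggregate expressions, yielding a one-row DataFrame of column sums, and union() matches columns by position rather than by name, which is why alias('Color') (versus the original 'Colors') still lines up. A name-safe sketch using unionByName (available since Spark 2.3), assuming the same df and cols as above:

total_row = df.agg(f.lit('Total').alias('Colors'), *[f.sum(f.col(c)).alias(c) for c in cols])
df.unionByName(total_row).withColumn("Total", sum(f.col(c) for c in cols)).show()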
A generator expression would be more readable, (f.sum(f.col(c)).alias(c) for c in cols), and consistent with the generator expression already used inside sum():
import pyspark.sql.functions as f
df = colors.groupby("Colors").pivot("Prefix").sum("Value").fillna(0)
cols = df.columns[1:]
# same as above, but with a generator expression instead of a list comprehension
df.union(df.agg(f.lit('Total').alias('Color'), *(f.sum(f.col(c)).alias(c) for c in cols))) \
    .withColumn("Total", sum(f.col(c) for c in cols)) \
    .show()
+------+---+---+---+-----+
|Colors| Bl| Gr| Re|Total|
+------+---+---+---+-----+
| Green|  0| 50|  0|   50|
|  Blue| 30|  0|  0|   30|
|   Red|  0|  0| 20|   20|
| Total| 30| 50| 20|  100|
+------+---+---+---+-----+
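Since the code should stay dynamic, cols can also be derived from the schema instead of the positional slice df.columns[1:]; a sketch assuming every pivoted value column is numeric:

# keep only numeric columns, regardless of their names or positions
cols = [c for c, t in df.dtypes if t in ('int', 'bigint', 'double')]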