Groupby count in a PySpark DataFrame


My DataFrame looks like this:

id      age      gender       category
1        34        m             b
1        34        m             c
1        34        m             b
2        28        f             a
2        28        f             b
3        23        f             c
3        23        f             c 
3        23        f             c

I want my DataFrame to look like this:

id      age      gender       a      b      c
1        34        m          0      2      1
2        28        f          1      1      0
3        23        f          0      0      2

This is what I have done so far:

from pyspark.sql import functions as F
df = df.groupby(['id','age','gender']).pivot('category').agg(F.count('category')).fillna(0)
df.show()

How can I do this in PySpark? Is there a proper way to handle it?

Your code looks fine to me, but when I tried to run it, this is what I saw:

from pyspark.sql.functions import count

df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt', header=True, inferSchema=True, sep='\t')
df = df.groupby(['id', 'age', 'gender']).pivot('category').agg(count('category')).fillna(0)
df.show()

df:pyspark.sql.dataframe.DataFrame = [id: integer, age: integer ... 5 more fields]
+---+---+------+---+---+---+---+
| id|age|gender|  a|  b|  c| c |
+---+---+------+---+---+---+---+
|  2| 28|     f|  1|  1|  0|  0|
|  1| 34|     m|  0|  2|  1|  0|
|  3| 23|     f|  0|  0|  1|  2|
+---+---+------+---+---+---+---+

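This happens because the category value in the last two rows has an extra space character after the c, so pivot() treats 'c' and 'c ' as two different values and creates a separate column for each.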
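A quick way to confirm this kind of problem is to print the distinct values with repr(), which makes trailing whitespace visible. The sketch below uses a plain Python list as a stand-in for the category column; the sample values are an assumption chosen to match the output above, not data read from the original file.

```python
# Hypothetical stand-in for the 'category' column; two values carry
# a trailing space, matching the counts in the output above.
categories = ["b", "c", "b", "a", "b", "c", "c ", "c "]

# repr() shows the quotes around each value, so 'c' and 'c ' are
# easy to tell apart.
distinct = sorted(set(categories))
print([repr(value) for value in distinct])

# Any value that changes when right-stripped has trailing whitespace.
padded = [value for value in distinct if value != value.rstrip()]
print(padded)
```

In Spark itself, the same check would amount to collecting the distinct values of the column and inspecting them in the same way.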

Just trim the trailing whitespace with rtrim() before pivoting:

from pyspark.sql.functions import count, rtrim

df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt', header=True, inferSchema=True, sep='\t')
# Overwrite the column in place with its right-trimmed values
df = df.withColumn('category', rtrim(df['category']))
df = df.groupby(['id', 'age', 'gender']).pivot('category').agg(count('category')).fillna(0)
df.show()

df:pyspark.sql.dataframe.DataFrame = [id: integer, age: integer ... 4 more fields]
+---+---+------+---+---+---+
| id|age|gender|  a|  b|  c|
+---+---+------+---+---+---+
|  2| 28|     f|  1|  1|  0|
|  1| 34|     m|  0|  2|  1|
|  3| 23|     f|  0|  0|  3|
+---+---+------+---+---+---+
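To see why trimming changes the result, here is a minimal plain-Python sketch of what a group-by, pivot and count does, run once on the raw values and once on right-stripped values. It is an illustration of the logic, not Spark's implementation; the helper name pivot_count and the placement of the trailing spaces in the sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Sample rows (id, age, gender, category). Which rows carry the
# trailing space is assumed from the output shown above.
rows = [
    (1, 34, "m", "b"),
    (1, 34, "m", "c"),
    (1, 34, "m", "b"),
    (2, 28, "f", "a"),
    (2, 28, "f", "b"),
    (3, 23, "f", "c"),
    (3, 23, "f", "c "),
    (3, 23, "f", "c "),
]

def pivot_count(rows, normalize=lambda value: value):
    """Count categories per (id, age, gender) group, one column per category."""
    counts = defaultdict(Counter)
    for id_, age, gender, category in rows:
        counts[(id_, age, gender)][normalize(category)] += 1
    # Every distinct (normalized) category becomes a pivot column.
    columns = sorted({cat for counter in counts.values() for cat in counter})
    # Missing combinations count as 0, like fillna(0) after the pivot.
    table = {key: [counter[cat] for cat in columns] for key, counter in counts.items()}
    return columns, table

raw_columns, raw_table = pivot_count(rows)
clean_columns, clean_table = pivot_count(rows, normalize=str.rstrip)

print(raw_columns, raw_table[(3, 23, "f")])
print(clean_columns, clean_table[(3, 23, "f")])
```

Without normalization the group for id 3 is split across the 'c' and 'c ' columns; with str.rstrip the two columns collapse into one.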

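What is your question? What are you trying to achieve?

@cronoik - I have edited my question. Can you help me?

Sorry, I still don't understand your question. Your code already produces what you want (except that the third row looks like

|  3| 23|     f|  0|  0|  3|

).

@cronoik You are right... the third row should be changed.

@NikitaAgarwal Your code looks correct to me.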