Scala: group by and find count before performing a pivot in Spark
I have a DataFrame like the following:
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy on columns A and B, pivot on column C, and sum column D.
I can do that with:
df.groupBy("A", "B").pivot("C").sum("D")
But I also need a count of the rows in each group after the groupBy. If I try something like:
df.groupBy("A", "B").pivot("C").agg(sum("D"), count("D"))
I get output like:
A B large small large_count small_count
Is there a way to get a single count after the groupBy, before the pivot, instead of one count per pivoted column?
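To pin down the numbers being asked for, here is a sketch of the desired result computed with plain Scala collections (no Spark): the pivoted sums per value of C, plus one overall row count per (A, B) group.

```scala
// Sample rows (A, B, C, D) from the question
val rows = Seq(
  ("foo", "one", "small", 1),
  ("foo", "one", "large", 2),
  ("foo", "one", "large", 2),
  ("foo", "two", "small", 3))

// Group by (A, B); within each group sum D per value of C,
// and take one overall row count for the group.
val result = rows
  .groupBy { case (a, b, _, _) => (a, b) }
  .map { case ((a, b), grp) =>
    val sumPerC = grp.groupBy(_._3).map { case (c, g) => c -> g.map(_._4).sum }
    (a, b, sumPerC.getOrElse("large", 0), sumPerC.getOrElse("small", 0), grp.size)
  }
  .toSet
// ("foo","one") -> large 4, small 1, count 3
// ("foo","two") -> large 0, small 3, count 1
```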
output.withColumn("count", $"large_count" + $"small_count").show
You can drop the two count columns afterwards if you like; try it before ruling it out.
Is this what you expected?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
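The numbers in the joined table above can be sanity-checked with plain Scala collections (a sketch, no Spark needed): sumd is the total of D per (A, B), while large/small are the pivoted sums, with Option standing in for the nulls the pivot produces when a C value is absent.

```scala
val rows = Seq(
  ("foo", "one", "small", 1),
  ("foo", "one", "large", 2),
  ("foo", "one", "large", 2),
  ("foo", "two", "small", 3))

// sumd = total of D per (A, B); large/small = pivoted sums per C value.
// None models the null that appears when a group has no rows for that C.
val joined = rows
  .groupBy { case (a, b, _, _) => (a, b) }
  .map { case ((a, b), grp) =>
    val sumPerC = grp.groupBy(_._3).map { case (c, g) => c -> g.map(_._4).sum }
    (a, b, grp.map(_._4).sum, sumPerC.get("large"), sumPerC.get("small"))
  }
  .toSet
// ("foo","one") -> sumd 5, large Some(4), small Some(1)
// ("foo","two") -> sumd 3, large None,    small Some(3)
```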
This doesn't need a join. Is this what you are looking for?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.createOrReplaceTempView("dummy")
spark.sql("""
  SELECT * FROM (
    SELECT A, B, C, sum(D) AS D
    FROM dummy
    GROUP BY A, B, C GROUPING SETS ((A, B, C), (A, B))
    ORDER BY A NULLS LAST, B NULLS LAST, C NULLS LAST
  ) dummy
  PIVOT (first(D) FOR C IN ('large' large, 'small' small, null total))
""").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+
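The trick above is that GROUPING SETS ((A, B, C), (A, B)) emits both the per-C sums and an extra rollup row per (A, B) where C is null; the pivot then lands that null bucket in the total column. The two levels can be sketched with plain Scala collections (no Spark):

```scala
val rows = Seq(
  ("foo", "one", "small", 1),
  ("foo", "one", "large", 2),
  ("foo", "one", "large", 2),
  ("foo", "two", "small", 3))

// Level (A, B, C): one sum of D per distinct C value
val abc = rows
  .groupBy { case (a, b, c, _) => (a, b, Option(c)) }
  .map { case (k, g) => k -> g.map(_._4).sum }

// Level (A, B): the rollup row, where C is null in SQL (None here)
val ab = rows
  .groupBy { case (a, b, _, _) => (a, b, Option.empty[String]) }
  .map { case (k, g) => k -> g.map(_._4).sum }

// The union is what GROUPING SETS emits; pivoting on the C key
// then yields the large/small/total columns.
val groupingSets = abc ++ ab
```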
I can't just do df.groupBy("A", "B").agg(count("C")), because pivot only works on grouped data.
Can you post the output you want? I'm not sure exactly what you're after.
If my DataFrame is very large, doing a group by plus a join will have a big performance impact.
That depends on how unique the values in columns A and B are. Try it!