
Scala: group by and find count before performing a pivot in Spark


I have a dataframe that looks like this:

A   B   C       D
foo one small   1
foo one large   2
foo one large   2
foo two small   3
I need to groupBy on A and B, pivot on column C, and take the sum of column D.

I can do that with:

df.groupBy("A", "B").pivot("C").sum("D") 
However, I also need a count after the groupBy. If I try the following:

df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get output like:

A   B   large   small large_count small_count
Is there a way to get just a single count after the groupBy, before performing the pivot?

output.withColumn("count", $"large_count" + $"small_count").show

You can drop the two count columns afterwards if you like.
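For context, a minimal end-to-end sketch of that approach for spark-shell (my own sketch, not the poster's exact code; it assumes Spark 2.x naming of pivoted columns as pivotValue_alias when several aliased aggregations are used):

import org.apache.spark.sql.functions._

val df = Seq(("foo", "one", "small", 1), ("foo", "one", "large", 2),
             ("foo", "one", "large", 2), ("foo", "two", "small", 3)).toDF("A", "B", "C", "D")

// Pivot with both aggregations; columns come out as large_sum, large_count, etc.
val pivoted = df.groupBy("A", "B").pivot("C").agg(sum("D") as "sum", count("D") as "count")

// Add the per-pivot-value counts into one total, treating missing pivot
// cells (null) as zero, then drop the intermediate *_count columns.
val countCols = pivoted.columns.filter(_.endsWith("_count"))
val output = pivoted
  .withColumn("count", countCols.map(c => coalesce(col(c), lit(0L))).reduce(_ + _))
  .drop(countCols: _*)

output.show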

Do the aggregation separately, before the pivot. Is this what you are expecting?

val df = Seq(("foo", "one", "small",   1),
("foo", "one", "large",   2),
("foo", "one", "large",   2),
("foo", "two", "small",   3)).toDF("A","B","C","D")

scala> df.show
+---+---+-----+---+
|  A|  B|    C|  D|
+---+---+-----+---+
|foo|one|small|  1|
|foo|one|large|  2|
|foo|one|large|  2|
|foo|two|small|  3|
+---+---+-----+---+

scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]

scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]

scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
|  A|  B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one|   5|    4|    1|
|foo|two|   3| null|    3|
+---+---+----+-----+-----+
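As a side note, the A1/B1 aliases can be avoided by joining on column names instead of expressions, which also keeps a single copy of A and B in the result (a sketch of the same idea, under the same setup):

val totals = df.groupBy("A", "B").agg(sum('D) as "sumd")

// Joining on Seq("A", "B") deduplicates the join keys automatically.
df.groupBy("A", "B").pivot("C").sum("D")
  .join(totals, Seq("A", "B"), "inner")
  .select("A", "B", "sumd", "large", "small")
  .show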



This one doesn't need a join. Is this what you're looking for?

val df = Seq(("foo", "one", "small",   1),
("foo", "one", "large",   2),
("foo", "one", "large",   2),
("foo", "two", "small",   3)).toDF("A","B","C","D")

scala> df.show
+---+---+-----+---+
|  A|  B|    C|  D|
+---+---+-----+---+
|foo|one|small|  1|
|foo|one|large|  2|
|foo|one|large|  2|
|foo|two|small|  3|
+---+---+-----+---+

df.createOrReplaceTempView("dummy")   // registerTempTable is deprecated since Spark 2.0

spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show

+---+---+-----+-----+-----+
|  A|  B|large|small|total|
+---+---+-----+-----+-----+
|foo|one|    4|    1|    5|
|foo|two| null|    3|    3|
+---+---+-----+-----+-----+
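If SQL is not an option, here is a rough DataFrame-API analogue of the grouping-sets trick (a sketch of mine, assuming Spark 2.4+; under rollup("A", "B", "C"), grouping_id() is 0 for the (A, B, C) set and 1 for the (A, B) set):

import org.apache.spark.sql.functions._

val rolled = df.rollup("A", "B", "C")
  .agg(sum("D") as "D", grouping_id() as "gid")
  .filter('gid === 0 || 'gid === 1)             // keep only (A,B,C) and (A,B)
  .withColumn("C", coalesce('C, lit("total")))  // the (A,B) rows carry C = null

rolled.groupBy("A", "B").pivot("C", Seq("large", "small", "total")).agg(first("D")).show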

I can't do df.groupBy("A", "B").agg(count("C")), because pivot only works on grouped data.

Can you post your desired output? I'm not sure exactly what you want.

If my dataframe is very large, doing a group by plus a join will have a big performance impact.

That depends on how unique columns A and B are. Try it!
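On the performance point: if the number of distinct (A, B) pairs is small, the extra join can be made cheap with a broadcast hint (a hedged sketch, not a guarantee; measure on your own data):

import org.apache.spark.sql.functions._

val totals  = df.groupBy("A", "B").agg(sum("D") as "sumd")
val pivoted = df.groupBy("A", "B").pivot("C").sum("D")

// broadcast() ships the small totals table to every executor, so the join
// itself needs no shuffle.
pivoted.join(broadcast(totals), Seq("A", "B"), "inner").show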