How to pivot data in Hive with aggregation (Apache Spark)

Tags: apache-spark, hadoop, hive, impala

I have table data as shown below, and I want to pivot it with aggregation:
ColumnA  ColumnB          ColumnC
1        complete         Yes
1        complete         Yes
2        In progress      No
2        In progress      No
3        Not yet started  initiate
3        Not yet started  initiate
I want it pivoted like below:
ColumnA  Complete  In progress  Not yet started
1        2         0            0
2        0         2            0
3        0         0            2
Is it possible to achieve this in Hive or Impala?

Use a case expression inside a sum aggregate:
select ColumnA,
sum(case when ColumnB='complete' then 1 else 0 end) as Complete,
sum(case when ColumnB='In progress' then 1 else 0 end) as In_progress,
sum(case when ColumnB='Not yet started' then 1 else 0 end) as Not_yet_started
from table
group by ColumnA
order by ColumnA --remove if order is not necessary
;
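The conditional-aggregation idea behind that query can be sketched in plain Python: for each ColumnA group, sum a 1/0 indicator per status, exactly what each sum(case ...) column computes. The rows below are copied from the sample table; this is an illustration of the technique, not Hive code.

```python
# Plain-Python sketch of the case/sum pivot from the query above.
from collections import defaultdict

rows = [
    ("1", "complete", "Yes"),
    ("1", "complete", "Yes"),
    ("2", "In progress", "No"),
    ("2", "In progress", "No"),
    ("3", "Not yet started", "initiate"),
    ("3", "Not yet started", "initiate"),
]

statuses = ["complete", "In progress", "Not yet started"]

# One counter per ColumnA value, pre-filled with 0 for every status,
# mirroring the "else 0" branch of each case expression.
pivot = defaultdict(lambda: {s: 0 for s in statuses})

for a, b, _c in rows:
    # mirrors: sum(case when ColumnB = <status> then 1 else 0 end)
    pivot[a][b] += 1

for a in sorted(pivot):  # mirrors: order by ColumnA
    print(a, [pivot[a][s] for s in statuses])
```

Each printed row matches a row of the desired pivoted table: the group key followed by one count per status column.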
And this is how to achieve it in Spark (Scala):
import org.apache.spark.sql.functions.count
import spark.implicits._  // needed for toDF on the RDD of tuples

val test = spark.sparkContext.parallelize(List(
    ("1", "Complete", "yes"),
    ("1", "Complete", "yes"),
    ("2", "Inprogress", "no"),
    ("2", "Inprogress", "no"),
    ("3", "Not yet started", "initiate"),
    ("3", "Not yet started", "initiate"))
  ).toDF("ColumnA", "ColumnB", "ColumnC")
test.show()

// Pivot on ColumnB and count the ColumnC values in each group
val test_pivot = test.groupBy("ColumnA")
  .pivot("ColumnB")
  .agg(count("ColumnC"))

// Groups with no matching pivot value come back as null; fill them with 0
test_pivot.na.fill(0).show(false)
And the output:
+-------+--------+----------+---------------+
|ColumnA|Complete|Inprogress|Not yet started|
+-------+--------+----------+---------------+
|3      |0       |0         |2              |
|1      |2       |0         |0              |
|2      |0       |2         |0              |
+-------+--------+----------+---------------+
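Unlike the Hive query, Spark's pivot does not need the statuses listed up front: it first collects the distinct ColumnB values to use as columns, and na.fill(0) then zeroes the cells where a group had no matching rows. A rough plain-Python sketch of that behavior (an illustration, not Spark internals):

```python
# Sketch of pivot + na.fill(0): discover distinct pivot values,
# count per (ColumnA, ColumnB) pair, then fill missing cells with 0.
rows = [
    ("1", "Complete", "yes"),
    ("1", "Complete", "yes"),
    ("2", "Inprogress", "no"),
    ("2", "Inprogress", "no"),
    ("3", "Not yet started", "initiate"),
    ("3", "Not yet started", "initiate"),
]

# pivot("ColumnB") first collects the distinct values to become columns
columns = sorted({b for _a, b, _c in rows})

counts = {}
for a, b, _c in rows:
    counts[(a, b)] = counts.get((a, b), 0) + 1

# A missing (group, column) pair would be null in Spark;
# .get(..., 0) plays the role of na.fill(0)
table = {a: [counts.get((a, col), 0) for col in columns]
         for a in sorted({r[0] for r in rows})}

print(columns)
for a, vals in table.items():
    print(a, vals)
```

Note that because the columns are discovered at runtime, adding a new ColumnB value to the data adds a new output column automatically, whereas the Hive query would need another sum(case ...) branch.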
Comment: What have you tried so far?