Apache Spark group by: merging multiple column values from Hive into one column
I am trying to merge multiple column values into a single column based on a group-by key. Basically, I will be using the Spark 1.6 DataFrame API to create nested JSON. Sample input table abc:-
a b c d e f g
---------------------------------------------
aa bb cc dd ee ff gg
aa bb cc1 dd1 ee1 ff1 gg1
aa bb cc2 dd2 ee2 ff2 gg2
aa1 bb1 cc3 dd3 ee3 ff3 gg3
aa1 bb1 cc4 dd4 ee4 ff4 gg4
Final output, grouped by a, b:-
aa bb {{cc,dd,ee,ff,gg},{cc1,dd1,ee1,ff1,gg1},{cc2,dd2,ee2,ff2,gg2}}
aa1 bb1 {{cc3,dd3,ee3,ff3,gg3},{cc4,dd4,ee4,ff4,gg4}}
I tried using collect_list, but it can only collect a single column. I am not sure how to combine multiple columns together. I tried concatenating the columns into a string and then collecting that, but I would lose the schema mapping, since I eventually have to dump this as JSON. Collecting them as a map or struct would also work. Please suggest an elegant approach/solution to this problem. Thanks.
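Illustratively (outside Spark), the desired transformation is a group-by on (a, b) that collects the remaining columns into a list of named records, which then serializes naturally to nested JSON. A minimal plain-Python sketch of that shape, using the sample rows from table abc:

```python
import json
from collections import defaultdict

# Sample rows from table abc: (a, b, c, d, e, f, g)
rows = [
    ("aa",  "bb",  "cc",  "dd",  "ee",  "ff",  "gg"),
    ("aa",  "bb",  "cc1", "dd1", "ee1", "ff1", "gg1"),
    ("aa",  "bb",  "cc2", "dd2", "ee2", "ff2", "gg2"),
    ("aa1", "bb1", "cc3", "dd3", "ee3", "ff3", "gg3"),
    ("aa1", "bb1", "cc4", "dd4", "ee4", "ff4", "gg4"),
]

# Group by (a, b) and collect the remaining columns as named records,
# mirroring what collect_list(struct(c, d, e, f, g)) produces.
grouped = defaultdict(list)
for a, b, c, d, e, f, g in rows:
    grouped[(a, b)].append({"c": c, "d": d, "e": e, "f": f, "g": g})

for (a, b), records in grouped.items():
    print(a, b, json.dumps(records))
```

This is only a sketch of the target data shape, not a Spark solution; the field names c..g are carried explicitly so the JSON keeps its schema.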
Note: with Spark 1.6, both queries can be run via
sqlContext.sql("select ...")
Spark demo
Thanks for the reply. But we are using Spark 1.6, hence it does not work. Even on Hive it throws an exception: FAILED: UDFArgumentTypeException Only primitive type arguments are accepted.
collect_list(array(c,d,e,f,g)) or collect_list(struct(c,d,e,f,g)) works fine in Spark 2.0, but in Spark 1.6 collect_list only supports primitive types. Please suggest if there is any other possible workaround in Spark 1.6.
This has been tested successfully on Spark 1.6. It seems a path missing from my Spark installation was causing the issue. Thanks for the reply.
Did you find your path issue? If so, what was missing?
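For the workaround question above: one approach sometimes used when collect_list only accepts primitive types is to concatenate the columns into a single delimited string, collect those strings (a primitive type), and split them back into fields afterwards. The delimiter choice here (the control character "\u0001") is an assumption and must never occur in the data. A plain-Python sketch of the round trip:

```python
# Simulate the "pack, collect, unpack" workaround:
# concatenate columns with a delimiter, then split back and
# re-attach the field names so the schema is not lost.
SEP = "\u0001"  # assumption: this control character never appears in the data

names = ["c", "d", "e", "f", "g"]
row = ("cc", "dd", "ee", "ff", "gg")

# equivalent of concat_ws(SEP, c, d, e, f, g) before the collect
packed = SEP.join(row)

# after collecting the strings, split each one and rebuild named fields
fields = dict(zip(names, packed.split(SEP)))
```

The round trip is lossless as long as the delimiter is absent from the data; the field names have to be re-attached by hand, which is the schema-mapping bookkeeping the question mentions.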
select a,b
,collect_list(array(c,d,e,f,g))
from abc
group by a,b
;
+-----+-----+----------------------------------------------------------------------------------------------+
| aa | bb | [["cc","dd","ee","ff","gg"],["cc1","dd1","ee1","ff1","gg1"],["cc2","dd2","ee2","ff2","gg2"]] |
+-----+-----+----------------------------------------------------------------------------------------------+
| aa1 | bb1 | [["cc3","dd3","ee3","ff3","gg3"],["cc4","dd4","ee4","ff4","gg4"]] |
+-----+-----+----------------------------------------------------------------------------------------------+
select a,b
,collect_list(struct(c,d,e,f,g))
from abc
group by a,b
;
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa | bb | [{"col1":"cc","col2":"dd","col3":"ee","col4":"ff","col5":"gg"},{"col1":"cc1","col2":"dd1","col3":"ee1","col4":"ff1","col5":"gg1"},{"col1":"cc2","col2":"dd2","col3":"ee2","col4":"ff2","col5":"gg2"}] |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa1 | bb1 | [{"col1":"cc3","col2":"dd3","col3":"ee3","col4":"ff3","col5":"gg3"},{"col1":"cc4","col2":"dd4","col3":"ee4","col4":"ff4","col5":"gg4"}] |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
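The difference between the two queries shows up when the result is dumped as JSON: collect_list(array(...)) yields lists of values with the column names lost, while collect_list(struct(...)) yields objects that keep the field names. A small sketch of the two serializations:

```python
import json

names = ["c", "d", "e", "f", "g"]
row = ["cc", "dd", "ee", "ff", "gg"]

as_array = json.dumps(row)                     # column names are lost
as_struct = json.dumps(dict(zip(names, row)))  # field names are kept

print(as_array)   # ["cc", "dd", "ee", "ff", "gg"]
print(as_struct)  # {"c": "cc", "d": "dd", "e": "ee", "f": "ff", "g": "gg"}
```

Since the goal here is nested JSON with a usable schema, the struct variant is the one that preserves the mapping.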
[cloudera@quickstart ~]$ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Type --help for more information.
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ spark-shell
scala> sqlContext.sql("select * from abc").show;
+---+---+---+---+---+---+---+
| a| b| c| d| e| f| g|
+---+---+---+---+---+---+---+
| aa| bb| cc| dd| ee| ff| gg|
| aa| bb|cc1|dd1|ee1|ff1|gg1|
| aa| bb|cc2|dd2|ee2|ff2|gg2|
|aa1|bb1|cc3|dd3|ee3|ff3|gg3|
|aa1|bb1|cc4|dd4|ee4|ff4|gg4|
+---+---+---+---+---+---+---+
scala> sqlContext.sql("select a,b,collect_list(array(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+
| a| b| _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3, dd3, ee3, ...|
| aa| bb|[[cc, dd, ee, ff,...|
+---+---+--------------------+
scala> sqlContext.sql("select a,b,collect_list(struct(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+
| a| b| _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3,dd3,ee3,ff3...|
| aa| bb|[[cc,dd,ee,ff,gg]...|
+---+---+--------------------+