How to flatten wrapped arrays in a Spark Dataset in Java

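I am using Spark 2.2 with Java 1.8.

I need to apply collect_list to an array column, but the result is a list of wrapped arrays instead of one flat array. See below: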

Dataset<Row> df2 = df.groupBy("id").agg(collect_list("values"));
df2.show(false);
# +-----+----------------------------------------------+
# |id   |collect_list(values)                          |
# +-----+----------------------------------------------+
# |1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
# |2    |[WrappedArray(2), WrappedArray(3)]            |
# +-----+----------------------------------------------+

Expected output:

# +-----+------------------+
# |store|values            |
# +-----+------------------+
# |1    |[1, 2, 3, 4, 5, 6]|
# |2    |[2, 3]            |
# +-----+------------------+
How can I produce the output above in Spark with Java? Can anyone help? Thanks.

You can use the explode function before grouping (shown here in Scala syntax):

df.withColumn("values", explode($"values")).groupBy("id").agg(collect_list($"values"))

Here is a Scala equivalent that uses a UDF (not Java):

Output:

+-----+----------------------------------------------+-------------+
|store|values                                        |values_new   |
+-----+----------------------------------------------+-------------+
|1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|[1,2,3,4,5,6]|
|2    |[WrappedArray(2), WrappedArray(3)]            |[2,3]        |
+-----+----------------------------------------------+-------------+    
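
The Scala UDF snippet itself is not shown above. For readers working in Java, the flattening step that such a UDF would perform can be sketched in plain Java as follows. This is only an illustration, not the answer's original code: `FlattenArrays` and `flatten` are hypothetical names, and the Spark wiring (registering the UDF and converting the Scala `WrappedArray` values that Spark passes in to `java.util.List`) is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: concatenates a list of integer lists into one flat list.
// Assumption: the nested Scala sequences have already been converted to java.util.List.
final class FlattenArrays {

    private FlattenArrays() {
        // static helper only, not meant to be instantiated
    }

    static List<Integer> flatten(List<? extends List<Integer>> nested) {
        List<Integer> flat = new ArrayList<>();
        if (nested == null) {
            return flat; // treat a null group as an empty result
        }
        for (List<Integer> inner : nested) {
            if (inner != null) {
                flat.addAll(inner); // keeps the order of elements within each inner list
            }
        }
        return flat;
    }
}
```

Calling `FlattenArrays.flatten` on `[[1, 2, 3], [4, 5, 6]]` gives `[1, 2, 3, 4, 5, 6]`. Because the helper works on the list that collect_list has already built, no explode step is needed. On Spark 2.4 and later, the built-in `flatten` function in `org.apache.spark.sql.functions` does the same job without a UDF, but it is not available in Spark 2.2.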

Hope this helps.

explode is an expensive operation and takes more time. Is there a different approach? Thanks.