GroupBy and concatenate DataFrame rows with Apache Spark in Java


java apache-spark


I have a DataFrame with this schema:

id      user        keywords
1       u1, u2      key1, key2  
1       u3, u4      key3, key4
1       u5, u6      key5, key6
2       u7, u8      key7, key8
2       u9, u10     key9, key10
3       u11, u12    key11, key12
3       u13, u14    key13, key14

I need a way to group the rows by id and concatenate the strings in the user and keywords columns, so that the result looks like this:

id      user                            keywords
1       u1, u2, u3, u4, u5, u6          key1, key2, key3, key4, key5, key6
2       u7, u8, u9, u10                 key7, key8, key9, key10
3       u11, u12, u13, u14              key11, key12, key13, key14

How can I do this in Java?

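Do the following:

Create a pair RDD keyed by id, that is (id, (user, keywords)).

Call groupByKey on this RDD.

Apply a flatMap (or mapValues) over each group's user and keywords values to concatenate them.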
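
The steps above boil down to grouping by a key and joining each group's strings. As a rough illustration of that logic only, here is a plain-Java sketch with no Spark dependency; the class GroupConcatSketch, the InputRow type and the method concatById are hypothetical names made up for this example. On a JavaPairRDD the same shape would presumably be mapToPair to build (id, (user, keywords)), then groupByKey, then mapValues to join the strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of Spark or of the original answer.
class GroupConcatSketch {

    // One input row: the id plus the two comma-separated string columns.
    static final class InputRow {
        final int id;
        final String user;
        final String keywords;

        InputRow(int id, String user, String keywords) {
            this.id = id;
            this.user = user;
            this.keywords = keywords;
        }
    }

    // Groups rows by id (keeping first-seen order) and joins each column with ", ".
    // The returned array holds {users, keywords} for each id.
    static Map<Integer, String[]> concatById(List<InputRow> rows) {
        Map<Integer, List<InputRow>> grouped = rows.stream()
                .collect(Collectors.groupingBy(r -> r.id, LinkedHashMap::new, Collectors.toList()));

        Map<Integer, String[]> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<InputRow>> e : grouped.entrySet()) {
            String users = e.getValue().stream()
                    .map(r -> r.user)
                    .collect(Collectors.joining(", "));
            String keywords = e.getValue().stream()
                    .map(r -> r.keywords)
                    .collect(Collectors.joining(", "));
            result.put(e.getKey(), new String[] {users, keywords});
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data copied from the question.
        List<InputRow> rows = new ArrayList<>();
        rows.add(new InputRow(1, "u1, u2", "key1, key2"));
        rows.add(new InputRow(1, "u3, u4", "key3, key4"));
        rows.add(new InputRow(1, "u5, u6", "key5, key6"));
        rows.add(new InputRow(2, "u7, u8", "key7, key8"));
        rows.add(new InputRow(2, "u9, u10", "key9, key10"));
        rows.add(new InputRow(3, "u11, u12", "key11, key12"));
        rows.add(new InputRow(3, "u13, u14", "key13, key14"));

        // Print one tab-separated line per id.
        concatById(rows).forEach((id, cols) ->
                System.out.println(id + "\t" + cols[0] + "\t" + cols[1]));
    }
}
```

Unlike this local sketch, Spark's groupByKey makes no promise about the order of values inside a group, so the concatenated strings may come out in a different order on a cluster. If a DataFrame-only route is preferred, Spark 2.x and later offer collect_list and concat_ws in org.apache.spark.sql.functions, which can be combined inside groupBy("id").agg(...); collect_list likewise gives no ordering guarantee after a shuffle.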

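What have you tried? On this site you are expected to ask about a problem you ran into, not to request a solution for work you need done…

I have been trying with a JavaRDD, converting it to a JavaPairRDD and applying reduceByKey and aggregate, but without success. I thought there might be a better solution that can be applied directly to the DataFrame, one I am not aware of.

I'm not sure. I have trouble understanding the solution suggested in Python. It seems UserDefinedAggregateFunction does not exist in Spark 1.6.1 for Java?

Python? There is no Python there.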