Apache Spark: grouping a column into a list


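My data has the form ("name", "id"), where one name can have several ids. I want to convert it so that each name appears exactly once, paired with a list or set of all its ids.
I tried the following, but it does not work: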

val group = dataFrame.map( r => (dataFrame.rdd.filter(s => s.getAs(0) == r.getAs(0)).collect()))

I get the following error:

org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

What is the solution here? Does groupBy work for this, and if so, how?

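Suppose this is your DataFrame (toDF needs the session's implicits, which spark-shell imports for you; here spark is assumed to be your SparkSession):
import spark.implicits._
val df = Seq(("dave",1),("dave",2),("griffin",3),("griffin",4)).toDF("name","id")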

Then you can do the following:
import org.apache.spark.sql.functions.{col, collect_list}
df.groupBy(col("name")).agg(collect_list(col("id")) as "ids")
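For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch of the same groupBy approach. The object name GroupIdsByName, the helper idsByName, and the local[*] master are my own choices for illustration, not something from the question. I also added collect_set in case you want a set rather than a list. Note that Spark does not guarantee the order of elements inside the collected arrays.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, collect_set}

object GroupIdsByName {
  // Hypothetical helper: one row per name, with all ids gathered into arrays.
  def idsByName(df: DataFrame): DataFrame =
    df.groupBy(col("name")).agg(
      collect_list(col("id")).as("ids"),         // keeps duplicate ids
      collect_set(col("id")).as("distinct_ids")  // drops duplicate ids
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-ids-by-name")
      .getOrCreate()
    import spark.implicits._ // needed for Seq(...).toDF

    val df = Seq(("dave", 1), ("dave", 2), ("griffin", 3), ("griffin", 4))
      .toDF("name", "id")

    // Element order inside each array is not guaranteed.
    idsByName(df).show(false)

    spark.stop()
  }
}
```

The reason this works where the original attempt fails is that groupBy and agg are ordinary DataFrame operations planned on the driver, so no RDD is referenced from inside another transformation.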

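A follow-up question: if I have another column (say "place") that maps one-to-one to "name", how do I aggregate then? Should I do the same thing and then join, or is there a way to specify that the "place" column should be kept as is?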
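Regarding the "place" column: a join should not be necessary. Two approaches that I believe work, assuming each name really does have exactly one place, are sketched below. The object and method names are hypothetical, and the input is assumed to have the columns "name", "place" and "id".

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, first}

object GroupKeepingPlace {
  // Option 1: make `place` part of the grouping key.
  // With a one-to-one name/place mapping this yields the same groups as grouping by name alone.
  def byNameAndPlace(df: DataFrame): DataFrame =
    df.groupBy(col("name"), col("place"))
      .agg(collect_list(col("id")).as("ids"))

  // Option 2: group by name only and take one `place` value from each group.
  // This is only meaningful if all rows of a group share the same place.
  def byNameFirstPlace(df: DataFrame): DataFrame =
    df.groupBy(col("name"))
      .agg(first(col("place")).as("place"), collect_list(col("id")).as("ids"))
}
```

If the mapping turns out not to be one-to-one, the two options differ: option 1 produces one row per (name, place) pair, while option 2 silently picks one place per name.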