Scala Spark: create JSON groups by ID

I have a dataFrame unionDataDF with sample data:
+---+------------------+----+
| id| data| key|
+---+------------------+----+
| 1|[{"data":"data1"}]|key1|
| 2|[{"data":"data2"}]|key1|
| 1|[{"data":"data1"}]|key2|
| 2|[{"data":"data2"}]|key2|
+---+------------------+----+
where id is IntType, data is JsonType, and key is StringType.

I want to send each id's data over the network. For example, the output data for id "1" would look like this:

How can I do this?

Sample code for creating unionDataDF:
Versions:
Spark: 2.2
Scala: 2.11
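A minimal sketch of how unionDataDF could be constructed from the sample rows in the question (the column names come from the question; the construction itself is an assumption, since the original creation code is not shown):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-groups")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Sample rows matching the table in the question;
// the data column holds a JSON string.
val unionDataDF = Seq(
  (1, """[{"data":"data1"}]""", "key1"),
  (2, """[{"data":"data2"}]""", "key1"),
  (1, """[{"data":"data1"}]""", "key2"),
  (2, """[{"data":"data2"}]""", "key2")
).toDF("id", "data", "key")
```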
Something like this:
unionDataDF
.groupBy("id")
.agg(collect_list(struct("key", "data")).alias("grouped"))
.show(10, false)
Output:
+---+--------------------------------------------------------+
|id |grouped |
+---+--------------------------------------------------------+
|1 |[[key1, [{"data":"data1"}]], [key2, [{"data":"data1"}]]]|
|2 |[[key1, [{"data":"data2"}]], [key2, [{"data":"data2"}]]]|
+---+--------------------------------------------------------+
Thanks for your reply. What should I do after this step, when iterating with "for(row)"?

What exactly are you trying to do? Send the data over the network as JSON strings? Can you send it only from the executors, or from the driver? In batches, etc.?