Apache Spark: How to write all rows of a streaming DataFrame to Kafka as a JSON array?

I am looking for a way to write Spark streaming data to Kafka. I currently write the data to Kafka as follows:

df.selectExpr("to_json(struct(*)) AS value").writeStream.format("kafka")

My problem is that when the data is written to Kafka, it appears as follows:

{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}

My expected output is:

   [
    {"country":"US","plan":postpaid,"value":300}
    {"country":"CAN","plan":0.0,"value":30}
   ]

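I want the rows to be enclosed in an array. How can I achieve this in Spark streaming? Can anyone give me some advice?

I'm really not sure whether this is feasible, but I'll post my suggestion here anyway. You can convert the DataFrame afterwards: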

 //Input  
 inputDF.show(false)
 +---+-------+
 |int|string |
 +---+-------+
 |1  |string1|
 |2  |string2|
 +---+-------+

 //convert that to json
 inputDF.toJSON.show(false)
 +----------------------------+
 |value                       |
 +----------------------------+
 |{"int":1,"string":"string1"}|
 |{"int":2,"string":"string2"}|
 +----------------------------+

 //then use collect and mkString
 println(inputDF.toJSON.collect().mkString("[", "," , "]"))
 [{"int":1,"string":"string1"},{"int":2,"string":"string2"}]

I assume the schema of the streaming DataFrame (df) is as follows:

root
 |-- country: string (nullable = true)
 |-- plan: string (nullable = true)
 |-- value: string (nullable = true)

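I also assume that you want to write (produce) all rows of the streaming DataFrame (df) to a Kafka topic as a single record in which the rows form a JSON array.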

If so, you should groupBy the rows and use collect_list to gather all of them into a single row that you can write to Kafka.

// df is a batch DataFrame so I could show for demo purposes
scala> df.show
+-------+--------+-----+
|country|    plan|value|
+-------+--------+-----+
|     US|postpaid|  300|
|    CAN|     0.0|   30|
+-------+--------+-----+

val jsons = df.selectExpr("to_json(struct(*)) AS value")
scala> jsons.show(truncate = false)
+------------------------------------------------+
|value                                           |
+------------------------------------------------+
|{"country":"US","plan":"postpaid","value":"300"}|
|{"country":"CAN","plan":"0.0","value":"30"}     |
+------------------------------------------------+

val grouped = jsons.groupBy().agg(collect_list("value") as "value")
scala> grouped.show(truncate = false)
+-----------------------------------------------------------------------------------------------+
|value                                                                                          |
+-----------------------------------------------------------------------------------------------+
|[{"country":"US","plan":"postpaid","value":"300"}, {"country":"CAN","plan":"0.0","value":"30"}]|
+-----------------------------------------------------------------------------------------------+
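
One caveat on the result above: collect_list produces an array<string> column, while the Kafka sink expects value to be a string or binary column. The following is a minimal sketch, not part of the original answer, of one way to join the collected JSON strings into a single JSON array string. The object and method names (JsonArrayString, asJsonArray) are hypothetical, and the sample rows simply mirror the demo data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat, concat_ws, lit}

object JsonArrayString {
  // Hypothetical helper: expects a string column `value` holding one JSON
  // object per row and returns a single row whose `value` is a JSON array string.
  def asJsonArray(jsons: DataFrame): DataFrame =
    jsons.agg(
      concat(lit("["), concat_ws(",", collect_list(col("value"))), lit("]")).as("value"))

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-array-string")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the demo data above.
    val df = Seq(("US", "postpaid", "300"), ("CAN", "0.0", "30"))
      .toDF("country", "plan", "value")
    val jsons = df.selectExpr("to_json(struct(*)) AS value")

    asJsonArray(jsons).show(truncate = false)
    spark.stop()
  }
}
```

The element order inside the array follows whatever order collect_list sees the rows in, which is not guaranteed across partitions.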

I would do all of the above in […] to get a DataFrame to work with.
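
The answer as preserved here does not name where these steps would run. One possibility, assumed here rather than taken from the answer, is DataStreamWriter.foreachBatch (available since Spark 2.4), which hands each micro-batch to a function as an ordinary batch DataFrame. The sketch below collapses each micro-batch into one JSON array string and writes it with the batch Kafka writer. The rate source, broker address, topic name, checkpoint path, and the names JsonArrayToKafka and toJsonArray are all placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}

object JsonArrayToKafka {
  // Hypothetical helper: collapse a batch DataFrame into a single row whose
  // `value` column is a JSON array string with one object per input row.
  def toJsonArray(batch: DataFrame): DataFrame =
    batch.agg(to_json(collect_list(struct(batch.columns.map(col): _*))).as("value"))

  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("json-array-to-kafka").getOrCreate()

    // Placeholder streaming source; substitute the real streaming DataFrame.
    val df: DataFrame = spark.readStream.format("rate").load()

    // Explicitly typed function value, which avoids the overload ambiguity
    // that a bare lambda can trigger with foreachBatch on Scala 2.12.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) => {
      if (!batch.isEmpty) {
        toJsonArray(batch).write
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
          .option("topic", "output-topic")                     // assumed topic
          .save()
      }
    }

    df.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/checkpoints/json-array") // assumed path
      .start()
      .awaitTermination()
  }
}
```

Because the DataFrame passed to the function is a batch DataFrame, the aggregation inside it is a regular batch aggregation, which may help when the streaming query itself already contains an aggregation.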

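By "expected output", do you mean that you need an array (multiple rows) in Kafka even though you only wrote one row to Kafka? If that is what you expect, then perhaps just write multiple rows to Kafka in the first place.

I want an array (multiple rows) in Kafka… whatever DataFrame I write at a given time should be enclosed in an array.

Could you paste the output of df.printSchema into your question?

The collect() operation does not work in Spark Structured Streaming.

I could accept the answer… but the problem is that I already use a grouping operation to produce the final DataFrame. I cannot apply another grouping operation, because I am using the Structured Streaming API (which does not support multi-level aggregation).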