Apache Spark: creating JSON from a DataFrame
I'm new to Spark, I've been working on something, and I'm stuck. Any pointers would help. I'm trying to create JSON from a DataFrame I have, but the toJSON function didn't get me there. My output DataFrame looks like this:
+----------+------------+-------------+
|booking_id|status      |count(status)|
+----------+------------+-------------+
|132       |rent count  |6            |
|132       |rent booked |24           |
|132       |rent delayed|6            |
|134       |rent booked |34           |
|134       |rent delayed|21           |
+----------+------------+-------------+
The output I'm looking for is a DataFrame with the booking id and, as JSON, each status together with its count:
+----------+-------------------------------------------------------+
|booking_id|status_json                                            |
+----------+-------------------------------------------------------+
|132       |{"rent count": 6, "rent booked": 24, "rent delayed": 6}|
|134       |{"rent booked": 34, "rent delayed": 21}                |
+----------+-------------------------------------------------------+
Thanks in advance.

For Spark 2.4+, use map_from_arrays.

Alternatively, first create a map column from the status and count(status) columns. Then groupBy, agg(collect_list(your_map_column)), and finally call to_json.
from pyspark.sql import functions as F

# Per booking: collect statuses and counts, zip them into a map, serialize it to JSON.
df.groupBy("booking_id").agg(
    F.to_json(
        F.map_from_arrays(F.collect_list("status"), F.collect_list("count(status)"))
    ).alias("status_json")
).show(truncate=False)
#+----------+--------------------------------------------------+
#|booking_id|status_json |
#+----------+--------------------------------------------------+
#|132 |{"rent count":6,"rent booked":24,"rent delayed":6}|
#|134 |{"rent booked":34,"rent delayed":21} |
#+----------+--------------------------------------------------+
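As a sanity check on what that aggregation computes, the same grouping can be sketched in plain Python without Spark. The `rows` list and the `status_json_by_booking` helper are illustrative names and not part of the original answer; the data is copied from the question.

```python
import json
from collections import defaultdict

# Rows copied from the question: (booking_id, status, count(status)).
rows = [
    (132, "rent count", 6),
    (132, "rent booked", 24),
    (132, "rent delayed", 6),
    (134, "rent booked", 34),
    (134, "rent delayed", 21),
]


def status_json_by_booking(rows):
    """Group rows by booking_id and serialize each status -> count map as JSON."""
    grouped = defaultdict(dict)
    for booking_id, status, count in rows:
        grouped[booking_id][status] = count
    # Compact separators mimic the whitespace-free JSON shown in the Spark output.
    return {bid: json.dumps(m, separators=(",", ":")) for bid, m in grouped.items()}


result = status_json_by_booking(rows)
print(result[132])
print(result[134])
```

Unlike this sketch, Spark does not guarantee the order of collect_list results after a shuffle, so the key order in the real output may differ from run to run.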
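// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions.{col, collect_list, map, to_json}
import spark.implicits._ // enables .toDF on a Seq

val sourceDF = Seq(
  (132, "rent count", 6),
  (132, "rent booked", 24),
  (132, "rent delayed", 6),
  (134, "rent booked", 34),
  (134, "rent delayed", 21)
).toDF("booking_id", "status", "count(status)")

// Build a one-entry map per row, collect the maps per booking, and serialize the array.
val resDF = sourceDF
  .groupBy("booking_id")
  .agg(to_json(collect_list(map(col("status"), col("count(status)")))).alias("status_json"))

resDF.show(false)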
// +----------+--------------------------------------------------------+
// |booking_id|status_json |
// +----------+--------------------------------------------------------+
// |132 |[{"rent count":6},{"rent booked":24},{"rent delayed":6}]|
// |134 |[{"rent booked":34},{"rent delayed":21}] |
// +----------+--------------------------------------------------------+
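Note that this version returns a JSON array of single-entry objects rather than the single object asked for in the question. If that array is what you already have, one way to flatten it is sketched below in plain Python. `merge_single_entry_maps` is a hypothetical helper name, and the input string is copied from the output above.

```python
import json


def merge_single_entry_maps(json_array):
    """Merge a JSON array of objects into one JSON object (later keys win)."""
    merged = {}
    for entry in json.loads(json_array):
        merged.update(entry)
    # Compact separators keep the same whitespace-free style as the Spark output.
    return json.dumps(merged, separators=(",", ":"))


array_json = '[{"rent count":6},{"rent booked":24},{"rent delayed":6}]'
merged_json = merge_single_entry_maps(array_json)
print(merged_json)
```

Within Spark 2.4+ itself, map_from_entries applied to a collected list of (status, count) structs should produce the same single-object shape without leaving the DataFrame, although neither original answer shows that variant.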