How do I convert a PySpark DataFrame to JSON?


I have a PySpark DataFrame that I want to convert into a list of JSON objects. To do this, I tried the following:

df.toJSON().collect()
However, this operation sends all of the data to the driver, which takes a lot of time and is expensive, and my DataFrame contains millions of records. Is there another way to do this, or a cheaper alternative to collect()?

Here is my DataFrame:

      product cost
      pen      10
      book     40
      bottle   80
      glass    55
The desired output is:

df2 = [{product:'pen',cost:10},{product:'book',cost:40},{product:'bottle',cost:80},{product:'glass',cost:55}]

When I print the datatype of df2, it should be list.

If you want to create a JSON object inside the DataFrame, use the collect_list + create_map + to_json functions.

(or)

If you want to write the JSON document out to a file, don't build the string yourself; use .write.json() instead.

Creating the JSON object:

from pyspark.sql.functions import collect_list, create_map, lit

# Aggregate all rows into a single array of maps, then serialize it to one JSON string.
df.agg(collect_list(create_map(lit("product"), "product", lit("cost"), "cost")).alias("stru")) \
    .selectExpr("to_json(stru) as json") \
    .show(10, False)

#+-------------------------------------------------------------------------------------------------------------------------------+
#|json                                                                                                                           |
#+-------------------------------------------------------------------------------------------------------------------------------+
#|[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]|
#+-------------------------------------------------------------------------------------------------------------------------------+
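
If the end goal is a Python list on the driver, one option is to fetch that single aggregated JSON string and parse it with the standard json module. This is a minimal sketch, not part of the answer above; it still moves the data to the driver, but as one pre-serialized string rather than millions of Row objects:

import json

# first() returns the single aggregated Row; its "json" column holds the array string.
row = df.agg(collect_list(create_map(lit("product"), "product", lit("cost"), "cost")).alias("stru")) \
    .selectExpr("to_json(stru) as json") \
    .first()

df2 = json.loads(row["json"])  # a Python list of dicts
print(type(df2))               # <class 'list'>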


# Write to HDFS as plain text with .saveAsTextFile:
df.agg(collect_list(create_map(lit("product"), "product", lit("cost"), "cost")).alias("stru")) \
    .selectExpr("to_json(stru) as json") \
    .rdd.map(lambda x: x["json"]) \
    .saveAsTextFile("<path>")

#cat part-00000
#[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]

# Or write directly as a JSON file with .write.json():
df.agg(collect_list(create_map(lit("product"), "product", lit("cost"), "cost")).alias("stru")) \
    .write.mode("overwrite").json("<path>")

#cat part-00000-3a19165e-219e-4485-adb8-ef91589d6e31-c000.json
#{"stru":[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]}


I have already tried that, but the output is a PySpark DataFrame, and I don't want a PySpark DataFrame. I just want to convert the PySpark DataFrame into a list containing JSON objects, without using the collect and toJSON functions. @Shu
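
For what it's worth, any Python list has to live in the driver's memory, so some transfer is unavoidable. A sketch, assuming you still need a driver-side list: toLocalIterator() streams one partition at a time instead of materializing everything at once, and avoids both toJSON and collect:

# Streams partitions to the driver one at a time; no toJSON, no collect,
# but the finished list still occupies driver memory.
df2 = [row.asDict() for row in df.toLocalIterator()]
print(type(df2))  # <class 'list'>
print(df2[0])     # e.g. {'product': 'pen', 'cost': 10}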