Pyspark: export a parquet file's schema to JSON or CSV

I need to extract the schema of a parquet file in JSON, TXT, or CSV format. It should include the column names and data types from the parquet file.

For example:

{"id", "type" : "integer" },
 {"booking_date""type" : "timestamp", "format" : "%Y-%m-%d %H:%M:%S.%f" }

We can read the schema from the parquet file with .schema, convert it to JSON, and finally save it as a text file.

Input parquet file:

spark.read.parquet("/tmp").printSchema()
#root
#|-- id: integer (nullable = true)
#|-- name: string (nullable = true)
#|-- booking_date: timestamp (nullable = true)

Extract the schema and write it to HDFS/the local filesystem:

spark.sparkContext.parallelize(                  #convert the schema JSON string to an RDD
    [spark.read.parquet("/tmp").schema.json()]   #read the schema of the parquet file
).repartition(1) \
 .saveAsTextFile("/tmp_schema/")                 #save the file into HDFS

Read the output file from HDFS:

$ hdfs dfs -cat /tmp_schema/part-00000
{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],"type":"struct"}