PySpark: export Parquet file schema to JSON or CSV
I need to extract the schema of a Parquet file into JSON, TXT, or CSV format. It should include the column names and data types from the Parquet file. For example:
{"name": "id", "type": "integer"},
{"name": "booking_date", "type": "timestamp", "format": "%Y-%m-%d %H:%M:%S.%f"}
We can read the schema from the Parquet file with .schema, convert it to JSON, and finally save it as a text file.
Input parquet file:
spark.read.parquet("/tmp").printSchema()
#root
#|-- id: integer (nullable = true)
#|-- name: string (nullable = true)
#|-- booking_date: timestamp (nullable = true)
Extract the schema and write it to HDFS/local filesystem:
spark.sparkContext.parallelize(                  # convert the string into an RDD
    [spark.read.parquet("/tmp").schema.json()]   # read the schema of the parquet file
).repartition(1).saveAsTextFile("/tmp_schema/")  # save the file to HDFS
Reading the output file from HDFS:
$ hdfs dfs -cat /tmp_schema/part-00000
{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],"type":"struct"}
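Since the saved schema is plain JSON, turning it into the requested CSV of column names and types needs only the standard library. A sketch that works on the exact string shown above (no Spark required):

```python
import csv
import io
import json

# Schema JSON as produced by df.schema.json() (copied from the output above).
schema_json = (
    '{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},'
    '{"metadata":{},"name":"name","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],'
    '"type":"struct"}'
)

schema = json.loads(schema_json)

# Write one CSV row per column: name, type, nullable.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "type", "nullable"])
for field in schema["fields"]:
    writer.writerow([field["name"], field["type"], field["nullable"]])

print(buf.getvalue())
```

This prints a header row followed by `id,integer,True`, `name,string,True`, and `booking_date,timestamp,True`; writing `buf.getvalue()` to a `.csv` file gives the format the question asks for. Note that nested types (structs, arrays) appear as JSON objects in the `type` field and would need extra flattening.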