PySpark: export Parquet file schema to JSON or CSV
I need to extract the schema of a Parquet file into JSON, TXT, or CSV format. It should include the column names and data types from the Parquet file. For example:
{"name": "id", "type": "integer"},
{"name": "booking_date", "type": "timestamp", "format": "%Y-%m-%d %H:%M:%S.%f"}
We can read the schema from the Parquet file with .schema, convert it to JSON, and finally save it as a text file.
Input parquet file:
spark.read.parquet("/tmp").printSchema()
#root
#|-- id: integer (nullable = true)
#|-- name: string (nullable = true)
#|-- booking_date: timestamp (nullable = true)
Extract the schema and write it to HDFS/local filesystem:
spark.sparkContext.parallelize(                  # convert the string into an RDD
    [spark.read.parquet("/tmp").schema.json()]   # read the schema of the parquet file
).repartition(1).saveAsTextFile("/tmp_schema/")  # save the file to HDFS
Reading the output file from HDFS:
$ hdfs dfs -cat /tmp_schema/part-00000
{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],"type":"struct"}
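Since the saved schema is plain JSON, turning it into the requested CSV of column names and types needs only the standard library. A sketch that works on the exact string shown above (no Spark required):

```python
import csv
import io
import json

# Schema JSON as produced by df.schema.json() (copied from the output above).
schema_json = (
    '{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},'
    '{"metadata":{},"name":"name","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],'
    '"type":"struct"}'
)

schema = json.loads(schema_json)

# Write one CSV row per column: name, type, nullable.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "type", "nullable"])
for field in schema["fields"]:
    writer.writerow([field["name"], field["type"], field["nullable"]])

print(buf.getvalue())
```

This prints a header row followed by `id,integer,True`, `name,string,True`, and `booking_date,timestamp,True`; writing `buf.getvalue()` to a `.csv` file gives the format the question asks for. Note that nested types (structs, arrays) appear as JSON objects in the `type` field and would need extra flattening.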