Python: Config file to define a JSON schema structure in PySpark


python apache-spark pyspark


I have created a PySpark application that reads a JSON file into a DataFrame using a predefined schema. Here is a code sample:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])
df = sqlContext.read.json(file, schema)

I need a way to define this schema in some kind of config or ini file and read it from the PySpark application.

That would let me change the JSON schema in the future, if the need arises, without touching the main PySpark code.

StructType provides the json and jsonValue methods, which return its JSON string and dict representations respectively, and fromJson, which converts a Python dictionary back into a StructType.

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

# Round trip: StructType -> dict -> StructType
StructType.fromJson(schema.jsonValue())

Beyond that, the only thing you need is the built-in json module to parse the input into a dict that StructType can consume.
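As a rough sketch of how this could look for a schema kept in a local JSON file: the helper names `read_schema_dict` and `load_schema`, and the file name `schema.json`, are assumptions made for illustration and do not come from the original answer. The PySpark import is done lazily so that the parsing step can be exercised without a Spark installation.

```python
import json


def read_schema_dict(path):
    """Parse a JSON schema file into a plain Python dict (standard library only)."""
    with open(path, encoding="utf-8") as fh:
        schema_dict = json.load(fh)
    # StructType.jsonValue() produces {"type": "struct", "fields": [...]};
    # reject anything that does not have that shape before handing it to Spark.
    if schema_dict.get("type") != "struct" or not isinstance(schema_dict.get("fields"), list):
        raise ValueError(f"{path}: expected a struct schema with a 'fields' list")
    return schema_dict


def load_schema(path):
    """Build a StructType from the JSON schema file at `path`."""
    # Imported here so read_schema_dict works even when PySpark is not installed.
    from pyspark.sql.types import StructType

    return StructType.fromJson(read_schema_dict(path))
```

With the schema from the question stored this way (the `timestamp` field would use `"type": "long"`), the read call would become `df = sqlContext.read.json(file, load_schema("schema.json"))`, and later schema changes would only touch the JSON file.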


A Scala version of this approach is covered in a separate question.

You can create a JSON file named schema.json in the following format:

{
  "fields": [
    {
      "metadata": {},
      "name": "first_fields",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "double_field",
      "nullable": true,
      "type": "double"
    }
  ],
  "type": "struct"
}
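A file in this format does not have to be written by hand. Assuming a StructType already exists in code, a small helper along these lines could write it out; the name `dump_schema` is hypothetical, and only the `jsonValue()` method mentioned in the previous answer is relied on.

```python
import json


def dump_schema(schema, path):
    """Write `schema` to `path` as indented JSON.

    `schema` is assumed to be a pyspark.sql.types.StructType; only its
    jsonValue() method, which returns a plain dict, is used here.
    """
    with open(path, "w", encoding="utf-8") as fh:
        # sort_keys gives a stable key order, so later edits produce small diffs.
        json.dump(schema.jsonValue(), fh, indent=2, sort_keys=True)
        fh.write("\n")
```

Sorting the keys is a design choice rather than a requirement: the key order inside the file does not matter when it is parsed back into a dict.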

Then create the struct schema by reading this file:

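import json
from pyspark.sql.types import StructType

# wholeTextFiles yields (path, content) pairs; take the content of the first file
rdd = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json")
text = rdd.collect()[0][1]
schema_dict = json.loads(text)  # renamed from `dict` to avoid shadowing the built-in
custom_schema = StructType.fromJson(schema_dict)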
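The question also mentions ini files. One possible sketch, under the assumption that a flat schema with simple type names is enough, maps each `name = type` line of an ini section to a field entry in the same dict layout that `jsonValue()` produces. The function name `schema_dict_from_ini`, the section name `schema`, and the ini layout are all made up for illustration; every field is marked nullable, and nested types are not handled.

```python
import configparser


def schema_dict_from_ini(path, section="schema"):
    """Build a dict in the layout of StructType.jsonValue() from an ini section.

    Expected file contents (hypothetical layout):

        [schema]
        domain = string
        timestamp = long
    """
    parser = configparser.ConfigParser()
    # configparser lowercases option names by default; keep field names as written.
    parser.optionxform = str
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(path)
    fields = [
        {"name": name, "type": type_name.strip(), "nullable": True, "metadata": {}}
        for name, type_name in parser[section].items()
    ]
    return {"type": "struct", "fields": fields}
```

The resulting dict can be passed to `StructType.fromJson(...)` in the same way as the parsed JSON file above.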