Converting a multi-level JSON file to a dataframe using pyspark
My input JSON file is:
{
  "Name": "Test",
  "Mobile": 12345678,
  "Boolean": true,
  "Pets": ["Dog", "cat"],
  "Address": {
    "Permanent address": "USA",
    "current Address": "AU"
  }
}
The requirement is to convert the multi-level JSON above into a dataframe using pyspark. I tried the following code:
path_to_input = "/FileStore/tables/sample_json_file2-6c20f.json"
df = spark.read.json(sc.wholeTextFiles(path_to_input).values())
df.show()
The output I get is:
+---------+-------+--------+----+----------+
| Address|Boolean| Mobile|Name| Pets|
+---------+-------+--------+----+----------+
|[USA, AU]| true|12345678|Test|[Dog, cat]|
+---------+-------+--------+----+----------+
For the Address and Pets fields, I get both values in the same column. The address should not come through as a single array-like value; I need Permanent address = USA and current Address = AU as separate values.

You can try the following:
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, LongType

schema_json = StructType([StructField("Address", StringType(), True),
                          StructField("Boolean", BooleanType(), True),
                          StructField("Mobile", LongType(), True),
                          StructField("Name", StringType(), True),
                          StructField("Pets", StringType(), True)])
df = spark.read.json(path="/FileStore/tables/sample_json_file2-6c20f.json", schema = schema_json)
df.show(truncate=False)
This produces the following output:
+--------------------------------------------------+-------+--------+----+-------------+
|Address |Boolean|Mobile |Name|Pets |
+--------------------------------------------------+-------+--------+----+-------------+
|{"Permanent address":"USA","current Address":"AU"}|true |12345678|Test|["Dog","cat"]|
+--------------------------------------------------+-------+--------+----+-------------+
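Note that declaring Address as StringType makes Spark keep the nested object as its raw JSON text rather than parsing it into a struct. Conceptually (a plain-Python sketch of what lands in that column, using the sample record from the question):

```python
import json

# The nested object from the input file
nested = {"Permanent address": "USA", "current Address": "AU"}

# With Address read as StringType, the column holds the
# serialized JSON text rather than a parsed struct:
address_as_string = json.dumps(nested, separators=(",", ":"))
print(address_as_string)
# → {"Permanent address":"USA","current Address":"AU"}
```

That raw string is what `get_json_object` can then query with JSONPath expressions.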
EDIT

If you want Permanent address and current Address as separate columns, you can do the following:
from pyspark.sql.functions import get_json_object

df = spark.read.json(path="/FileStore/tables/sample_json_file2-6c20f.json", schema=schema_json)\
    .select("Boolean", "Mobile", "Name", "Pets",
            get_json_object('Address', "$.Permanent address").alias('Permanent address'),
            get_json_object('Address', "$.current Address").alias('current Address'))
df.show(truncate=False)
Output:
+-------+--------+----+-------------+-----------------+---------------+
|Boolean|Mobile |Name|Pets |Permanent address|current Address|
+-------+--------+----+-------------+-----------------+---------------+
|true |12345678|Test|["Dog","cat"]|USA |AU |
+-------+--------+----+-------------+-----------------+---------------+
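The `get_json_object` calls above pull individual keys out of the JSON text stored in the Address column. Outside Spark, the same extraction can be sketched with the standard json module (the inline record below mirrors the input file):

```python
import json

# Inline copy of the sample record from the question
raw = '''{
  "Name": "Test",
  "Mobile": 12345678,
  "Boolean": true,
  "Pets": ["Dog", "cat"],
  "Address": {
    "Permanent address": "USA",
    "current Address": "AU"
  }
}'''

record = json.loads(raw)

# Flatten the nested Address object into top-level fields,
# mirroring the columns produced by the get_json_object select:
row = {k: v for k, v in record.items() if k != "Address"}
row["Permanent address"] = record["Address"]["Permanent address"]
row["current Address"] = record["Address"]["current Address"]

print(row["Permanent address"], row["current Address"])
# → USA AU
```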
What have you tried so far? Can you post some code in the question? Does this answer your question? Thank you, but we need Address broken out so that Permanent address and current Address are separate columns.
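Since the final request is to get every nested field into its own column, one general way to think about it is a recursive flattener. The helper below (`flatten` is a hypothetical name, plain-Python sketch only; inside Spark you would instead select the struct's nested fields or use `get_json_object` as shown above) joins nested key paths with dots:

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into a single-level dict,
    joining key paths with dots; lists and scalars are kept as-is."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

record = {"Name": "Test", "Address": {"Permanent address": "USA", "current Address": "AU"}}
print(flatten(record))
# → {'Name': 'Test', 'Address.Permanent address': 'USA', 'Address.current Address': 'AU'}
```

Each leaf of the nested JSON becomes one flat key, which maps directly onto one dataframe column.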