Converting a multi-level JSON file to a dataframe using pyspark
My input JSON file is:
{
  "Name": "Test",
  "Mobile": 12345678,
  "Boolean": true,
  "Pets": ["Dog", "cat"],
  "Address": {
    "Permanent address": "USA",
    "current Address": "AU"
  }
}
The requirement is to convert the multi-level JSON above into a dataframe using pyspark. I tried the following code:
path_to_input = "/FileStore/tables/sample_json_file2-6c20f.json"
df = spark.read.json(sc.wholeTextFiles(path_to_input).values())
df.show()
The output I get is:
+---------+-------+--------+----+----------+
| Address|Boolean| Mobile|Name| Pets|
+---------+-------+--------+----+----------+
|[USA, AU]| true|12345678|Test|[Dog, cat]|
+---------+-------+--------+----+----------+
For the Address and Pets fields, I get both values in the same column. The address should not come through as a single array-like value; I need Permanent address = USA and current Address = AU as separate values.

You can try the following:
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, LongType

schema_json = StructType([StructField("Address", StringType(), True),
                          StructField("Boolean", BooleanType(), True),
                          StructField("Mobile", LongType(), True),
                          StructField("Name", StringType(), True),
                          StructField("Pets", StringType(), True)])
df = spark.read.json(path="/FileStore/tables/sample_json_file2-6c20f.json", schema = schema_json)
df.show(truncate=False)
This produces the following output:
+--------------------------------------------------+-------+--------+----+-------------+
|Address |Boolean|Mobile |Name|Pets |
+--------------------------------------------------+-------+--------+----+-------------+
|{"Permanent address":"USA","current Address":"AU"}|true |12345678|Test|["Dog","cat"]|
+--------------------------------------------------+-------+--------+----+-------------+
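Note that declaring Address as StringType makes Spark keep the nested object as its raw JSON text rather than parsing it into a struct. Conceptually (a plain-Python sketch of what lands in that column, using the sample record from the question):

```python
import json

# The nested object from the input file
nested = {"Permanent address": "USA", "current Address": "AU"}

# With Address read as StringType, the column holds the
# serialized JSON text rather than a parsed struct:
address_as_string = json.dumps(nested, separators=(",", ":"))
print(address_as_string)
# → {"Permanent address":"USA","current Address":"AU"}
```

That raw string is what `get_json_object` can then query with JSONPath expressions.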
EDIT

If you want Permanent address and current Address as separate columns, you can do the following:
from pyspark.sql.functions import get_json_object

df = spark.read.json(path="/FileStore/tables/sample_json_file2-6c20f.json", schema=schema_json)\
    .select("Boolean", "Mobile", "Name", "Pets",
            get_json_object('Address', "$.Permanent address").alias('Permanent address'),
            get_json_object('Address', "$.current Address").alias('current Address'))
df.show(truncate=False)
Output:
+-------+--------+----+-------------+-----------------+---------------+
|Boolean|Mobile |Name|Pets |Permanent address|current Address|
+-------+--------+----+-------------+-----------------+---------------+
|true |12345678|Test|["Dog","cat"]|USA |AU |
+-------+--------+----+-------------+-----------------+---------------+
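The `get_json_object` calls above pull individual keys out of the JSON text stored in the Address column. Outside Spark, the same extraction can be sketched with the standard json module (the inline record below mirrors the input file):

```python
import json

# Inline copy of the sample record from the question
raw = '''{
  "Name": "Test",
  "Mobile": 12345678,
  "Boolean": true,
  "Pets": ["Dog", "cat"],
  "Address": {
    "Permanent address": "USA",
    "current Address": "AU"
  }
}'''

record = json.loads(raw)

# Flatten the nested Address object into top-level fields,
# mirroring the columns produced by the get_json_object select:
row = {k: v for k, v in record.items() if k != "Address"}
row["Permanent address"] = record["Address"]["Permanent address"]
row["current Address"] = record["Address"]["current Address"]

print(row["Permanent address"], row["current Address"])
# → USA AU
```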
What have you tried so far? Can you post some code in the question? Does this answer your question? Thank you, but we need Address broken out so that Permanent address and current Address are separate columns.
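Since the final request is to get every nested field into its own column, one general way to think about it is a recursive flattener. The helper below (`flatten` is a hypothetical name, plain-Python sketch only; inside Spark you would instead select the struct's nested fields or use `get_json_object` as shown above) joins nested key paths with dots:

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into a single-level dict,
    joining key paths with dots; lists and scalars are kept as-is."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

record = {"Name": "Test", "Address": {"Permanent address": "USA", "current Address": "AU"}}
print(flatten(record))
# → {'Name': 'Test', 'Address.Permanent address': 'USA', 'Address.current Address': 'AU'}
```

Each leaf of the nested JSON becomes one flat key, which maps directly onto one dataframe column.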