Python: How to convert a JSON string with multiple keys from a Spark DataFrame row in PySpark?
I am looking for help with parsing a JSON string that contains multiple keys into a JSON structure; see the required output below. A previous answer shows how to convert a JSON string with a single id:

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'

How can I convert thousands of ids across jstr1, jstr2, ... when the number of ids varies from string to string?
Current code:
from pyspark.sql import Row
import pyspark.sql.functions as F

jstr1 = """
{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
"""
jstr2 = """
{"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}],
"id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
"id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
"""
schema = "map<string, array<struct<a:int,b:int>>>"
df = sqlContext.createDataFrame([Row(json=jstr1),Row(json=jstr2)]) \
.withColumn('json', F.from_json(F.col('json'), schema))
output = df.withColumn("id", F.map_keys("json").getItem(0)) \
.withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)
Desired output:
+------------------------+------+
| json                   | id   |
+------------------------+------+
| [[[1, 2], [3, 4]]]     | id_1 |
| [[[5, 6], [7, 8]]]     | id_2 |
| [[[9, 10], [11, 12]]]  | id_3 |
| [[[13, 14], [15, 16]]] | id_4 |
| [[[17, 18], [19, 20]]] | id_5 |
+------------------------+------+
# NOTE: There is a large number of ids in each JSON string,
# so hard-coding getItem(0), getItem(1), ... is not a valid solution
...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+
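As a sanity check outside Spark, parsing one of these strings with the standard json module shows that each string carries several top-level ids, which is why indexing with getItem(0) only recovers the first one. A minimal pure-Python sketch:

```python
import json

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}'

# The parsed object is a dict with one entry per id; a fixed
# getItem(0) on the map column only ever sees the first key.
parsed = json.loads(jstr1)
print(list(parsed.keys()))  # ['id_1', 'id_2']
```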
Exploding the map column will do the job:
import pyspark.sql.functions as F
df.select(F.explode('json').alias('id', 'json')).show()
+----+--------------------+
| id| json|
+----+--------------------+
|id_1| [[1, 2], [3, 4]]|
|id_2| [[5, 6], [7, 8]]|
|id_3| [[9, 10], [11, 12]]|
|id_4|[[12, 14], [15, 16]]|
|id_5|[[17, 18], [19, 10]]|
+----+--------------------+
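Conceptually, explode on a map column emits one row per (key, value) pair, however many keys the map has. The same flattening can be sketched in plain Python for illustration:

```python
import json

jstr2 = '''{"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}],
"id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
"id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}'''

# One (id, array) row per map entry, regardless of how many
# ids the JSON string contains -- no hard-coded getItem calls.
rows = [(key, value) for key, value in json.loads(jstr2).items()]
for rid, arrs in rows:
    print(rid, arrs)
```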
To get the other desired output from your previous question, you can explode again, this time on the array column that comes from the map's values:
df.select(
F.explode('json').alias('id', 'json')
).select(
'id', F.explode('json').alias('json')
).select(
'id', 'json.*'
).show()
+----+---+---+
| id| a| b|
+----+---+---+
|id_1| 1| 2|
|id_1| 3| 4|
|id_2| 5| 6|
|id_2| 7| 8|
|id_3| 9| 10|
|id_3| 11| 12|
|id_4| 12| 14|
|id_4| 15| 16|
|id_5| 17| 18|
|id_5| 19| 10|
+----+---+---+
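The double explode can likewise be mirrored in pure Python: first flatten the map into (id, array) pairs, then flatten each array into one row per struct. A small sketch using the same jstr1 data:

```python
import json

jstr1 = '''{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}'''

# First explode: one (id, array) pair per map key;
# second explode: one (id, a, b) row per struct in each array.
flat = [
    (rid, item["a"], item["b"])
    for rid, items in json.loads(jstr1).items()
    for item in items
]
print(flat)  # [('id_1', 1, 2), ('id_1', 3, 4), ('id_2', 5, 6), ('id_2', 7, 8)]
```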