Spark DataFrame: convert a column containing a JSON string from StringType to ArrayType(StringType())
I have a dataframe df that contains a JSON string, as shown below:
'''[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]'''
Schema of df:
root
|-- col1: string (nullable = true)
How can I convert it to an array of strings, i.e. ArrayType(StringType())?
The result should be:
['{"@id":"Party_1","@OriginatingObjectID":"Policy_1"}',
'{"@id":"Party_2","@OriginatingObjectID":"Policy_2"}',
'{"@id":"Party_3","@OriginatingObjectID":"Policy_3"}']
Result schema:
root
|-- arr_col: array (nullable = true)
| |-- element: string (containsNull = true)
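Any help would be greatly appreciated. Thanks, everyone!

You can use the from_json function to extract the JSON fields. You only need to modify the value slightly first, wrapping the top-level array in an object, as follows: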
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

data = [
    ('[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]', 2767),
    ('[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]', 4235)
]

# Wrap the top-level array in an object: [{...}] becomes {"arr": [{...}]}
df = spark.createDataFrame(data).toDF(*["value", "count"])\
    .withColumn("value", f.regexp_replace(f.col("value"), "\\[\\{", "{\"arr\": [{"))\
    .withColumn("value", f.regexp_replace(f.col("value"), "\\}\\]", "}]}"))

# Infer the schema from the JSON strings themselves
json_schema = spark.read.json(df.rdd.map(lambda row: row.value)).schema

# Parse each string with the inferred schema, then expand the resulting struct
resultDF = df.select(f.from_json("value", schema=json_schema).alias("array_col"))\
    .select("array_col.*")

resultDF.printSchema()
resultDF.show(truncate=False)
If you want the nested JSON kept as strings, you can also pass a custom schema instead of the inferred one.
Output schema:
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- @ObjectID: string (nullable = true)
| | |-- @id: string (nullable = true)
Output:
+---------------------------------------------------------------+
|arr |
+---------------------------------------------------------------+
|[{Policy_1, Party_1}, {Policy_2, Party_2}, {Policy_3, Party_3}]|
|[{Policy_1, Party_1}, {Policy_2, Party_2}, {Policy_3, Party_3}]|
+---------------------------------------------------------------+
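Your "expected result" does not have the schema you are describing. The result you expect is ArrayType(MapType(StringType())). Please state more clearly what you expect.
Sorry, I have updated the result. The expected result is an array of JSON strings.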
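
Since the clarified goal is an array of JSON strings rather than an array of structs, one other way to get there is a small Python helper applied as a UDF. The following is only a sketch under assumptions, not the method from the answer above: the names split_json_array and add_arr_col are made up for illustration, the column names col1 and arr_col are taken from the schemas in the question, and malformed input is mapped to null by choice. A Python UDF is usually slower than built-in functions, so this trades speed for explicit control over the output strings.

```python
import json
from typing import List, Optional


def split_json_array(text: Optional[str]) -> Optional[List[str]]:
    """Turn a JSON array string into a list of compact JSON strings."""
    if text is None:
        return None
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: malformed input becomes null instead of failing the job
        return None
    if not isinstance(parsed, list):
        # Only a top-level JSON array can be split into elements
        return None
    # Re-serialize each element without extra whitespace, keeping key order
    return [
        json.dumps(item, separators=(",", ":"), ensure_ascii=False)
        for item in parsed
    ]


def add_arr_col(df, source_col="col1", target_col="arr_col"):
    """Hypothetical wrapper: apply split_json_array as a PySpark UDF."""
    # Imported lazily so the helper above can be used and tested without Spark
    from pyspark.sql import functions as f
    from pyspark.sql.types import ArrayType, StringType

    to_json_strings = f.udf(split_json_array, ArrayType(StringType()))
    return df.withColumn(target_col, to_json_strings(f.col(source_col)))
```

Calling add_arr_col(df) on the dataframe from the question should add an arr_col column of type array<string>, with one JSON string per object in the original array.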