Python: PySpark to_json loses the column names of structs inside an array
I am trying to generate a JSON string from a nested PySpark DataFrame, but the key names are lost. My initial dataset looks like this:
from pyspark.sql import functions as F

# `spark` (SparkSession) and `sc` (SparkContext) are assumed to exist already.
data = [
    {"foo": [1, 2], "bar": [4, 5], "buzz": [7, 8]},
    {"foo": [1], "bar": [4], "buzz": [7]},
    {"foo": [1, 2, 3], "bar": [4, 5, 6], "buzz": [7, 8, 9]},
]
df = spark.read.json(sc.parallelize(data))
df.show()
## +---------+---------+---------+
## |      bar|     buzz|      foo|
## +---------+---------+---------+
## |   [4, 5]|   [7, 8]|   [1, 2]|
## |      [4]|      [7]|      [1]|
## |[4, 5, 6]|[7, 8, 9]|[1, 2, 3]|
## +---------+---------+---------+
I then zip the columns together with arrays_zip:
df_zipped = (
    df
    .withColumn(
        "zipped",
        F.arrays_zip(
            F.col("foo"),
            F.col("bar"),
            F.col("buzz"),
        )
    )
)
df_zipped.printSchema()
root
 |-- bar: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- buzz: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- foo: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- zipped: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- foo: long (nullable = true)
 |    |    |-- bar: long (nullable = true)
 |    |    |-- buzz: long (nullable = true)
The problem appears when I call to_json on the zipped array. It loses the foo, bar, and buzz key names and writes the element indices as keys instead:
(
    df_zipped
    .withColumn("zipped", F.to_json("zipped"))
    .select("zipped")
    .show(truncate=False)
)
+-------------------------------------------------------------+
|zipped                                                       |
+-------------------------------------------------------------+
|[{"0":1,"1":4,"2":7},{"0":2,"1":5,"2":8}]                    |
|[{"0":1,"1":4,"2":7}]                                        |
|[{"0":1,"1":4,"2":7},{"0":2,"1":5,"2":8},{"0":3,"1":6,"2":9}]|
+-------------------------------------------------------------+
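How can I keep "bar", "buzz", and "foo" instead of 0, 1, 2?
This is not a very pretty answer, since you have to specify the keys explicitly, but it is better than [...]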
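Specifying the schema manually also works: the foo, bar, and buzz names must be attached to the elements at the top level of the array, not to the actual data fields themselves.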
data = [
    {"foo": [1, 2], "bar": [4, 5], "buzz": [7, 8]},
    {"foo": [1], "bar": [4], "buzz": [7]},
    {"foo": [1, 2, 3], "bar": [4, 5, 6], "buzz": [7, 8, 9]},
]
df = spark.read.json(sc.parallelize(data))
df.show()
+---------+---------+---------+
|      bar|     buzz|      foo|
+---------+---------+---------+
|   [4, 5]|   [7, 8]|   [1, 2]|
|      [4]|      [7]|      [1]|
|[4, 5, 6]|[7, 8, 9]|[1, 2, 3]|
+---------+---------+---------+
Then define the schema manually and cast to it:
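from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

# Struct layout for each element of the zipped array, with explicit field names.
schema = StructType([
    StructField("foo", IntegerType()),
    StructField("bar", IntegerType()),
    StructField("buzz", IntegerType()),
])
df_zipped = (
    df  # the DataFrame created above
    .select(
        F.arrays_zip(
            F.col("foo"),
            F.col("bar"),
            F.col("buzz"),
        )
        .alias("zipped")
    )
    .filter(F.col("zipped").isNotNull())
    # Casting to the named struct schema attaches the field names to the elements.
    .select(F.col("zipped").cast(ArrayType(schema)))
)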
This produces the desired result:
(
    df_zipped
    .withColumn("zipped", F.to_json("zipped"))
    .select("zipped")
    .show(truncate=False)
)
+----------------------------------------------------------------------------------+
|zipped                                                                            |
+----------------------------------------------------------------------------------+
|[{"foo":1,"bar":4,"buzz":7},{"foo":2,"bar":5,"buzz":8}]                           |
|[{"foo":1,"bar":4,"buzz":7}]                                                      |
|[{"foo":1,"bar":4,"buzz":7},{"foo":2,"bar":5,"buzz":8},{"foo":3,"bar":6,"buzz":9}]|
+----------------------------------------------------------------------------------+
Note: casting to LongType in the schema does not work, but you can use transform to build the string manually, with something like F.expr("transform(zipped, x -> concat('{foo:', x['foo'], ',bar:', x['bar'], ',buzz:', x['buzz'], '}'))").
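The transform idea in the note above could be sketched roughly as follows. This is only an illustration under stated assumptions: build_json_transform_expr is a hypothetical helper (not part of PySpark), it assumes the struct fields can be addressed as x['foo'] and so on, as the note does, and it assumes numeric values that need no quoting or escaping. Unlike the expression in the note, it wraps each key in double quotes so the pieces look like JSON objects.

```python
def build_json_transform_expr(array_col, field_names):
    """Build a Spark SQL expression string (hypothetical helper).

    The expression maps every struct in `array_col` to a JSON-like object
    string via transform(...) and concat(...).
    """
    if not field_names:
        raise ValueError("field_names must not be empty")
    parts = []
    for i, name in enumerate(field_names):
        prefix = "{" if i == 0 else ","
        # SQL string literal such as '{"foo":' placed before the field value.
        parts.append("'" + prefix + '"' + name + '":' + "'")
        # Field access on the lambda variable, following the note's x['foo'] form.
        parts.append("x['" + name + "']")
    parts.append("'}'")
    return "transform(" + array_col + ", x -> concat(" + ", ".join(parts) + "))"


expr = build_json_transform_expr("zipped", ["foo", "bar", "buzz"])
print(expr)
# Intended use (assumes an existing DataFrame with a `zipped` column):
#   df_zipped.withColumn("zipped", F.expr(expr))
```

The resulting column would be an array of strings rather than a single JSON string, so a further step such as joining the pieces would still be needed. Keep in mind that concat yields null when any argument is null, so rows with missing values need separate handling.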
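Yes, thank you for the answer. This works well. I was hoping for a way that does not require defining the keys/schema manually, but that may not be possible.
You could build the expression dynamically by querying the schema for the key names. It is complicated, and I am not sure it is worth the effort.
In any case, the keys can be specified in the array: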
(
    df_zipped
    .withColumn("zipped", F.to_json("zipped"))
    .select("zipped")
    .show(truncate=False)
)
+----------------------------------------------------------------------------------+
|zipped                                                                            |
+----------------------------------------------------------------------------------+
|[{"foo":1,"bar":4,"buzz":7},{"foo":2,"bar":5,"buzz":8}]                           |
|[{"foo":1,"bar":4,"buzz":7}]                                                      |
|[{"foo":1,"bar":4,"buzz":7},{"foo":2,"bar":5,"buzz":8},{"foo":3,"bar":6,"buzz":9}]|
+----------------------------------------------------------------------------------+
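One way to avoid typing the schema by hand, in the spirit of the comment about building things dynamically from the key names, might be to generate the cast target from a list of column names. The sketch below is an assumption-laden illustration: build_zipped_cast_type is a hypothetical helper, it assumes the names are plain identifiers, and it defaults to int because the answer above casts to IntegerType.

```python
def build_zipped_cast_type(field_names, element_type="int"):
    """Return a type string such as array<struct<foo:int,bar:int>>.

    Hypothetical helper: the string is meant to be passed to Column.cast,
    which accepts a data type given as a string.
    """
    if not field_names:
        raise ValueError("field_names must not be empty")
    # One name:type pair per zipped column, in the same order as arrays_zip.
    fields = ",".join(name + ":" + element_type for name in field_names)
    return "array<struct<" + fields + ">>"


names = ["foo", "bar", "buzz"]
cast_type = build_zipped_cast_type(names)
print(cast_type)
# Intended use (assumes `df` and `F` from the snippets above):
#   zipped = F.arrays_zip(*[F.col(n) for n in names]).cast(cast_type)
#   df.select(zipped.alias("zipped"))
```

The list of names could come from df.columns or from whatever list is already used to call arrays_zip, so the keys are written down only once.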