Python 3.x: How do I create a JSON list with PySpark?
I am trying to use PySpark to create a JSON file with the following structure. Target output:
[{
    "Loaded_data": [{
        "Loaded_numeric_columns": ["id", "val"],
        "Loaded_category_columns": ["name", "branch"]
    }],
    "enriched_data": [{
        "enriched_category_columns": ["country__4"],
        "enriched_index_columns": ["id__1", "val__3"]
    }]
}]
I can create the list for each section. Please refer to the code below. I am a bit stuck here; could you help me?
Sample data:
You only need to use struct and array to create the new column types:
from pyspark.sql import functions as F
df.show()
+---+-----+-------+------+----------+-----+-------+
| id|  val|   name|branch|country__4|id__1| val__3|
+---+-----+-------+------+----------+-----+-------+
|  1|67.87|Shankar|     a|         1|67.87|Shankar|
+---+-----+-------+------+----------+-----+-------+
# Wrap each group of columns in a struct; every array holds that row's values.
df.select(
    F.struct(
        F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
        F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
    ).alias("Loaded_data"),
    F.struct(
        F.array(F.col("country__4")).alias("enriched_category_columns"),
        F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
    ).alias("enriched_data"),
).printSchema()
root
 |-- Loaded_data: struct (nullable = false)
 |    |-- Loaded_numeric_columns: array (nullable = false)
 |    |    |-- element: double (containsNull = true)
 |    |-- Loaded_category_columns: array (nullable = false)
 |    |    |-- element: string (containsNull = true)
 |-- enriched_data: struct (nullable = false)
 |    |-- enriched_category_columns: array (nullable = false)
 |    |    |-- element: long (containsNull = true)
 |    |-- enriched_index_columns: array (nullable = false)
 |    |    |-- element: string (containsNull = true)
Can you output the required JSON file using the sample data?
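The target output lists column names rather than row values, so one possible approach is to assemble the nested structure in plain Python and serialise it with json.dumps. The following is only a minimal sketch under some assumptions: build_column_report and NUMERIC_TYPES are hypothetical names, the (name, type) pairs are a hard-coded stand-in for what df.dtypes would return for the loaded columns, and the split of the enriched columns into category and index groups is supplied by hand because the thread does not state the rule.

```python
import json

# Spark SQL type names treated as numeric here (an assumption; decimal types
# such as "decimal(10,2)" are not covered and would need extra handling).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def build_column_report(loaded_dtypes, enriched_category, enriched_index):
    """Build the nested list-of-dicts structure from column names.

    loaded_dtypes: (name, type) pairs in the shape returned by df.dtypes.
    enriched_category / enriched_index: column names grouped by the caller.
    """
    numeric = [name for name, dtype in loaded_dtypes if dtype in NUMERIC_TYPES]
    category = [name for name, dtype in loaded_dtypes if dtype not in NUMERIC_TYPES]
    return [{
        "Loaded_data": [{
            "Loaded_numeric_columns": numeric,
            "Loaded_category_columns": category,
        }],
        "enriched_data": [{
            "enriched_category_columns": list(enriched_category),
            "enriched_index_columns": list(enriched_index),
        }],
    }]


# Hard-coded stand-in for df.dtypes on the loaded columns (assumed types).
loaded_dtypes = [
    ("id", "bigint"),
    ("val", "double"),
    ("name", "string"),
    ("branch", "string"),
]

report = build_column_report(loaded_dtypes, ["country__4"], ["id__1", "val__3"])
report_json = json.dumps(report, indent=2)
print(report_json)
```

With a real DataFrame, the hard-coded list could be replaced by the relevant entries of df.dtypes, and the resulting string written to a file with the usual open(...).write(...).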