Python: flatten a DataFrame with an array column
Suppose I have a PySpark DataFrame whose df.printSchema() output is:
root
|-- shop_id: int (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- item_id: int (nullable = false)
How do I transform it into:
root
|-- shop_id: int (nullable = false)
|-- item_id: int (nullable = false)
In other words, within each record the shop_id is "attached" to every item_id, and the resulting pairs are flattened into a single stream of rows.
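A minimal plain-Python sketch of that reshaping (ordinary lists and dicts, no Spark; the variable names are mine), using the same example records:

rows = [
    {"shop_id": 42, "items": [{"item_id": 101}, {"item_id": 102}]},
    {"shop_id": 43, "items": [{"item_id": 203}]},
]

# Pair each shop_id with every item_id in its 'items' array.
flat = [
    {"shop_id": row["shop_id"], "item_id": item["item_id"]}
    for row in rows
    for item in row["items"]
]
# flat == [{'shop_id': 42, 'item_id': 101},
#          {'shop_id': 42, 'item_id': 102},
#          {'shop_id': 43, 'item_id': 203}]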
A more visual explanation:
Before
[
{
"shop_id":42,
"items":[{"item_id":101}, {"item_id":102}]
},
{
"shop_id":43,
"items":[{"item_id":203}]
}
]
df.printSchema()
root
|-- shop_id: integer (nullable = true)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
After
[
    {"shop_id": 42, "item_id": 101},
    {"shop_id": 42, "item_id": 102},
    {"shop_id": 43, "item_id": 203}
]
tl;dr
from pyspark.sql import functions as F

df.select('shop_id', F.explode('items.item_id').alias('item_id'))
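This works because selecting items.item_id from an array of structs yields an array of the item_id values, which F.explode then unrolls into one row per element. An equivalent two-step formulation (my sketch, not part of the original answer) makes those mechanics explicit:

from pyspark.sql import functions as F

# Step 1: one row per struct in the 'items' array;
# step 2: pull the nested field out of each struct.
df.select('shop_id', F.explode('items').alias('item')) \
  .select('shop_id', 'item.item_id')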
Test
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

schema = StructType([
    StructField('shop_id', IntegerType()),
    StructField('items', ArrayType(
        StructType([
            StructField('item_id', IntegerType()),
        ])
    ))
])
data = [
    {
        "shop_id": 42,
        "items": [{"item_id": 101}, {"item_id": 102}]
    },
    {
        "shop_id": 43,
        "items": [{"item_id": 203}]
    }
]

df = spark_session.createDataFrame(data, schema)  # spark_session: an existing SparkSession
Before
df.printSchema()
root
|-- shop_id: integer (nullable = true)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
After

result = df.select('shop_id', F.explode('items.item_id').alias('item_id'))
result.printSchema()
root
|-- shop_id: integer (nullable = true)
|-- item_id: integer (nullable = true)

result.show()
+-------+-------+
|shop_id|item_id|
+-------+-------+
|     42|    101|
|     42|    102|
|     43|    203|
+-------+-------+
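One caveat (my addition, not from the answer above): F.explode silently drops rows whose array is empty or null. If shops without items should survive the flattening with a null item_id, F.explode_outer (available since Spark 2.2) is a drop-in replacement:

# Keeps shops whose 'items' array is empty or null, emitting a null item_id.
df.select('shop_id', F.explode_outer('items.item_id').alias('item_id'))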