Python: flatten a DataFrame with an array column
Suppose I have a PySpark DataFrame whose df.printSchema() output is:
root
|-- shop_id: int (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- item_id: int (nullable = false)
How do I transform it into:
root
|-- shop_id: int (nullable = false)
|-- item_id: int (nullable = false)
In other words, within each record the shop_id is "attached" to every item_id, and the resulting pairs are flattened into a single stream of rows.
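A minimal plain-Python sketch of that reshaping (ordinary lists and dicts, no Spark; the variable names are mine), using the same example records:

rows = [
    {"shop_id": 42, "items": [{"item_id": 101}, {"item_id": 102}]},
    {"shop_id": 43, "items": [{"item_id": 203}]},
]

# Pair each shop_id with every item_id in its 'items' array.
flat = [
    {"shop_id": row["shop_id"], "item_id": item["item_id"]}
    for row in rows
    for item in row["items"]
]
# flat == [{'shop_id': 42, 'item_id': 101},
#          {'shop_id': 42, 'item_id': 102},
#          {'shop_id': 43, 'item_id': 203}]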
A more visual explanation:
Before
[
{
"shop_id":42,
"items":[{"item_id":101}, {"item_id":102}]
},
{
"shop_id":43,
"items":[{"item_id":203}]
}
]
df.printSchema()
root
|-- shop_id: integer (nullable = true)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
After
[
    {"shop_id": 42, "item_id": 101},
    {"shop_id": 42, "item_id": 102},
    {"shop_id": 43, "item_id": 203}
]
tl;dr
from pyspark.sql import functions as F

df.select('shop_id', F.explode('items.item_id').alias('item_id'))
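This works because selecting items.item_id from an array of structs yields an array of the item_id values, which F.explode then unrolls into one row per element. An equivalent two-step formulation (my sketch, not part of the original answer) makes those mechanics explicit:

from pyspark.sql import functions as F

# Step 1: one row per struct in the 'items' array;
# step 2: pull the nested field out of each struct.
df.select('shop_id', F.explode('items').alias('item')) \
  .select('shop_id', 'item.item_id')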
Test
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

schema = StructType([
    StructField('shop_id', IntegerType()),
    StructField('items', ArrayType(
        StructType([
            StructField('item_id', IntegerType()),
        ])
    ))
])
data = [
    {
        "shop_id": 42,
        "items": [{"item_id": 101}, {"item_id": 102}]
    },
    {
        "shop_id": 43,
        "items": [{"item_id": 203}]
    }
]

df = spark_session.createDataFrame(data, schema)  # spark_session: an existing SparkSession
Before
df.printSchema()
root
|-- shop_id: integer (nullable = true)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
After

result = df.select('shop_id', F.explode('items.item_id').alias('item_id'))
result.printSchema()
root
|-- shop_id: integer (nullable = true)
|-- item_id: integer (nullable = true)

result.show()
+-------+-------+
|shop_id|item_id|
+-------+-------+
|     42|    101|
|     42|    102|
|     43|    203|
+-------+-------+
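One caveat (my addition, not from the answer above): F.explode silently drops rows whose array is empty or null. If shops without items should survive the flattening with a null item_id, F.explode_outer (available since Spark 2.2) is a drop-in replacement:

# Keeps shops whose 'items' array is empty or null, emitting a null item_id.
df.select('shop_id', F.explode_outer('items.item_id').alias('item_id'))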