Python: Flatten a DataFrame with an array column

Suppose I have a PySpark DataFrame whose df.printSchema() output is:

root
 |-- shop_id: int (nullable = false)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: int (nullable = false)
How can I transform it into:

root
 |-- shop_id: int (nullable = false)
 |-- item_id: int (nullable = false)
In other words, within each record, shop_id is "attached" to each item_id, and the resulting pairs are emitted as a single flat stream.
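
The pairing described here can be sketched in plain Python, without Spark. This is only a model of the intended result, assuming two small sample records; the names `records` and `pairs` are illustrative and not part of the original question.

```python
# Plain-Python model of the desired flattening (no Spark involved).
# The sample records below are assumed for illustration.
records = [
    {"shop_id": 42, "items": [{"item_id": 101}, {"item_id": 102}]},
    {"shop_id": 43, "items": [{"item_id": 203}]},
]

# Attach each record's shop_id to every item_id it contains,
# producing one flat sequence of (shop_id, item_id) pairs.
pairs = [
    (record["shop_id"], item["item_id"])
    for record in records
    for item in record["items"]
]

print(pairs)  # [(42, 101), (42, 102), (43, 203)]
```

Each input record contributes as many output pairs as it has items, which is the row-multiplying behavior the question asks for.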

A more visual explanation:

Before

[
   {
      "shop_id":42,
      "items":[{"item_id":101}, {"item_id":102}]
   },
   {
      "shop_id":43,
      "items":[{"item_id":203}]
   }
]
df.printSchema()

root
 |-- shop_id: integer (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_id: integer (nullable = true)
After

tl;dr

import pyspark.sql.functions as F
df.select('shop_id', F.explode('items.item_id').alias('item_id'))
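
The one-liner does two things: the column path 'items.item_id' turns the array of structs into an array of integers, and explode then emits one row per array element. A point worth knowing is that explode produces no rows for a null or empty array, whereas explode_outer keeps such rows with a null value. The following is a plain-Python model of that behavior, not PySpark code; the helper names and the extra shop 44 row are hypothetical and exist only for illustration.

```python
from typing import Iterator, List, Optional, Tuple


def project_item_ids(row: dict) -> Optional[List[int]]:
    """Model of the column path 'items.item_id': struct array -> int array."""
    items = row["items"]
    return None if items is None else [item["item_id"] for item in items]


def explode(rows: List[dict]) -> Iterator[Tuple[int, Optional[int]]]:
    """Model of explode: one output row per element; null/empty arrays yield nothing."""
    for row in rows:
        for item_id in project_item_ids(row) or []:
            yield (row["shop_id"], item_id)


def explode_outer(rows: List[dict]) -> Iterator[Tuple[int, Optional[int]]]:
    """Model of explode_outer: null/empty arrays yield a single row with None."""
    for row in rows:
        item_ids = project_item_ids(row)
        if not item_ids:
            yield (row["shop_id"], None)
        else:
            for item_id in item_ids:
                yield (row["shop_id"], item_id)


rows = [
    {"shop_id": 42, "items": [{"item_id": 101}, {"item_id": 102}]},
    {"shop_id": 43, "items": [{"item_id": 203}]},
    {"shop_id": 44, "items": None},  # hypothetical row, not in the original data
]

flat = list(explode(rows))              # shop 44 disappears
flat_outer = list(explode_outer(rows))  # shop 44 is kept as (44, None)
```

If shops without items must survive the flattening, swapping F.explode for F.explode_outer in the select is the usual adjustment.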
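Test

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

# Schema: each shop has an array of structs, each holding one item_id
schema = StructType([
    StructField('shop_id', IntegerType()),
    StructField('items', ArrayType(
        StructType([
            StructField('item_id', IntegerType()),
        ])
    )),
])

data = [
    {
        "shop_id": 42,
        "items": [{"item_id": 101}, {"item_id": 102}]
    },
    {
        "shop_id": 43,
        "items": [{"item_id": 203}]
    }
]

# spark_session is assumed to be an existing SparkSession
df = spark_session.createDataFrame(data, schema)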