Python: how to split one record into multiple in Spark?


I am trying to split records that contain nested data into multiple records.

df = spark.createDataFrame([('1','[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),('2','[{price:100, quantity:1}]')],['id','data'])
The input data looks like:

id,data
1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,[{price:100, quantity:1}]
If the array column contains more than 5 records, it should be split up, and each resulting row should get an id2:

id,id2,data
1,1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1}]
1,2,[{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,1,[{price:100, quantity:1}]
I tried exploding the array column, but that produces a new row per element, i.e. for id 1 I get 8 rows instead of 2.


How can I do the explode so that each row contains at most 5 records in the array?

For Spark 2.4+, you can use the Spark SQL built-in functions transform, slice and inline_outer, plus a little arithmetic on the array indices:

from pyspark.sql import functions as F

df = spark.createDataFrame([('1','[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),('2','[{price:100, quantity:1}]')],['id','data'])

N = 5

# for data column, convert String into array of structs
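# allowUnquotedFieldNames lets from_json accept the unquoted keys (price, quantity) in the strings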
df1 = df.withColumn("data", F.from_json("data", "array<struct<price:int,quantity:int>>",{"allowUnquotedFieldNames":"True"}))

df1.selectExpr("id", f"""
    inline_outer(
      transform(
        sequence(1,ceil(size(data)/{N})), i ->
        (i as id2, slice(data,(i-1)*{N}+1,{N}) as data)
      )
    )
 """).show(truncate=False)
+---+---+--------------------------------------------------+
|id |id2|data                                              |
+---+---+--------------------------------------------------+
|1  |1  |[[100, 1], [200, 2], [900, 3], [500, 5], [100, 1]]|
|1  |2  |[[800, 8], [700, 7], [600, 6]]                    |
|2  |1  |[[100, 1]]                                        |
+---+---+--------------------------------------------------+
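To see how the index arithmetic works: for id 1 the array holds 8 structs, so sequence(1, ceil(8/5)) yields [1, 2], and the two slices start at positions 1 and 6. If you prefer the DataFrame API over selectExpr, roughly the same logic can be written as below; this is just a sketch (not part of the answer above), reusing df1 and N = 5 from the snippet above:

from pyspark.sql import functions as F

# one row per chunk index i, then slice out the i-th chunk of at most N structs
df2 = (df1
    .select("id", "data",
            F.explode(F.sequence(F.lit(1), F.ceil(F.size("data") / N).cast("int"))).alias("i"))
    .selectExpr("id", "i as id2", f"slice(data, (i-1)*{N}+1, {N}) as data"))

df2.show(truncate=False)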

I can think of one approach: explode first, then use a rank function. It will give a rank within the same id (hopefully it increments id2 for the same id); then divide that rank by 5 and group by the result.
I will definitely try this; I was hoping there was a simpler way @shaileshgupta
A flatMap should also work fine; you can yield a list of Rows when there are more records.
Are you looking for a StringType column or an array of structs column?
It is an array of structs @JxC
Never used this before; this gives the expected output. Thanks :)
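For reference, a minimal sketch of the flatMap idea mentioned in the comments, assuming the data column has already been parsed into an array of structs as in df1 above (split_row is a hypothetical helper name, not from the original thread):

from pyspark.sql import Row

N = 5  # same chunk size as above

def split_row(row):
    # cut the parsed array into chunks of at most N structs, numbering the chunks from 1
    chunks = [row.data[i:i + N] for i in range(0, len(row.data), N)]
    for idx, chunk in enumerate(chunks, start=1):
        yield Row(id=row.id, id2=idx, data=chunk)

result = spark.createDataFrame(df1.rdd.flatMap(split_row))
result.show(truncate=False)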