How to split one record into multiple records in Spark with Python?

I am trying to split records that contain nested data into multiple records:
df = spark.createDataFrame([('1','[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),('2','[{price:100, quantity:1}]')],['id','data'])
The input data looks like:
id,data
1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,[{price:100, quantity:1}]
If the data array column contains more than 5 records, it should be split into multiple rows, and each row should be given an id2:
id,id2,data
1,1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1}]
1,2,[{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,1,[{price:100, quantity:1}]
I tried exploding the array column, but explode produces one new row per element, i.e. for id 1 I get 8 rows instead of 2.
How can I split the data so that each row contains at most 5 records from the array?

For Spark 2.4+, you can use the Spark SQL built-in functions transform, sequence, slice and inline_outer, with some arithmetic on the array indices:
from pyspark.sql import functions as F
df = spark.createDataFrame([('1','[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),('2','[{price:100, quantity:1}]')],['id','data'])
N = 5
# for data column, convert String into array of structs
df1 = df.withColumn("data", F.from_json("data", "array<struct<price:int,quantity:int>>",{"allowUnquotedFieldNames":"True"}))
df1.selectExpr("id", f"""
  inline_outer(
    transform(
      sequence(1, ceil(size(data)/{N})), i ->
        (i as id2, slice(data, (i-1)*{N}+1, {N}) as data)
    )
  )
""").show(truncate=False)
+---+---+--------------------------------------------------+
|id |id2|data |
+---+---+--------------------------------------------------+
|1 |1 |[[100, 1], [200, 2], [900, 3], [500, 5], [100, 1]]|
|1 |2 |[[800, 8], [700, 7], [600, 6]] |
|2 |1 |[[100, 1]] |
+---+---+--------------------------------------------------+
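The index arithmetic above is easy to get wrong because SQL's slice is 1-based while Python slicing is 0-based. As a sanity check, the chunking performed by sequence(1, ceil(size(data)/N)) plus slice(data, (i-1)*N+1, N) can be mirrored in plain Python (the function name here is illustrative, not part of the answer):

```python
import math

def chunk_like_sql(data, n):
    # Mirror sequence(1, ceil(size(data)/n)) + slice(data, (i-1)*n+1, n).
    # SQL slice starts at the 1-based position (i-1)*n+1, which is the
    # 0-based offset (i-1)*n in Python.
    groups = math.ceil(len(data) / n)
    return [(i, data[(i - 1) * n:(i - 1) * n + n]) for i in range(1, groups + 1)]

# 8 elements split into chunks of 5 -> (id2=1, first 5) and (id2=2, last 3)
print(chunk_like_sql([100, 200, 900, 500, 100, 800, 700, 600], 5))
```

This reproduces the grouping shown in the output table: id 1 (8 elements) becomes two rows with id2 = 1 and 2, and id 2 (1 element) stays a single row.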
Comments:

I can think of one way: explode first, then use a rank function. It will rank within the same id (hopefully giving an incrementing id2 per id); then divide the rank by 5 and group by that. I will definitely try this, but I hope there is a simpler way.
@shaileshgupta A flatMap should work fine; you can yield a list of rows when there are more records.
Are you looking for a StringType column or an array-of-structs column?
It is an array of structs.
@jxc Never used these functions before; this gives the expected output. Thanks :)
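The flatMap approach suggested in the comments can be sketched with a plain chunking function; the function itself needs no Spark, and the rdd.flatMap usage shown in the comment is an assumption based on that suggestion, not code from the answer (split_row is a hypothetical name):

```python
def split_row(row, n=5):
    # Turn one (id, data) row into one row per chunk of at most n elements,
    # numbering the chunks 1, 2, ... as id2.
    rid, data = row
    return [(rid, i + 1, data[i * n:(i + 1) * n])
            for i in range((len(data) + n - 1) // n)]

# With Spark this would be applied roughly as:
#   df1.rdd.map(lambda r: (r['id'], r['data'])).flatMap(split_row)
# Pure-Python check of the chunking itself:
print(split_row(('1', [1, 2, 3, 4, 5, 6, 7, 8])))
```

Compared with the SQL expression above, this trades the built-in higher-order functions for an RDD round-trip, so the SQL version is generally preferable on DataFrames.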