Apache Spark: mapping columns from arrays in PySpark
I'm not too familiar with PySpark dataframes when arrays are stored in columns, and I'd like some help mapping columns based on two PySpark dataframes, one of which is a reference df.

Reference dataframe (each group has a different number of subgroups): not shown here.

Source dataframe:
| ID     | Size     | Type |
| ------ | -------- | ---- |
| ID_001 | 'Small'  | 'A'  |
| ID_002 | 'Medium' | 'B'  |
| ID_003 | 'Small'  | 'D'  |
In the result, each ID may fall into each group, but is exclusive to a single subgroup within it according to the reference df. The result would look like this:
| ID     | Size     | Type | A_Subgroup | B_Subgroup |
| ------ | -------- | ---- | ---------- | ---------- |
| ID_001 | 'Small'  | 'A'  | 'A1'       | 'B1'       |
| ID_002 | 'Medium' | 'B'  | 'A1'       | Null       |
| ID_003 | 'Small'  | 'D'  | 'A2'       | 'B1'       |
You can join on an `array_contains` condition and then pivot the result.
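Since the reference dataframe isn't reproduced above, here is a minimal setup sketch. The `ref` rows below are assumptions: the Group/Subgroup names and the array contents were chosen only to be consistent with the expected output tables.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Source dataframe, as shown in the question
source = spark.createDataFrame(
    [('ID_001', 'Small', 'A'),
     ('ID_002', 'Medium', 'B'),
     ('ID_003', 'Small', 'D')],
    ['ID', 'Size', 'Type']
)

# Hypothetical reference dataframe: one row per (Group, Subgroup),
# with array columns listing the Size and Type values that subgroup accepts.
# These rows are assumptions consistent with the expected output above.
ref = spark.createDataFrame(
    [('A', 'A1', ['Small', 'Medium'], ['A', 'B']),
     ('A', 'A2', ['Small'], ['D']),
     ('B', 'B1', ['Small'], ['A', 'D'])],
    ['Group', 'Subgroup', 'Size', 'Type']
)
```

With those two dataframes in place, the join and pivot: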
```python
import pyspark.sql.functions as F

# Left-join each source row to every (Group, Subgroup) whose Size and Type
# arrays contain that row's values, then pivot Group into columns,
# taking the (single) matching Subgroup as the cell value.
result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'
).groupBy(
    'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))

result.show()
```
```
+------+------+----+---+----+
|    ID|  Size|Type|  A|   B|
+------+------+----+---+----+
|ID_003| Small|   D| A2|  B1|
|ID_002|Medium|   B| A1|null|
|ID_001| Small|   A| A1|  B1|
+------+------+----+---+----+
```
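Note that the pivot names its columns after the group values (`A`, `B`). If you want the `A_Subgroup`/`B_Subgroup` names from the question, one option is to rename them afterwards:

```python
# Rename the pivoted columns to match the desired output schema
result = result.withColumnRenamed('A', 'A_Subgroup') \
               .withColumnRenamed('B', 'B_Subgroup')
```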