Python PySpark: comparing two list columns

I have a DataFrame like the one below. Both columns are lists:
df = sc.parallelize([
    {"subject_1": ['A', 'B'],      "subject_2": ['A', 'B', 'C']},
    {"subject_1": ['A', 'C'],      "subject_2": ['A', 'B', 'C']},
    {"subject_1": ['A', 'B', 'D'], "subject_2": ['A', 'B', 'E']}
]).toDF()
df.show()
I need to transform the DataFrame as shown below, adding three new columns derived from the first two. This requires comparing the items in the two list columns. What is the best way to do this?

For Spark 2.4+, use array_intersect and array_except:
from pyspark.sql import functions as F
df.withColumn("both", F.array_intersect("subject_1","subject_2"))\
.withColumn("only_1", F.array_except("subject_1","subject_2"))\
.withColumn("only_2", F.array_except("subject_2","subject_1")).show()
#+---------+---------+------+------+------+
#|subject_1|subject_2| both|only_1|only_2|
#+---------+---------+------+------+------+
#| [A, B]|[A, B, C]|[A, B]| []| [C]|
#| [A, C]|[A, B, C]|[A, C]| []| [B]|
#|[A, B, D]|[A, B, E]|[A, B]| [D]| [E]|
#+---------+---------+------+------+------+
I'm still on version 2.3.
— Then you need to write a UDF for it, similar to that approach, except your data comes from two columns rather than from one column and a predetermined list.
— @KeerikkattuChellappan Without 2.4, a UDF is the only way. I'd suggest upgrading to 2.4 so you can use the array higher-order functions going forward.
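For Spark 2.3, a minimal sketch of the UDF approach suggested in the comments. The comparison itself is plain Python set logic (the helper names here are illustrative, not from the original answer); the PySpark wrapping is shown commented out since it needs a running SparkSession:

```python
# Plain-Python helpers implementing the same semantics as
# array_intersect / array_except (order of the first list preserved).

def list_intersect(a, b):
    """Elements of a that also appear in b."""
    s = set(b or [])
    return [x for x in (a or []) if x in s]

def list_except(a, b):
    """Elements of a that do not appear in b."""
    s = set(b or [])
    return [x for x in (a or []) if x not in s]

# Hypothetical Spark 2.3 wiring (UDF names are assumptions):
#
# from pyspark.sql import functions as F
# from pyspark.sql.types import ArrayType, StringType
#
# intersect_udf = F.udf(list_intersect, ArrayType(StringType()))
# except_udf = F.udf(list_except, ArrayType(StringType()))
#
# df.withColumn("both", intersect_udf("subject_1", "subject_2"))\
#   .withColumn("only_1", except_udf("subject_1", "subject_2"))\
#   .withColumn("only_2", except_udf("subject_2", "subject_1")).show()
```

Note that a Python UDF serializes every row through the Python worker, so on large data it will be noticeably slower than the built-in array functions in 2.4+.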