Python / PySpark: comparing two list columns


I have a dataframe like the one below. Both columns are lists:

df = sc.parallelize([
    {"subject_1": ['A', 'B'], "subject_2": ['A', 'B', 'C']},
    {"subject_1": ['A', 'C'], "subject_2": ['A', 'B', 'C']},
    {"subject_1": ['A', 'B', 'D'], "subject_2": ['A', 'B', 'E']}
]).toDF()
df.show()

I need to transform the dataframe by adding three new columns derived from the first two. This requires comparing the items in the two list columns.


What is the best way to do this?

For Spark 2.4+, use `array_intersect` and `array_except`:

from pyspark.sql import functions as F

df.withColumn("both", F.array_intersect("subject_1","subject_2"))\
  .withColumn("only_1", F.array_except("subject_1","subject_2"))\
  .withColumn("only_2", F.array_except("subject_2","subject_1")).show()

#+---------+---------+------+------+------+
#|subject_1|subject_2|  both|only_1|only_2|
#+---------+---------+------+------+------+
#|   [A, B]|[A, B, C]|[A, B]|    []|   [C]|
#|   [A, C]|[A, B, C]|[A, C]|    []|   [B]|
#|[A, B, D]|[A, B, E]|[A, B]|   [D]|   [E]|
#+---------+---------+------+------+------+

I'm still on version 2.3. — You'd need to write a UDF for it; similar to that approach, but your data comes from two columns rather than one column and a predetermined list. — @KeerikkattuChellappan Without 2.4, a UDF is the only way out. I'd suggest upgrading to 2.4 so you can use the array higher-order functions going forward.
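For anyone stuck on Spark 2.3, which lacks `array_intersect`/`array_except`, here is a minimal sketch of the UDF workaround mentioned in the comments. The helper names `intersect_lists` and `except_lists` are my own, and the import guard is only there so the plain-Python list logic can be tried even where PySpark isn't installed:

```python
try:
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType
    HAVE_PYSPARK = True
except ImportError:
    # PySpark not available; the helpers below still work as plain Python.
    HAVE_PYSPARK = False


def intersect_lists(a, b):
    """Elements of a that also appear in b, preserving a's order."""
    b_set = set(b or [])
    return [x for x in (a or []) if x in b_set]


def except_lists(a, b):
    """Elements of a that do not appear in b, preserving a's order."""
    b_set = set(b or [])
    return [x for x in (a or []) if x not in b_set]


if HAVE_PYSPARK:
    both_udf = F.udf(intersect_lists, ArrayType(StringType()))
    except_udf = F.udf(except_lists, ArrayType(StringType()))
    # Same three derived columns as the Spark 2.4+ answer, assuming df
    # is the dataframe from the question:
    # df.withColumn("both", both_udf("subject_1", "subject_2")) \
    #   .withColumn("only_1", except_udf("subject_1", "subject_2")) \
    #   .withColumn("only_2", except_udf("subject_2", "subject_1")) \
    #   .show()
```

One behavioral difference to be aware of: the built-in `array_intersect` and `array_except` return results without duplicates, while this sketch keeps any duplicates from the first list.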