Create an indicator array based on another dataframe's column values in PySpark


I have two dataframes:
df1

+---+-----------------+
|id1|           items1|
+---+-----------------+
|  0|     [B, C, D, E]|
|  1|        [E, A, C]|
|  2|     [F, A, E, B]|
|  3|        [E, G, A]|
|  4|  [A, C, E, B, D]|
+---+-----------------+ 
df2

+---+-----------------+
|id2|           items2|
+---+-----------------+
|001|           [A, C]|
|002|              [D]|
|003|        [E, A, B]|
|004|        [B, D, C]|
|005|           [F, B]|
|006|           [G, E]|
+---+-----------------+ 
I want to create an indicator vector (in a new column result_array of df1) based on the values in items2. The vector should have the same length as the number of rows in df2 (in this example it should have 6 elements). Each element should be 1.0 if the row of items1 contains all of the elements of the corresponding row of items2, and 0.0 otherwise. The result should look like this:

+---+-----------------+-------------------------+
|id1|           items1|             result_array|
+---+-----------------+-------------------------+
|  0|     [B, C, D, E]|[0.0,1.0,0.0,1.0,0.0,0.0]|
|  1|        [E, A, C]|[1.0,0.0,0.0,0.0,0.0,0.0]|
|  2|     [F, A, E, B]|[0.0,0.0,1.0,0.0,1.0,0.0]|
|  3|        [E, G, A]|[0.0,0.0,0.0,0.0,0.0,1.0]|
|  4|  [A, C, E, B, D]|[1.0,1.0,1.0,1.0,0.0,0.0]|
+---+-----------------+-------------------------+
For example, in row 0 the second value is 1.0 because [D] is a subset of [B, C, D, E], and the fourth value is 1.0 because [B, D, C] is a subset of [B, C, D, E]. All the other item groups in df2 are not subsets of [B, C, D, E], so their indicator values are 0.0.
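To make the per-row rule concrete, it is just a set-containment test; in plain Python the check for row 0 would look roughly like this (the variable names are only illustrative):

items1_row0 = ['B', 'C', 'D', 'E']
items2_groups = [['A', 'C'], ['D'], ['E', 'A', 'B'], ['B', 'D', 'C'], ['F', 'B'], ['G', 'E']]

# 1.0 when the items2 group is fully contained in items1, otherwise 0.0
indicator = [1.0 if set(g).issubset(set(items1_row0)) else 0.0 for g in items2_groups]
# -> [0.0, 1.0, 0.0, 1.0, 0.0, 0.0]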


I tried using collect() to create a list of all the item groups in items2 and then applying a udf, but my data is too large (more than 10 million rows).
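That collect()-based attempt would look roughly like the sketch below (not the asker's exact code; groups and indicator_udf are illustrative names). It pulls the whole items2 column onto the driver and closes over it in the udf, which does not scale well when the data is large:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

# collect every items2 group onto the driver -- expensive for a very large df2
groups = [row['items2'] for row in df2.orderBy('id2').select('items2').collect()]

indicator_udf = F.udf(
    lambda items: [1.0 if set(g).issubset(set(items)) else 0.0 for g in groups],
    ArrayType(FloatType()))

df1.withColumn('result_array', indicator_udf('items1')).show(truncate=False)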

You can proceed like this:

import pyspark.sql.functions as F
from pyspark.sql.types import *

df1 = sql.createDataFrame([
    (0, ['B', 'C', 'D', 'E']),
    (1, ['E', 'A', 'C']),
    (2, ['F', 'A', 'E', 'B']),
    (3, ['E', 'G', 'A']),
    (4, ['A', 'C', 'E', 'B', 'D'])],
    ['id1', 'items1'])

df2 = sql.createDataFrame([
    (1, ['A', 'C']),
    (2, ['D']),
    (3, ['E', 'A', 'B']),
    (4, ['B', 'D', 'C']),
    (5, ['F', 'B']),
    (6, ['G', 'E'])],
    ['id2', 'items2'])
This gives you the dataframes:

+---+---------------+
|id1|         items1|
+---+---------------+
|  0|   [B, C, D, E]|
|  1|      [E, A, C]|
|  2|   [F, A, E, B]|
|  3|      [E, G, A]|
|  4|[A, C, E, B, D]|
+---+---------------+

+---+---------+
|id2|   items2|
+---+---------+
|  1|   [A, C]|
|  2|      [D]|
|  3|[E, A, B]|
|  4|[B, D, C]|
|  5|   [F, B]|
|  6|   [G, E]|
+---+---------+
Now crossJoin the two dataframes, which gives you the Cartesian product of df1 and df2. Then do a groupby on 'items1' and apply a udf to get 'result_array':
get_array_udf = F.udf(lambda x, y: [1.0 if set(z).issubset(set(x)) else 0.0 for z in y],
                      ArrayType(FloatType()))
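The crossJoin and groupby steps themselves are only described in prose above; a minimal sketch of the full pipeline, assuming the imports, df1, df2 and get_array_udf from the snippets above (the alias items2_list is my own choice of name), could look like this:

# Cartesian product of df1 and df2; order df2 so the vector positions follow id2
joined = df1.crossJoin(df2.orderBy('id2'))

# regroup into one row per original df1 row, collecting all items2 groups
grouped = joined.groupby('id1', 'items1') \
    .agg(F.collect_list('items2').alias('items2_list'))

result = grouped.withColumn('result_array',
                            get_array_udf('items1', 'items2_list')) \
    .select('id1', 'items1', 'result_array')
result.show(truncate=False)

Note that collect_list does not strictly guarantee the order of the collected groups after a shuffle, so for a more robust version you could collect (id2, items2) pairs and sort them inside the udf.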
This will give you output like the following:

+---+---------------+-------------------------+
|id1|         items1|             result_array|
+---+---------------+-------------------------+
|  1|      [E, A, C]|[1.0,0.0,0.0,0.0,0.0,0.0]|
|  0|   [B, C, D, E]|[0.0,1.0,0.0,1.0,0.0,0.0]|
|  4|[A, C, E, B, D]|[1.0,1.0,1.0,1.0,0.0,0.0]|
|  3|      [E, G, A]|[0.0,0.0,0.0,0.0,0.0,1.0]|
|  2|   [F, A, E, B]|[0.0,0.0,1.0,0.0,1.0,0.0]|
+---+---------------+-------------------------+

Thank you very much for your help. This is an excellent solution, so efficient and simple.