Create an indicator array based on another dataframe's column values in PySpark
I have two dataframes, df1:
+---+-----------------+
|id1| items1|
+---+-----------------+
| 0| [B, C, D, E]|
| 1| [E, A, C]|
| 2| [F, A, E, B]|
| 3| [E, G, A]|
| 4| [A, C, E, B, D]|
+---+-----------------+
and df2:
+---+-----------------+
|id2| items2|
+---+-----------------+
|001| [A, C]|
|002| [D]|
|003| [E, A, B]|
|004| [B, D, C]|
|005| [F, B]|
|006| [G, E]|
+---+-----------------+
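I want to create an indicator vector (in a new column result_array in df1) based on the values in items2. The vector should have the same length as the number of rows in df2 (six elements in this example). Element i should be 1.0 if the row's items1 contains every element of the i-th row of items2, and 0.0 otherwise. The result should look like this: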
+---+-----------------+-------------------------+
|id1| items1| result_array|
+---+-----------------+-------------------------+
| 0| [B, C, D, E]|[0.0,1.0,0.0,1.0,0.0,0.0]|
| 1| [E, A, C]|[1.0,0.0,0.0,0.0,0.0,0.0]|
| 2| [F, A, E, B]|[0.0,0.0,1.0,0.0,1.0,0.0]|
| 3| [E, G, A]|[0.0,0.0,0.0,0.0,0.0,1.0]|
| 4| [A, C, E, B, D]|[1.0,1.0,1.0,1.0,0.0,0.0]|
+---+-----------------+-------------------------+
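For example, in row 0 the second value is 1.0 because [D] is a subset of [B, C, D, E], and the fourth value is 1.0 because [B, D, C] is a subset of [B, C, D, E]. None of the other item groups in df2 is a subset of [B, C, D, E], so their indicator values are 0.0.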
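The subset rule described above can be checked in plain Python before moving to Spark. This is only a minimal sketch: the sample data is copied from the two tables in the question, while the names `df1_rows`, `df2_groups`, and `indicator_vector` are made up for illustration.

```python
# Sample data copied from the df1 and df2 tables in the question.
df1_rows = {
    0: ["B", "C", "D", "E"],
    1: ["E", "A", "C"],
    2: ["F", "A", "E", "B"],
    3: ["E", "G", "A"],
    4: ["A", "C", "E", "B", "D"],
}
df2_groups = [
    ["A", "C"],
    ["D"],
    ["E", "A", "B"],
    ["B", "D", "C"],
    ["F", "B"],
    ["G", "E"],
]


def indicator_vector(items1, groups):
    """Return 1.0 for each group fully contained in items1, else 0.0."""
    have = set(items1)
    return [1.0 if set(group).issubset(have) else 0.0 for group in groups]


# One indicator vector per df1 row, in the row order of df2.
vectors = {id1: indicator_vector(items, df2_groups) for id1, items in df1_rows.items()}
for id1, vec in vectors.items():
    print(id1, vec)
```

The printed vectors should match the result_array column in the expected table above.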
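I tried using collect() to build a list of all item groups in items2 and then applying a UDF, but my data is too large (more than 10 million rows).
You can proceed as follows: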
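import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# `sql` is assumed to be a SparkSession (an existing SQLContext works the same way).
sql = SparkSession.builder.getOrCreate()

df1 = sql.createDataFrame([
    (0, ['B', 'C', 'D', 'E']),
    (1, ['E', 'A', 'C']),
    (2, ['F', 'A', 'E', 'B']),
    (3, ['E', 'G', 'A']),
    (4, ['A', 'C', 'E', 'B', 'D'])],
    ['id1', 'items1'])

# Integer ids are written without leading zeros (001 is a syntax error in Python 3).
df2 = sql.createDataFrame([
    (1, ['A', 'C']),
    (2, ['D']),
    (3, ['E', 'A', 'B']),
    (4, ['B', 'D', 'C']),
    (5, ['F', 'B']),
    (6, ['G', 'E'])],
    ['id2', 'items2'])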
This gives you the dataframes:
+---+---------------+
|id1|items1         |
+---+---------------+
|0  |[B, C, D, E]   |
|1  |[E, A, C]      |
|2  |[F, A, E, B]   |
|3  |[E, G, A]      |
|4  |[A, C, E, B, D]|
+---+---------------+
+---+---------+
|id2|items2   |
+---+---------+
|1  |[A, C]   |
|2  |[D]      |
|3  |[E, A, B]|
|4  |[B, D, C]|
|5  |[F, B]   |
|6  |[G, E]   |
+---+---------+
Now crossJoin the two dataframes, which gives the Cartesian product of df1 with df2. Then group by 'items1' and apply a udf to get 'result_array':
get_array_udf = F.udf(lambda x, y: [1.0 if set(z) ...
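The UDF line above is cut off in the source, so the following is only a sketch of how the described steps (crossJoin, group, apply a UDF) might fit together, not the answer's exact code. The names `subset_flags` and `add_result_array` are hypothetical. It is also an assumption that the comparison is a non-strict subset test (`<=`), that grouping is done on both id1 and items1, and that the collected groups should be ordered by id2; `collect_list` gives no ordering guarantee on its own, so id2 is carried along and sorted on.

```python
def subset_flags(items1, groups):
    """Return 1.0 where a df2 group is fully contained in items1, else 0.0."""
    have = set(items1)
    return [1.0 if set(g) <= have else 0.0 for g in groups]


def add_result_array(df1, df2):
    """Cross join df1 with df2 and attach the indicator vector as result_array."""
    # Imported here so subset_flags stays usable without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DoubleType

    flags_udf = F.udf(subset_flags, ArrayType(DoubleType()))
    return (
        df1.crossJoin(df2)
        .groupBy("id1", "items1")
        # Collect (id2, items2) pairs and sort them so positions follow id2.
        .agg(F.sort_array(F.collect_list(F.struct("id2", "items2"))).alias("pairs"))
        # "pairs.items2" extracts the items2 field from every struct in the array.
        .withColumn("result_array", flags_udf("items1", F.col("pairs.items2")))
        .drop("pairs")
    )
```

Calling `add_result_array(df1, df2).show(truncate=False)` on the sample dataframes would then print the table. Note that the cross join materialises one row per (df1 row, df2 row) pair before grouping, so it is worth checking the size of df2 before running this on the full data.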
This gives you output like the following:
+---+---------------+------------------------------+
|id1|items1         |result_array                  |
+---+---------------+------------------------------+
|1  |[E, A, C]      |[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]|
|0  |[B, C, D, E]   |[0.0, 1.0, 0.0, 1.0, 0.0, 0.0]|
|4  |[A, C, E, B, D]|[1.0, 1.0, 1.0, 1.0, 0.0, 0.0]|
|3  |[E, G, A]      |[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]|
|2  |[F, A, E, B]   |[0.0, 0.0, 1.0, 0.0, 1.0, 0.0]|
+---+---------------+------------------------------+
Thank you very much for your help. This is an excellent solution, efficient and simple.